Reinforcement Learning with Hierarchies of Machines. Parr, Russell. NIPS 1997

  1. Policies considered are constrained by hierarchies of partially specified machines
  2. Is intended to allow for incorporation of prior knowledge and create a smaller search space
  3. Algorithms presented are provably convergent
  4. Uses HAMs or hierarchical abstract machines, which constrain policies that can be selected and whose transitions may invoke lower level machines
  5. A HAM is a program which constrains the set of actions the agent can choose from in each state
    1. In a maze navigation task it may select a fixed action until the end of a hallway is encountered
    2. This machine in turn can call lower level machines to do other tasks like simple obstacle avoidance
  6. Mentioned this architecture is similar to Brooks’ subsumption architechture
  7. At the highest level, this example machine just selects a direction at intersections, but other operation is automatic
    1. A level below this is four machines, each traverse a hallway in a particular direction
    2. Below this, a number of strategies for getting around obstacles while going in that direction exist
    3. Finally are the primitive actions
    4. The machine also uses a richer feature representation that describes near obstacles and hallway intersections
  8. The idea is that a HAM applied to an MDP composes that MDP into a new domain.  Essentially:
    1. The new set of states is the cross product of states in H (the HAM) and M (the MDP)
    2. For each state in HoM where H is an action state (there are two types of states in H), the transition functions of H and M are combined
    3. For each stat in HoM where H is a choice state, actions only change machine state
    4. The rewards are taken only from primitive actions in M
  9. Policies that are optimal in HoM correspond to an optimal policy in M, and there is a way to derive the optimal policy for M from the optimal policy for HoM
  10. The complexity of HoM is the number of choice states in HoM
    1. This will usually be smaller than the number of choices that have to be selected for M
  11. They introduce a variant of QL that learns directly in the reduced space, but doesn’t have to perform the explicit creation of HoM
  12. Keeps track of
    1. Current state in H, as well as M
    2. The last action selected
    3. The states of H and M at the time the last action was selected
    4. The reward (as well as discount) accumulated since the last action selection
    5. The Q-table is extended as a SxMxA (actual states x machine states x actions)
  13. The algorithm converges to optimal
  14. This stuff is orthogonal to state aggregation, which can introduce partial observability issues if not done carefully (domain must also have certain features to allow aggregation to happen safely)
  15. A number of references worth reading, may not be relevant but important to thoroughly check whats out there
  16. Says the options framework is a subset of what is described here

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: