Discovering Hierarchy in Reinforcement Learning with HEXQ. Hengst. ICML 2002.

  1. HEXQ attempts to decompose and solve factored MDPs in a model-free manner
  2. Doesn’t deal with solving the decomposed MDP, just doing the decomposition itself
  3. Assumes:
    1. Some elements of the feature vector change less frequently than others
    2. The variables that change more often keep transition properties independently of more slowly changing variables
    3. “Interface between regions can be controlled.  For example, if a robot navigates around four equally sized rooms with interconnecting doorways (…) the state space can be represented by the two variables, room-identifier and position-in-room.  Most representations naturally label repeated sub-structures in this way.” <Seems to mean partitioning the feature vector along any particular dimension solely causes reasonable partitions of the state space>
  4. If these assumptions don’t hold, HEXQ will worst-case simply solve the flat problem
  5. Creates a hierarchy, with the maximum number of levels being the state dimension.  The bottom (level 1) is the variables that change most frequently
    1. “The rationale is that sub-tasks that are used most often appear at the lower levels and need to be learnt first.”
  6. Only the first level (fastest) interacts with the environment via primitive actions
  7. Start by observing the state variable that changes most frequently. “We now partition the states represented by the values of this variable into Markov regions.”  <I take this simply to mean regions of the flat state space corresponding to each possible assignment of that feature to a valid value, although not sure>
  8. “The boundaries between regions are identified by ‘unpredictable’ (see subsection 4.2) transitions which we call region exits.  We then define sub-MDPs over these regions and learn separate policies to leave each region via its various exits.”
  9. Regions are then combined with the next most frequently changing feature to make more abstract states at the next level in the hierarchy
    1. The exit policies then become abstract actions at that next level
    2. This results in a semi-MDP with one less feature in the state dimension and only abstract actions (exits, not primitive)
  10. The top-level has a sub-MDP that is solved by recursively calling lower-level MDP policies (exits) as its actions
  11. The feature ordering heuristic simply involves running a random trajectory and finding the frequency at which each feature changes
  12. Tries to model transitions as a directed graph <DBN?>, and random trajectories are taken.  “Following a set period of exploration in this manner, transitions that are unpredictable (called exits) are eliminated from the graph.”
  13. Transitions are unpredictable if:
    1. T or R is not stationary
    2. Some other variable changes value
    3. The terminal goal state is reached
  14. Entry states are reached after taking an exit (which is a state action pair <s^e, a>, where e is the level of the state)
  15. <Didn’t finish reading>

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: