Should really look for another description of the algorithm if it exists. For some reason I’m finding the paper very unclear on my first read through. Its one of the few papers I’ve read that would be more understandable with more math and less English.
- HEXQ attempts to decompose and solve factored MDPs in a model-free manner
- Doesn’t deal with solving the decomposed MDP, just doing the decomposition itself
- Assumes:
- Some elements of the feature vector change less frequently than others
- The variables that change more often keep transition properties independently of more slowly changing variables
- “Interface between regions can be controlled. For example, if a robot navigates around four equally sized rooms with interconnecting doorways (…) the state space can be represented by the two variables, room-identifier and position-in-room. Most representations naturally label repeated sub-structures in this way.” <Seems to mean partitioning the feature vector along any particular dimension solely causes reasonable partitions of the state space>
- If these assumptions don’t hold, HEXQ will worst-case simply solve the flat problem
- Creates a hierarchy, with the maximum number of levels being the state dimension. The bottom (level 1) is the variables that change most frequently
- “The rationale is that sub-tasks that are used most often appear at the lower levels and need to be learnt first.”
- Only the first level (fastest) interacts with the environment via primitive actions
- Start by observing the state variable that changes most frequently. “We now partition the states represented by the values of this variable into Markov regions.” <I take this simply to mean regions of the flat state space corresponding to each possible assignment of that feature to a valid value, although not sure>
- “The boundaries between regions are identified by ‘unpredictable’ (see subsection 4.2) transitions which we call region exits. We then define sub-MDPs over these regions and learn separate policies to leave each region via its various exits.”
- Regions are then combined with the next most frequently changing feature to make more abstract states at the next level in the hierarchy
- The exit policies then become abstract actions at that next level
- This results in a semi-MDP with one less feature in the state dimension and only abstract actions (exits, not primitive)
- The top-level has a sub-MDP that is solved by recursively calling lower-level MDP policies (exits) as its actions
- The feature ordering heuristic simply involves running a random trajectory and finding the frequency at which each feature changes
- Tries to model transitions as a directed graph <DBN?>, and random trajectories are taken. “Following a set period of exploration in this manner, transitions that are unpredictable (called exits) are eliminated from the graph.”
- Transitions are unpredictable if:
- T or R is not stationary
- Some other variable changes value
- The terminal goal state is reached
- Entry states are reached after taking an exit (which is a state action pair <s^e, a>, where e is the level of the state)
- In Taxi, all states are entries because the agent is reset randomly after the goal state is reached <But doesn’t this criteria not jive with their assumptions that the domains are basically shortest path tasks?>
- An abstract state can only be left by an exit
- Approach:
- Decompose the transition graph into strongly connected components (SCCs)
- The SCCs form a DAG
- SCCs can then be merged together, potentially in a hierarchy – goal is to do as much of this as possible in order to minimize the # of abstract states
- “Following the coalition of SCCs into regions we may still have regions connected by edges from the underlying DAG. We break these by forming additional exits and entries associated with their respective regions and repeat the entire procedure until no additional regions are formed.” <Not exactly sure what this means yet>
- “A region, therefore, is a combination of SCCs such that any exit state in a region can be reached from any entry with probability 1. Regions are generally aliased in the environment.”
- Regions have sub-goals of reaching exits
- “We proceed to construct multiple sub-MDPs one for each unique hierarchical exit state (s1, s2, …, se) in each region. Sub-MDP policies in HEXQ are learnt on-line, but a form of hierarchical dynamic programming could be used directly as the sub-task models have already been uncovered.”
- A region may have multiple exits, and each exit may lead to a separate sub-MDP <But those sub-MDPs are on the same level according to the wording which I don’t understand:> “In the taxi example, the above procedure finds one hierarchical level 1 region as reflected in figure 2. This region has 8 exits… As there are 4 hierarchical exit states we create 4 sub-MDPs at level 1 and solve them.”
- Taking an abstract action at level e is equivalent to executing the policy for the appropriate region at level e-1
- In taxi, level 1 has just one state. Level 2 has 5 states <corresponding to the 4 passenger locations plus the 1 state at level 1?>. The abstract actions involve being at one of the locations where a passenger is and either doing a pickup or a putdown <It seems like relevant aspects of state are ignored though, so it just has to do with pickup or putdown in a particular location on the map, regardless of whether a passenger is in the car or not, or what the destination is? Oh, it seems like information at the next level up encodes whether there is a passenger in the car or not>.
- <The trick to go from deterministic shortest path problems to stochastic shortest path problems feels a bit hacky. Basically, they keep long and short term statistics recorded, and see if they match or not>
- In taxi, HexQ ends up with a representation not much larger than MaxQ, while learns relevant hierarchical information itself
- Exits must work with certainty, how to generalize the algorithm to stochastic case is set for future work
- “The two heuristics employed by HEXQ are (1) variable ordering and (2) finding non-stationary transitions.”
- A problem is that “An independent slowly changing random variable would be sorted to the top level in the hierarchy by this heuristic and the MDP would fail to decompose as it is necessary to explore the variable’s potential impact.”
- HEXQ, like MAXQ is recursively optimal