- Presents HI-MAT (Hierarchy Induction via Models And Trajectories)
- Discovers MAXQ hierarchies by applying DBN models to successful trajectories
- “HI-MAT discovers subtasks by analyzing the causal and temporal relationships among the actions in the trajectory.”
- Is safe (under assumptions) and compact
- Automatically discovered decompositions are comparable to hand-made ones
- MAXQ develops a task hierarchy with relevant task variables for representing the value of the overall task (creating value functions for each subtask). Learning the values of each component in the decomposition is simpler than learning the global value function.
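For context (from Dietterich's MAXQ paper, not spelled out in these notes): the decomposition is Q(i,s,a) = V(a,s) + C(i,s,a), where C is the completion value accumulated after child a finishes. A minimal sketch of the recursion, with a made-up two-child hierarchy and made-up numbers:

```python
def maxq_value(task, s, children, completion, reward):
    """V(i,s): a primitive task's value is its one-step reward; a composite
    task's value is max over children a of Q(i,s,a) = V(a,s) + C(i,s,a)."""
    if not children.get(task):  # primitive action
        return reward[(task, s)]
    return max(maxq_value(a, s, children, completion, reward) + completion[(task, s, a)]
               for a in children[task])

# made-up hierarchy and numbers, just to exercise the recursion
children = {"root": ["left", "right"], "left": [], "right": []}
reward = {("left", 0): 1.0, ("right", 0): 2.0}
completion = {("root", 0, "left"): 0.5, ("root", 0, "right"): -1.0}
print(maxq_value("root", 0, children, completion, reward))  # max(1.0+0.5, 2.0-1.0) = 1.5
```

The point of the decomposition is exactly the note above: each C and V table ranges over only its subtask's relevant variables, so each is much smaller than the global value function.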
- Hierarchical methods are held up as good for transfer
- “In this paper, we focus on the asymmetric knowledge transfer setting where we are given access to solved source RL problems. The objective is to derive useful biases from these solutions that could speed up learning in target problems.”
- Given a DBN and a trajectory, creates a *causally annotated trajectory* (CAT)
- Based on the CAT, defines MAXQ subtasks
- Using successful trajectories ends up working better than just using DBNs
- <Although this probably undermines the case for transfer made earlier, as the criterion for success is something that can change when considering transfer>
- Although the paper then says it also helps transfer

- In the MAXQ framework, we have state and action sets, along with a task hierarchy H, which is a DAG representing the task
- Leaf nodes in H correspond to primitive actions

- Each subtask T in H is a <X,S,G,C> tuple where:
- X is the relevant state features
- S is the admissible set of states
- G defines termination/goal
- C defines the child tasks of this node

- T can be invoked any time the state is in S, and terminates when G is satisfied
- The “local policy” for T is a mapping from states in the task to children
- The global (hierarchical) policy assigns a local policy to every T
- The hierarchically optimal policy for a MAXQ graph is the hierarchical policy that has best expected reward
- A hierarchical policy is recursively optimal if the local policy in each T is optimal given that all its children are also recursively optimal
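The <X,S,G,C> tuple above can be sketched as a small data structure (the field names and the toy `goto_5` task are my own, not from the paper):

```python
from dataclasses import dataclass, field
from typing import Callable, FrozenSet, List

@dataclass
class Subtask:
    """One MAXQ task node: the paper's <X, S, G, C> tuple."""
    name: str
    relevant_vars: FrozenSet[str]       # X: state features this task's value function may see
    admissible: Callable[[dict], bool]  # S: may the task be invoked in this state?
    goal: Callable[[dict], bool]        # G: termination/goal predicate
    children: List["Subtask"] = field(default_factory=list)  # C: child tasks (leaves = primitives)

    def can_invoke(self, state: dict) -> bool:
        # T can be invoked whenever the state is admissible and G is not yet satisfied
        return self.admissible(state) and not self.goal(state)

# toy "move to position 5" subtask over a single 1-D feature
goto = Subtask(
    name="goto_5",
    relevant_vars=frozenset({"pos"}),
    admissible=lambda s: 0 <= s["pos"] <= 10,
    goal=lambda s: s["pos"] == 5,
)
print(goto.can_invoke({"pos": 0}))  # True: admissible, goal not yet met
print(goto.can_invoke({"pos": 5}))  # False: goal already satisfied
```

A local policy would then be a map from states to elements of `children`; leaves carry a single primitive action instead.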
- Mentions HEXQ and VISA
- Based on changing values of state variables

- HEXQ uses a heuristic that orders state variables <features?> based on how often a change in their value leads to a subtask being completed
- VISA uses DBNs to analyze the influence of state variables on each other, and is more principled than HEXQ
- Variables are partitioned so there is acyclic influence between the variables in different clusters (strongly connected components)
- “Here, state variables that influence others are associated with lower-level subtasks.”
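VISA's partition into clusters with acyclic influence between them is exactly an SCC condensation of the variable-influence graph. A sketch using Kosaraju's algorithm; the `influence` graph is a made-up taxi-flavoured toy, not from the paper:

```python
from collections import defaultdict

def sccs(graph):
    """Kosaraju's algorithm: label each node with its strongly connected component."""
    order, seen = [], set()

    def dfs1(u):
        seen.add(u)
        for v in graph.get(u, ()):
            if v not in seen:
                dfs1(v)
        order.append(u)

    for u in graph:
        if u not in seen:
            dfs1(u)

    rev = defaultdict(list)  # reversed edges
    for u, vs in graph.items():
        for v in vs:
            rev[v].append(u)

    comp, assigned = {}, set()

    def dfs2(u, label):
        assigned.add(u)
        comp[u] = label
        for v in rev[u]:
            if v not in assigned:
                dfs2(v, label)

    label = 0
    for u in reversed(order):
        if u not in assigned:
            dfs2(u, label)
            label += 1
    return comp

# toy influence graph: edge u -> v means "u influences v"
influence = {
    "taxi_x": ["taxi_y", "in_taxi"],
    "taxi_y": ["taxi_x", "in_taxi"],
    "in_taxi": ["at_dest"],
    "at_dest": [],
}
comp = sccs(influence)
print(comp["taxi_x"] == comp["taxi_y"])   # True: mutual influence -> same cluster
print(comp["in_taxi"] != comp["at_dest"]) # True: one-way influence -> separate clusters
```

By construction, influence between distinct components is acyclic, which is the property VISA exploits to stack subtasks.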

- The difference between VISA and HI-MAT (this paper's algorithm) is that HI-MAT uses successful-trajectory information in addition to DBNs to construct hierarchies
- Empirical evidence that HI-MAT builds hierarchies that are exponentially more compact than VISA's
- <I am confident that you could construct domains where they would work out the same, but the claim that in practice successful trajectories help makes a ton of sense as the algorithm can focus only on the parts of the entire exponentially sized state space that help solve the problem>

- An earlier paper by the author tries to generate hierarchies from trajectories, but without the DBN.
- Constrain consideration to stochastic shortest-path problems (“a known conjunctive goal”)
- Assume the DBN represents conditional probabilities as trees <is this the normal representation?>
- Partitioning is done recursively working backwards from the goal
- Konidaris also does this, but in continuous spaces
- Says they are given “**a** [my emphasis] successful trajectory that reaches the goal in the source MDP.” <Really just one? I imagine if that is the case they will generate more trajectories themselves, especially because the domain is stochastic>

- “With this in hand, our objective is to automatically induce a MAXQ hierarchy that can suitably constrain the policy space when solving a related target problem, and therefore achieve faster convergence in the target problem. This is achieved via recursive partitioning of the given trajectory into subtasks using a top-down parse guided by backward chaining from the goal. We use the DBNs along with the trajectory to define the termination predicate, the set of subtasks, and the relevant abstraction for each MAXQ subtask.”
- Empirical results on Taxi
- A variable <feature> is *relevant* to an action *a* if R or T either check or change that variable; otherwise it is declared *irrelevant*
- *Trajectory-relevant* variables are those that are checked or changed during the entire trajectory
- A causal edge for variable v goes from action a to action b (a -v-> b, where b follows a in the trajectory) iff v is trajectory-relevant to both a and b and irrelevant to all actions in between
- <I’m only reading about actions, where does state come into play? Or is it wrapped into their definition of action somehow>

- Special actions END and START allow for causal edges from the start and to the terminal action <state?>
- A causally annotated trajectory (CAT) “is the original trajectory annotated with all the causal, source, and sink edges.”
- Has cycles removed
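The causal-edge construction behind the CAT can be sketched directly: for each variable, walk the trajectory and link consecutive steps at which it is relevant (resetting at each relevant step automatically enforces "irrelevant to everything in between"). The relevance sets below are my own toy, not the paper's Taxi DBNs, and the START/END bookkeeping is omitted:

```python
def causal_edges(relevance):
    """Causal edges of a CAT: a v-labelled edge i -> j whenever variable v is
    relevant to the actions at steps i and j but to no action strictly between."""
    edges = []
    for v in sorted(set().union(*relevance)):
        last = None  # most recent step to which v was relevant
        for i, vs in enumerate(relevance):
            if v in vs:
                if last is not None:
                    edges.append((last, v, i))
                last = i
    return edges

# toy trajectory with a made-up per-step relevance map
traj = ["pickup", "move", "move", "dropoff"]
rel = [{"passenger"}, {"pos"}, {"pos"}, {"passenger", "pos"}]
print(causal_edges(rel))  # [(0, 'passenger', 3), (1, 'pos', 2), (2, 'pos', 3)]
```

Note how the `passenger` edge skips over both `move` steps, since `passenger` is irrelevant to them: that long edge is what lets HI-MAT carve out a subtask spanning the moves.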

- Define the *literal on the causal edge* as the value v has in a -v-> b before b is executed
- Then compute something they call the DBN-closure. Not grokking how this works 100% <partially because they seem to have unorthodox naming conventions here – also, when they say variable, do they mean its value or the feature?>, but: for a feature, collect the variables that influence it, repeating until no new variables can be added to the set. Do the same thing treating the reward and the goal as features as well
- DBN-closure(GOAL) is the set of all features that influence the goal
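As I understand it, the closure is just a fixpoint over the influence relation. A sketch, with a hypothetical `parents` map (parents[v] = the variables that influence v), not taken from the paper:

```python
def dbn_closure(parents, targets):
    """Fixpoint sketch of the DBN-closure: start from `targets` and keep
    adding every variable that influences something already in the set."""
    closure = set(targets)
    changed = True
    while changed:
        changed = False
        for v in list(closure):
            for p in parents.get(v, ()):
                if p not in closure:
                    closure.add(p)
                    changed = True
    return closure

# hypothetical influence structure
parents = {
    "GOAL": {"at_dest"},
    "at_dest": {"in_taxi", "taxi_pos"},
    "in_taxi": {"taxi_pos", "passenger_loc"},
}
print(sorted(dbn_closure(parents, {"GOAL"})))
# ['GOAL', 'at_dest', 'in_taxi', 'passenger_loc', 'taxi_pos']
```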

- <Mostly skipping the exact description of the algorithm>
- For any goal that isn’t solved, find the corresponding subtask and where it exists in the CAT
- Then the literals on the causal edges that enter it are added to the unsolved goals <but when is the CAT segmented again? It must be before this.>
- <The partition is discussed at the end of the description, but it seems like it must happen first?>
- For every subtask in the partition, it has necessary fields defined for it in the tree H (children, termination condition, relevant features)
- A little bit on how primitive actions may be added to leaves even if they didn't appear in the trajectory <It really seems like the algorithm just makes do with one trajectory – how does this get reasonable coverage, especially in the stochastic domains considered in the paper?>
- A DBN is called *maximally sparse* if removing any relationship from it would change the resulting distributions (if unnecessary features were in the DBN, they could be removed without altering the model's predictions)
- Primitive subtasks are just 1 action long <I guess I should have anticipated that>
- A trajectory is called *non-redundant* if no subsequence of actions in it can be removed with the remaining sequence still reaching the goal <I guess this is the removal of cycles? Still – this is a bit strange in the stochastic setting>
- If a trajectory is non-redundant, HI-MAT will produce a task hierarchy that is consistent with the trajectory <skipping the proof>
- A hierarchy H is *safe* wrt DBN models M if the state variables capture the value of any trajectory consistent with the sub-hierarchy rooted at that task node <I think basically this says that X, the features defined for a subtask, are sufficient to predict the expected reward of the trajectory? Not the easiest paper to read, perhaps because of space restrictions>
- Assuming totally reasonable things: “If the DBN models are maximally sparse then the maximum size of the value function table for any task in the hierarchy produced by HI-MAT is the smallest over all safe hierarchies which are consistent with the trajectory.”
- For a trajectory of length L, the number of subtasks is bounded by 2L <why? presumably because the recursive parse forms a tree whose leaves are the L primitive actions, and a tree with L leaves has at most 2L − 1 nodes>. The value function can therefore be represented within an O(L) factor of optimal, as opposed to the size exponential in the dimension that occurs with a flat representation.
- “Our analysis does not address state abstractions arising from the so-called funnel property of subtasks where many starting states result in a few terminal states. Funnel abstractions permit the parent task to ignore variables that, while relevant inside the child task, do not affect the terminal state.”
- Moving to empirical results
- Bitflip domain
- VISA has an exponentially sized hierarchy even after merging (and learns quite slowly)
- <Just to mention, HI-MAT starts with a successful trajectory and VISA does not (which is a big difference, I think Tom shows that a successful trace is the difference between tractable and intractable learning in factored domains), so keep in mind they aren’t really at parity>
- Ah: “This domain has been engineered to highlight the case when access to a successful trajectory allows for significantly more compact hierarchies than without. We expect that access to a solved instance will usually improve the compactness of the resulting hierarchy.”

- Then results with transfer, Taxi and Wargus <they got it working!?>
- In some algorithms there is “negative transfer” as the old policy actually makes it harder to learn the new
- In Wargus, HI-MAT is able to converge faster than the manually constructed hierarchy as it is able to find a more efficient representation
- Doesn’t work for disjunctive goals?