A Causal Approach to Hierarchical Decomposition in Reinforcement Learning. Anders Jonsson. Dissertation 2006.

  1. Considers planning in domains with factored state
  2. Goal is to hierarchical learning with do state abstraction for each subtask
  3. Because function approximation is problematic, instead focuses on hierarchical decomposition (temporal abstraction) and state abstraction
  4. Three models of “activities” (temporally extended actions) in RL
    1. Hierarchical Abstract Machines / HAMs (Parr, Russell)
    2. Options (Sutton et al)
    3. MAXQ (Dietterich)
  5. Starts with description of the H-Tree algorithm, which is based on McCallum’s U-tree algorithm (which isn’t designed for options, and was instead designed for POMDPs)
    1. Results show that option-specific state abstraction makes learning faster
  6. Next is VISA, which uses causal relationships (DBNs) to find hierarchical decomposition
    1. VISA uses the DBN to create a causal graph that describes how to get features to change
    2. “The VISA algorithm is useful in tasks for which the values of key state variables change relatively infrequently”
  7. HEXQ determines causal relationships by using the heuristic of how often state variables change
  8. It identifies “exits”, which are <s,a> pairs that cause the values of state variables to change.
  9. VISA tries to identify exits as well, but because it is working with more information (stronger assumptions) it is able to make more accurate and effective decomposition than HEXQ
  10. “The VISA algorithm exploits sparse conditional dependence between state v ariables such that the causal graph contains two or more strongly connected components. In addition, the algorithm is more efficient when changes in the values of state variables occur relatively infrequently . How likely is a realistic task to display this type of structure? It is impossible for us to determine the percentage of tasks in which the aforementioned structure is present. However, our intuition tells us that it is not uncommon to encounter tasks that display this structure.”
  11. In the DBN model, there is one DBN for each action
    1. <I would put the action in the same DBN as an additional input; this is more general (works for continuous actions for one) and is more compact in cases where action selection is highly (or completely) constrained>
  12. Its possible to learn DBNs from trajectory data, from an episodic generative model
    1. Their algorithm uses the Bayesian Information Criterion 
  13. Each of the four algorithms can be viewed as part of a larger system, which would:
    1. Learn a DBN
    2. Do hierarchical decomposition from the DBN
    3. Do (and then refine) state abstraction for each option
    4. Construct a compact representation of each option
    5. Learn the policy of each option, and the task option <the high-level policy?>
  14. The first step has the dominating computational cost
  15. Algorithms aren’t presented in this order, but in order they were developed
  16. The U-Tree algorithm is for POMDPs, keeps track of (a finite number) of past observations, and attempts to use that history to predict the next observation
  17. It then treats each leaf in the tree as a state in the MDP, and makes predictions about the probability of transitioning from leaf to leaf, and does Q estimates on leaf-action pairs
  18. Like normal decision trees, starts out as a stump and expands the tree when the data warrants it (with a Kolmogorov-Smirnov test)
  19. Now onto the H(ierarchical)-tree
  20. Its for semi-POMDPs
  21. Requires only minor changes, such as keeping track of option durations in the history
  22. Uses SMDP Q-learning over the leaves
  23. A criticism of U-tree is that you don’t know how to choose the length of the window for the history
    1. In H-Tree, this is less of a problem because options give much richer information, so in many problems only a couple of steps of history is needed
    2. <This doesn’t really solve the problem, it is perhaps a bit of a mitigation.  It seems elegant to me to do the same K-S test that is done on whether to grow the tree to also determine whether the history should be grown or not.>
  24. Does intra-option learning in the sense that an option that is common to two superoptions with be represented once
  25. Says its for POMDPs, but I’ve never seen it used for POMDPS <ah mentions this in the empirical section with TAXI – because its fully observable it just uses the current state with no other history>
  26. Epsilon-softmax exploration <you can’t do directed exploration with options, cause it just ends up slowing you down, although that proof didn’t exist at the time of this paper>
  27. There is faster learning and convergence to a better result when intra-option learning is used as opposed to leaving intra-option learning out <probably because the change in temperature for exploration goes too fast for learning without options>
  28. On the other hand, a plot with options vs flat shows flat learning takes a little longer to get close to convergence (about twice as long), but its also to a better result.
    1. This is blamed on the incorrect abstraction done by the tree
    2.  Also claim that epsilon-greedy exploration is worse for options <don’t but that – its true that the wrong option will push you far away, but the right option gets you very close to the goal, and that is what really matters.  With flat, you have to get extremely lucky with a series of right choices till you get to the goal>
  29. Admittedly, the algorithm has too many parameters
  30. Now onto VISA, which develops options and state abstraction from a DBN
  31. Builds on ideas from HEX-Q.  Whereas HEXQ uses samples to try and figure out dependencies on how to get a particular feature to change with an “exit” (based on orderings of how frequently features change), VISA uses a DBN to do so
  32. In the coffee task, for example, HEX-Q produces a linear hierarchy that is not representative of the actual dynamics  <I’m not sure if it always does this purely linear relationship or if it can represent general DBNs given enough data>
  33. Visa’s structure cares only about changing the value of features

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: