Causal Graph Based Decomposition of Factored MDPs. Jonsson, Barto. JMLR 2006

Nice, but complex and long. Definitely need at least one more read to really get everything here.

  1. Algorithm called variable influence structure analysis (VISA), which performs hierarchical decomposition of factored MDPs
  2. Hierarchically develops actions that cause components of the state to change
  3. They call options activities
  4. Also mentions Hierarchical Abstract Machines (HAMs), options, and MAXQ
  5. A treasure trove of references in here, worth mining.
  6. The HEXQ algorithm is based on a similar concept of getting particular components of the state vector to change, but doesn’t use a DBN to identify preconditions (works with less information, therefore probably less efficient)
    1. HEXQ also works on factored MDPs though
    2. Uses something called a frequency-of-change heuristic?
    3. Does not always recreate the correct dependencies
  7. If two state components influence each other, it is not possible to isolate changes to just one of them (this corresponds to cycles in the causal graph). Therefore VISA identifies strongly connected components and treats all nodes in each component as a single node
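The cycle-collapsing step in note 7 can be sketched concretely: VISA treats each strongly connected component of the causal graph as one node. A minimal sketch using Tarjan's algorithm (the toy graph below is my own, not the paper's coffee-robot DBN):

```python
# Sketch of VISA's first step: collapse cycles in the causal graph by
# finding strongly connected components (Tarjan's algorithm).

def tarjan_scc(graph):
    """Return the strongly connected components of a directed graph
    given as {node: [successors]}."""
    index, low, on_stack = {}, {}, set()
    stack, sccs, counter = [], [], [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(frozenset(comp))

    for v in graph:
        if v not in index:
            visit(v)
    return sccs

# Two variables that influence each other ("x" and "y") end up in one
# component and are treated as a single node by VISA.
g = {"x": ["y"], "y": ["x", "z"], "z": []}
print(sorted(tarjan_scc(g), key=len))
```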
  8. After collapsing strongly connected components, it identifies activities (options) that cause a particular state feature to change
    1. At the top level the activity is the MDP itself
  9. Not yet clear if it takes in a DBN or if it builds one
  10. Why are they talking about stochastic policies?  They aren’t necessary and just complicate the math
  11. Their example is the coffee robot
  12. Non-synchronic DBN
  13. Mentions SMDP Q-Learning for learning with options.  (I hope this isn’t the method they use for the empirical results. – It is.)
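For reference, the SMDP Q-learning update used with options looks like this (my own sketch; the option names and constants are illustrative, not from the paper):

```python
# SMDP Q-learning: after executing option o from state s for k steps,
# receiving accumulated (within-option discounted) return r, and landing
# in s', the update is
#   Q(s,o) += alpha * (r + gamma**k * max_o' Q(s',o') - Q(s,o)).

from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.1
Q = defaultdict(float)          # keyed by (state, option)

def smdp_q_update(s, o, r, k, s_next, options):
    """r is the reward accumulated over the k steps option o ran."""
    best_next = max(Q[(s_next, o2)] for o2 in options)
    Q[(s, o)] += ALPHA * (r + GAMMA**k * best_next - Q[(s, o)])

options = ["goto-coffee", "deliver"]
smdp_q_update("start", "goto-coffee", r=1.0, k=3, s_next="has-coffee",
              options=options)
print(Q[("start", "goto-coffee")])   # → 0.1 (all Q start at zero)
```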
  14. Partitioning of state space is defined such that transitions and rewards are the same within the partition
    1. Maybe not in terms of the second part – they claim the partitioning doesn’t maintain the reward of the original MDP
    2. Only has to respect reward in terms of the options themselves
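My reading of the partition condition in note 14, as a sketch: states in the same block must agree on reward and on the distribution over blocks they transition to, for every action. The toy MDP below is made up for illustration:

```python
# Check whether a partition respects the model: same-block states agree
# on reward and on block-transition distributions for every action.

def respects_model(blocks, P, R, actions):
    """blocks: list of sets of states; P[(s, a)] = {s': prob};
    R[(s, a)] = reward."""
    def block_of(s):
        return next(i for i, b in enumerate(blocks) if s in b)

    def block_dist(s, a):
        d = {}
        for s2, p in P[(s, a)].items():
            d[block_of(s2)] = d.get(block_of(s2), 0.0) + p
        return d

    for b in blocks:
        states = sorted(b)
        ref = states[0]
        for s in states[1:]:
            for a in actions:
                if R[(s, a)] != R[(ref, a)]:
                    return False
                if block_dist(s, a) != block_dist(ref, a):
                    return False
    return True

# s0 and s1 behave identically, so grouping them is consistent.
P = {("s0", "a"): {"s2": 1.0}, ("s1", "a"): {"s2": 1.0},
     ("s2", "a"): {"s2": 1.0}}
R = {("s0", "a"): 0.0, ("s1", "a"): 0.0, ("s2", "a"): 1.0}
print(respects_model([{"s0", "s1"}, {"s2"}], P, R, ["a"]))  # → True
```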
  15. VISA takes in a DBN
  16. For each connected component, identifies exits, which cause change in one feature involved in that component
    1. Using the DBN, it focuses only on features that can be involved in those exits
  17. After identifying exits some simplifications may be performed depending on effects of various exits
  18. An exit option is an exit followed by an action.  Not yet sure why this is important
  19. Constructs a tree to determine initialization states for each option
    1. Other ways to determine reachability exist as well, although they argue for their proposal because most of the work is already done in other parts of the algorithm
  20. Somewhat clearly, options terminate when exit is reached
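As I currently read notes 16–20, an exit pairs a context (a partial assignment of state variables) with an action, and the option terminates once the context holds. A toy sketch of that structure (all names here are mine, not the paper's):

```python
# Sketch of the exit / exit-option structure: the option drives the
# system into the exit's context; the exit's action is what actually
# changes the target state variable.

from dataclasses import dataclass

@dataclass(frozen=True)
class Exit:
    context: tuple       # e.g. (("at-coffee-shop", True),)
    action: str          # action executed once the context holds

def exit_reached(state: dict, ex: Exit) -> bool:
    """Option termination test (note 20): stop once the context holds."""
    return all(state.get(var) == val for var, val in ex.context)

e = Exit(context=(("at-coffee-shop", True),), action="buy-coffee")
print(exit_reached({"at-coffee-shop": True, "wet": False}, e))  # → True
```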
  21. The algorithm doesn’t directly know how to compute a policy for each option, but includes in its reasoning only the variable values involved in the context, or those directly related to them in terms of preconditions
  22. Not grokking this completely, but there are two levels of partitioning (Y and Z?): the first maintains optimality, but the second apparently may not? Their argument is that the savings in size (and computation) achieved the second time around are worth the potential loss of optimality
  23. Represents policies as trees
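Note 23's tree-structured policies could look something like this minimal sketch (my own toy encoding, not the paper's representation):

```python
# A policy tree: internal nodes test a state variable, leaves hold an
# action. Lookup walks from the root to a leaf.

def tree_action(tree, state):
    """tree is either ("leaf", action) or
    ("split", var, {value: subtree})."""
    while tree[0] == "split":
        _, var, children = tree
        tree = children[state[var]]
    return tree[1]

policy = ("split", "has-coffee",
          {True:  ("leaf", "deliver-coffee"),
           False: ("split", "at-shop",
                   {True:  ("leaf", "buy-coffee"),
                    False: ("leaf", "go-to-shop")})})

print(tree_action(policy, {"has-coffee": False, "at-shop": True}))
# → buy-coffee
```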
  24. There is an additional option called the task option, which is intended to approximate a solution to the original MDP. It represents a hierarchical solution
    1. The task option is the only reward-dependent component
  25. The task option along with exit options implicitly define a hierarchy of options
  26. VISA may merge some strongly connected components with parent
  27. RL is used to define the policy of each option
  28. If the whole task graph is one strongly connected component, this algorithm will not work
  29. Has recursive optimality as opposed to hierarchical optimality? (If I recall Dietterich’s MAXQ terminology: recursively optimal means each option’s policy is optimal given its children’s fixed policies, while hierarchically optimal means the best of all policies consistent with the hierarchy – recursive optimality is the weaker guarantee)
  30. Works best when one or just a small number of state components can be changed at one time
  31. “State abstraction for the task option is particularly efficient in tasks for which the expected reward depends on only a few state variables.  If most state variables influence reward, learning the task option policy requires almost the same effort as learning a policy over primitive actions.”
  32. The exit option tries to change features by applying as few actions as possible, which is not necessarily optimal (may not be too hard to make it minimize cost though)
  33. Compared to two other algorithms that assume access to a DBN (SPUDD and symbolic RTDP).
    1. SPUDD is like policy iteration but more efficient due to knowledge from DBN
    2. sRTDP uses SPUDD as a lower-level function
  34. Unlike VISA and sRTDP, SPUDD is a pure planning algorithm (haven’t figured out how VISA isn’t one as well – perhaps option policies are learned online?)
  35. Tested on the coffee task, the taxi task, a factory task, and an autonomous guided vehicle task
  36. I suppose SPUDD preserves optimality? The paper says VISA generally produces much smaller representations of problems than SPUDD, partially because it gives up on optimality. In many cases the representation is less than half the size
    1. Wonder if the algorithm could be run with just the mapping from X->Y and without Y->Z (one way it gives up on optimality is in this latter mapping, which makes it more compact)
  37. The representations are smaller and computation is much faster
  38. SPUDD has to enumerate all states while VISA doesn’t?
    1. Computational cost is polynomial in the size of the “conditional probability trees of the DBN model” – what’s that, specifically?
  39. “Each stand-alone task only distinguishes among values of a subset of the state variables, which means that complexity of learning does not necessarily increase with the number of state variables.”
  40. Ah, I see – VISA doesn’t leverage the transition probabilities and uses Q-learning. Leveraging the transition probabilities would allow other, more powerful planners to be used.
  41. Then they move onto how to construct option models without enumerating all states
  42. Talk about how to create partitions that maintain reward and transition distributions
  43. Some aspects of SPUDD can be borrowed in VISA (solving the reduced MDP)
  44. They also introduce an estimate of an option model that is known to be incorrect in certain ways, but again they argue that the computational savings achieved are worth the loss in accuracy
  45. A number of extensions to the multi-time model, which I’m not sure are relevant at the moment
  46. More related work at end, relevant
