Nice, but complex and long. Definitely need at least 1 more read to really get everything here.

- Algorithm called variable influence structure analysis (VISA), which performs hierarchical decomposition of factored MDPs
- Hierarchically develops actions that cause components of the state to change
- They call options activities
- Also mention Hierarchical Abstract Machines (HAMs), options, and MAXQ
- A treasure trove of references in here, worth mining.
- The HEXQ algorithm is based on a similar concept of getting particular components of the state vector to change, but doesn’t use a DBN to identify preconditions (works with less information, therefore probably less efficient)
- HEXQ also works on factored MDPs though
- Uses something called a frequency of change heuristic?
- Does not always recreate the correct dependencies
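As I understand HEXQ's frequency-of-change heuristic (my reading, not this paper's code), it just counts how often each state variable changes along exploratory trajectories and orders variables by that count; a toy sketch with made-up variable names:

```python
def frequency_of_change(trajectory):
    """Count, per state variable, how often its value changes along a
    trajectory of state dicts; HEXQ orders variables by this count when
    building its hierarchy."""
    counts = {v: 0 for v in trajectory[0]}
    for prev, curr in zip(trajectory, trajectory[1:]):
        for v in counts:
            if prev[v] != curr[v]:
                counts[v] += 1
    return counts

# Taxi-flavored toy trajectory: position changes every step,
# passenger status changes once.
traj = [
    {"taxi_pos": 0, "passenger": "waiting"},
    {"taxi_pos": 1, "passenger": "waiting"},
    {"taxi_pos": 2, "passenger": "in_taxi"},
]
```

Since it only watches values change, this needs no DBN, which fits the note above about HEXQ working with less information.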

- If two state components influence each other, it is not possible to isolate changes to just one of them (this corresponds to cycles in the causal graph). Therefore VISA identifies strongly connected components and treats all nodes in each component as a single node
- After reducing connected components out, it identifies activities (options) that cause a particular state feature to change
- At the top level the activity is the MDP itself
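The SCC-collapse step is standard graph machinery; a minimal sketch, assuming the causal graph comes as adjacency lists over state variables (the coffee-robot-flavored example is mine):

```python
from collections import defaultdict

def strongly_connected_components(graph):
    """Kosaraju's algorithm: two DFS passes over the causal graph.

    `graph` maps each state variable to the variables it influences.
    Mutually dependent variables land in the same component, which
    VISA then treats as a single node.
    """
    # First pass: record finish order on the original graph.
    visited, order = set(), []
    def dfs(v):
        visited.add(v)
        for w in graph.get(v, ()):
            if w not in visited:
                dfs(w)
        order.append(v)
    for v in graph:
        if v not in visited:
            dfs(v)

    # Build the transpose graph (all edges reversed).
    transpose = defaultdict(list)
    for v, ws in graph.items():
        for w in ws:
            transpose[w].append(v)

    # Second pass: DFS on the transpose in reverse finish order.
    assigned, components = set(), []
    def collect(v, comp):
        assigned.add(v)
        comp.append(v)
        for w in transpose[v]:
            if w not in assigned:
                collect(w, comp)
    for v in reversed(order):
        if v not in assigned:
            comp = []
            collect(v, comp)
            components.append(comp)
    return components

# Toy causal graph: location and umbrella influence each other
# (a cycle), so they collapse into one component.
g = {"loc": ["umbrella", "coffee"], "umbrella": ["loc"], "coffee": []}
```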

- Not yet clear if it takes in a DBN or if it builds one
- Why are they talking about stochastic policies? They aren’t necessary and just complicate the math
- Their example is the coffee robot
- Non-synchronic DBN
- Mentions SMDP Q-Learning for learning with options. (I hope this isn’t the method they use for the empirical results. – It is.)
- Partitioning of state space is defined such that transitions and rewards are the same within the partition
- Maybe not in terms of the second part – they claim that the partitioning doesn’t maintain the reward of the original MDP
- Only has to respect reward in terms of the options themselves
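For reference, the SMDP Q-learning backup for options, sketched in my own notation (all names are mine, not the paper's):

```python
from collections import defaultdict

def smdp_q_update(Q, s, o, reward, next_s, k, options, alpha=0.1, gamma=0.9):
    """One SMDP Q-learning backup after option o ran for k steps from s.

    `reward` is the discounted return accumulated while o executed;
    gamma is raised to the option's duration k when bootstrapping.
    """
    best_next = max(Q[(next_s, o2)] for o2 in options)
    target = reward + gamma ** k * best_next
    Q[(s, o)] += alpha * (target - Q[(s, o)])
    return Q[(s, o)]

Q = defaultdict(float)  # option-value table, defaults to 0
```

The gamma-to-the-k factor is the only thing distinguishing this from ordinary Q-learning, which is presumably why it is the default choice for learning over options.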

- VISA takes in a DBN
- For each connected component, identifies *exits*, which cause a change in one feature involved in that component
- Using the DBN, focuses only on features that can be involved in those exits
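A hedged sketch of what exit identification could look like, with the DBN's conditional distribution for one variable flattened into a dict (the paper works over structured representations; the coffee example and all names are mine):

```python
def find_exits(cpd, current_value):
    """Identify exits for one target variable.

    `cpd` maps (context, action) pairs to a distribution over the
    variable's next value; an exit is any pair that moves the variable
    away from `current_value` with positive probability.
    """
    exits = []
    for (context, action), dist in cpd.items():
        if any(p > 0 for val, p in dist.items() if val != current_value):
            exits.append((context, action))
    return exits

# Toy coffee-robot fragment: "has_coffee" flips only when the robot
# buys coffee while at the shop.
cpd = {
    (("loc", "shop"), "buy"): {True: 0.9, False: 0.1},
    (("loc", "office"), "buy"): {False: 1.0},
    (("loc", "shop"), "move"): {False: 1.0},
}
```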

- After identifying exits some simplifications may be performed depending on effects of various exits
- An exit option is an exit followed by an action. Not yet sure why this is important
- Constructs a tree to determine initialization states for each option
- Other ways to determine reachability as well, although they argue to use their proposal because most of the work is done in other parts of the algorithm already
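The exit-option idea (reach the exit's context, then perform the exit's action) written in the usual initiation/policy/termination option form; all names are mine, not the paper's:

```python
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class ExitOption:
    """A VISA-style option in (I, pi, beta) form.

    Available in `initiation` states, follows `policy` until an exit
    state is reached, then performs `exit_action`.
    """
    initiation: Set[str]
    policy: Callable[[str], str]
    exit_states: Set[str]
    exit_action: str

    def terminated(self, state: str) -> bool:
        return state in self.exit_states

    def act(self, state: str) -> str:
        # At the exit, the option performs the exit's own action; this
        # is the sense in which an exit option is an exit followed by
        # an action.
        return self.exit_action if self.terminated(state) else self.policy(state)
```

Usage: an option that navigates to the shop and then buys would be `ExitOption({"s0"}, navigate_policy, {"shop"}, "buy")`, where `navigate_policy` is whatever RL produces for the subtask.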

- Somewhat obviously, options terminate when their exit is reached
- The algorithm doesn’t directly know how to compute a policy for each option, but it restricts its reasoning to the variable values involved in the context, or those directly related to them through preconditions
- Not grokking this completely, but there are two levels of partitioning (Y and Z?): the first maintains optimality, but the second apparently may not? Their argument is that the savings in size (and computation) achieved the second time around are worth the potential loss of optimality
- Represents policies as trees
- There is an additional option called the task option, which is intended to approximate a solution to the original MDP. It represents a hierarchical solution
- The task option is the only reward-dependent component
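The "only variable values involved in the context" point amounts to projecting the full state onto the relevant variables; a tiny sketch (names mine):

```python
def project(state, relevant):
    """Abstract a full state dict down to the variables an option's
    policy actually conditions on; states that agree on these
    variables collapse to the same abstract state."""
    return tuple(sorted((v, state[v]) for v in relevant))

# Two full states that differ only in an irrelevant variable.
s1 = {"loc": "shop", "umbrella": True, "mail": False}
s2 = {"loc": "shop", "umbrella": True, "mail": True}
```

This is where the savings come from: the option's value function is indexed by the projected state, so irrelevant variables never blow up its table.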

- The task option along with exit options implicitly define a hierarchy of options
- VISA may merge some strongly connected components with parent
- RL is used to define the policy of each option
- If the whole causal graph is one strongly connected component, this algorithm will not work
- Has recursive optimality as opposed to hierarchical optimality (recursive: each option’s policy is optimal given the policies of its children, ignoring the calling context; hierarchical: the best policy expressible within the hierarchy as a whole)
- Works best when one or just a small number of state components can be changed at one time
- “State abstraction for the task option is particularly efficient in tasks for which the expected reward depends on only a few state variables. If most state variables influence reward, learning the task option policy requires almost the same effort as learning a policy over primitive actions.”
- The exit option tries to change features by applying as few actions as possible, which is not necessarily optimal (may not be too hard to make it minimize cost though)
- Compared to 2 other algorithms that assume access to DBN (SPUDD and symbolic RTDP).
- SPUDD is like policy iteration but more efficient due to knowledge from DBN
- sRTDP uses SPUDD as a lower-level function
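On the note above that the exit option applies as few actions as possible rather than minimizing cost: the difference is just uniform versus real action costs in the search. A toy contrast (all names and the graph encoding are mine):

```python
import heapq

def dijkstra(graph, start, goal):
    """Least-cost search. `graph[s]` maps an action to a
    (next_state, cost) pair; returns the cheapest total cost."""
    frontier = [(0.0, start)]
    best = {start: 0.0}
    while frontier:
        cost, s = heapq.heappop(frontier)
        if s == goal:
            return cost
        for action, (t, c) in graph.get(s, {}).items():
            if cost + c < best.get(t, float("inf")):
                best[t] = cost + c
                heapq.heappush(frontier, (cost + c, t))
    return float("inf")

def fewest_actions(graph, start, goal):
    """The exit-option objective: every action costs 1, so this
    returns the minimum number of actions to reach the exit."""
    return dijkstra({s: {a: (t, 1.0) for a, t in acts.items()}
                     for s, acts in graph.items()}, start, goal)

# One expensive direct action vs. two cheap ones.
plain = {"s0": {"a": "goal", "b": "mid"}, "mid": {"c": "goal"}}
weighted = {"s0": {"a": ("goal", 10.0), "b": ("mid", 1.0)},
            "mid": {"c": ("goal", 1.0)}}
```

`fewest_actions` happily takes the single expensive action; swapping in real costs, as the note suggests, is indeed a small change.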

- Unlike VISA and sRTDP, SPUDD is a pure planning algorithm (haven’t figured out how VISA isn’t one as well, perhaps option policies are learned online?)
- Tested in coffee, taxi, and a factory task and an autonomous guided vehicle task
- I suppose SPUDD preserves optimality? The paper says VISA generally produces much smaller representations of problems than SPUDD, partially because it gives up on optimality. In many cases the representation is less than half the size
- I wonder if the algorithm could be run with just the mapping from X->Y and without Y->Z (one way it gives up on optimality is in this latter mapping, which makes the representation more compact)

- The representations are smaller and computation is much faster
- SPUDD has to enumerate all states while VISA doesn’t?
- Computational cost is polynomial in the size of the “conditional probability trees of the DBN model”; what’s that specifically?
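Answering my own question as best I can: in structured solvers like SPUDD, each next-state variable's conditional distribution P(X' | parents) is stored as a decision tree whose internal nodes test parent-variable values and whose leaves hold distributions, so cost bounds scale with total tree size rather than state-space size. A sketch (the encoding and example are mine):

```python
class CPTNode:
    """One node of a conditional probability tree: internal nodes test
    a parent variable; leaves hold a distribution over the target
    variable's next value."""
    def __init__(self, variable=None, children=None, distribution=None):
        self.variable = variable          # parent variable tested here
        self.children = children or {}    # parent value -> subtree
        self.distribution = distribution  # at a leaf: {next_value: prob}

    def lookup(self, state):
        """Walk from the root to the leaf matching `state`."""
        node = self
        while node.distribution is None:
            node = node.children[state[node.variable]]
        return node.distribution

# P(has_coffee' | loc) under a "buy" action, as a tiny tree.
tree = CPTNode(variable="loc", children={
    "shop": CPTNode(distribution={True: 0.9, False: 0.1}),
    "office": CPTNode(distribution={False: 1.0}),
})
```

The tree stays small whenever most parent-value combinations share a distribution, which is exactly the structure these algorithms exploit.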

- “Each stand-alone task only distinguishes among values of a subset of the state variables, which means that complexity of learning does not necessarily increase with the number of state variables.”
- Ah, I see: VISA doesn’t leverage the transition probabilities and uses Q-learning. Leveraging the transition probabilities would allow other, more powerful planners to be used.
- Then they move onto how to construct option models without enumerating all states
- Talk about how to create partitions that maintain reward and transition distributions
- Some aspects of SPUDD can be borrowed in VISA (solving the reduced MDP)
- They also introduce an estimate of an option model that is known to be incorrect in certain ways, but again argue that the computational savings that are achieved are worth the loss in accuracy
- A number of extensions of the multi-time model, which I’m not sure is relevant at the moment
- More related work at end, relevant
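The "partitions that maintain reward and transition distributions" sound like a stochastic bisimulation; a sketch of checking that a candidate partition is stable (single action and exact probabilities for brevity; the encoding is mine):

```python
def block_of(partition, s):
    """Index of the partition block containing state s."""
    return next(i for i, block in enumerate(partition) if s in block)

def is_stable(partition, reward, trans):
    """Check that states grouped together earn the same reward and move
    into each block of the partition with the same probability; this is
    the condition under which solving the reduced MDP is safe.
    `trans[s]` maps next-states to probabilities."""
    for block in partition:
        ref = block[0]
        for s in block[1:]:
            if reward[s] != reward[ref]:
                return False
            for i in range(len(partition)):
                p_ref = sum(p for t, p in trans[ref].items()
                            if block_of(partition, t) == i)
                p_s = sum(p for t, p in trans[s].items()
                          if block_of(partition, t) == i)
                if abs(p_ref - p_s) > 1e-9:
                    return False
    return True

# a and b behave identically up to renaming; c is the goal.
reward = {"a": 0.0, "b": 0.0, "c": 1.0}
trans = {"a": {"c": 0.5, "b": 0.5},
         "b": {"c": 0.5, "a": 0.5},
         "c": {"c": 1.0}}
```

The approximate option model mentioned above presumably relaxes exactly this kind of condition to buy compactness.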
