Category Archives: Hierarchical

Discovering Hierarchy in Reinforcement Learning with HEXQ. Hengst. ICML 2002.

  1. HEXQ attempts to decompose and solve factored MDPs in a model-free manner
  2. Doesn’t deal with solving the decomposed MDP, just doing the decomposition itself
  3. Assumes:
    1. Some elements of the feature vector change less frequently than others
    2. The variables that change more often keep transition properties independently of more slowly changing variables
    3. “Interface between regions can be controlled.  For example, if a robot navigates around four equally sized rooms with interconnecting doorways (…) the state space can be represented by the two variables, room-identifier and position-in-room.  Most representations naturally label repeated sub-structures in this way.” <Seems to mean partitioning the feature vector along any particular dimension solely causes reasonable partitions of the state space>
  4. If these assumptions don’t hold, HEXQ will worst-case simply solve the flat problem
  5. Creates a hierarchy, with the maximum number of levels being the state dimension.  The bottom (level 1) is the variables that change most frequently
    1. “The rationale is that sub-tasks that are used most often appear at the lower levels and need to be learnt first.”
  6. Only the first level (fastest) interacts with the environment via primitive actions
  7. Start by observing the state variable that changes most frequently. “We now partition the states represented by the values of this variable into Markov regions.”  <I take this simply to mean regions of the flat state space corresponding to each possible assignment of that feature to a valid value, although not sure>
  8. “The boundaries between regions are identified by ‘unpredictable’ (see subsection 4.2) transitions which we call region exits.  We then define sub-MDPs over these regions and learn separate policies to leave each region via its various exits.”
  9. Regions are then combined with the next most frequently changing feature to make more abstract states at the next level in the hierarchy
    1. The exit policies then become abstract actions at that next level
    2. This results in a semi-MDP with one less feature in the state dimension and only abstract actions (exits, not primitive)
  10. The top-level has a sub-MDP that is solved by recursively calling lower-level MDP policies (exits) as its actions
  11. The feature ordering heuristic simply involves running a random trajectory and finding the frequency at which each feature changes
  12. Tries to model transitions as a directed graph <DBN?>, and random trajectories are taken.  “Following a set period of exploration in this manner, transitions that are unpredictable (called exits) are eliminated from the graph.”
  13. Transitions are unpredictable if:
    1. T or R is not stationary
    2. Some other variable changes value
    3. The terminal goal state is reached
  14. Entry states are reached after taking an exit (which is a state action pair <s^e, a>, where e is the level of the state)
  15. <Didn’t finish reading>
Advertisements
Tagged

Intrinsically Motivated Hierarchical Skill Learning in Structured Environments. Vigorito, Barto. IEEE Transactions on Autonomous Mental Development 2010

  1. Covers intrinsic motivation for learning in hierarchical environments
  2. “Using Bayesian network structure-learning techniques and structured dynamic programming algorithms, we show that reinforcement learning agents can learn incrementally and autonomously both the causal structure of their environment and a hierarchy of skills that exploit this structure.”
  3. Motivation is “ensemble” <what I think of as transfer> learning
  4. Mentions structured value iteration for VI in domains with compact representations
  5. Factored representations lead to sparseness
  6. Naturally, mention VISA by Jonsson, Barto
  7. But VISA needs a model of the domain to work from (in the form of DBN).  The goal here is to do model building so that assumption can be removed
  8. Produces results that are recursively optimal but may not be hierarchically optimal – full solution may be suboptimal, even though policy for each option is optimal
  9. Optimize DBN according to Bayes Information Criterion (BIC)
    1. Doing this optimally is intractable, they use a greedy heuristic
    2. Basically involves building decision trees using BIC (use a chi-squared test)
  10. Actions are purely explorative; are selected s.t. leaf (leaves) of <s,a> maximize change in entropy in leaf
  11. In original VISA, algorithm <ironically> doesn’t explore thoroughly enough, because action selection is myopic in terms of improving knowledge of DBN only as far as current state is concerned, so there is no real directed exploration in the way that RMAX does, mention a few possible fixes <although its not clear if any of them are “smart” in an RMAX sense>
    1. Mention work by Schmidhuber for intrinsic motivation (and its problems) <but why not RMAX as it does things correctly?>
    2. Then mention stuff by Barto that is better, but isn’t designed for factored domains
    3. <Oh, they will go into KWIK, Carlos’ stuff later in related work>
  12. To do learning, maintain a set C of what are called controllable variables, which are variables which the agent knows how to set to any possible value that feature can take
  13. When choosing an action, look for features that may be changed if more information is achieved.  Then make sure that features ancestors in the DBN are controllable.  If so, set up a plan to try and change the feature
  14. So how do you know when to stop refining your model for a feature in stochastic environments?
  15. <Its a little unclear to me how this happens.  There is some expectation for the probability that the agent will be able to change a feature to a particular value.  If it fails to do so with a ratio a certain value less than that, it is abandoned?  This doesn’t make sense so I must not understand it correctly>
  16. Looks like they do the RMAX trick, but the feature they pick to assign value to is the one that has controllable sources and the highest potential change in entropy <I thought Bayes info – is it the same?>
  17. “When options happen to be created prematurely and are malformed, their lack of utility is discovered fairly quickly by the agent when it attempts to use those  options in its experimental plans and they fail repeatedly. These options will be removed from the agent’s skill set until the agent performs more experiments relevant to discovering their structure, at which point they will be re-created and tested in further experiments. Once a correct option is learned, its empirical success rate will on average match its expected success rate, and the option will remain in the agent’s skill set to be used in
    all further experiments.”
  18. Experimental results
  19. The light box domain has 20 “lights” or binary variables: ~1 million raw states, 20 million <s,a> pairs
    1. Lights are separated into categories
      1. Circular, which are controlled directly by their switch
      2. Triangular, which are turned on if a particular set of circular lights are on (with own respective switch)
      3. Rectangular which depend on triangular (with own respective switch)
      4. Diamond which depend on rectangular (with own respective switch)
    2. Stochastic; actions “work” with p=0.9
    3. “However, if an action is taken to toggle a light whose dependencies are not currently satisfied, the entire domain is reset to all lights being off.”
      1. No way random actions can get anywhere
    4. Learning must be active so that exploration can be done in an intelligent directed manner
  20. <Seems to learn effective, but not “correct” policies, dependencies that are flat (1,2,3 needed on to turn on 4, but 1,2,3, independent of each other) end up being constructed serially so it looks like they are dependent (1, then 2, then 3 – even though they could be done in any order)>
  21. Not surprisingly, global (directed) exploration is the only thing that works in this domain
  22. Planning times for planning with options+primitives vs just primitives is options is flat with increasing problem complexity (level in the hierarchy — circular, triangular, rectangular, diamond) while primitives only has exponentially increasing planning cost with increasing complexity
  23. Mention KWIK algorithms for DBNs, knock them for having limited (and exponential cost in) in-degree, but you must have that in general problems to get an optimal solution – the greedy approach here only works in submodular domains

In Transit to Constant Time Shortest-Path Queries in Road Networks. Bast, Funke, Matijevic, Sanders, Schultes. Algorithm Engineering and Experiments (ALENEX)

  1. Beats previous best results by two orders of magnitude <was also published in Science, but the article is so short I’m not taking notes on it>
  2. In worst-case Dijkstras cannot be below linear because it may have to examine all nodes and vertices
    1. The only way to beat this is by preprocessing
  3. Constant query time can be achieved by superlinear storage <Upon looking up that reference, it really looks linear.  The response may be slightly suboptimal (but this amount is constant in the size of the graph)>
  4. The improvements in quality and speed here exploit the fact that roads have particular properties, like low branching factors, fact that highways are designed to be used for long distance travel…
  5. Idea of transit node is “that for every pair of nodes that are ‘not too close’ to each other, the shortest path between them passes through at least one of these transit nodes.”
    1. Additionally, even when travelling very far, the number of transit nodes that must be used is very small (like 10)
  6. In the US road data they consider, 24 million nodes, and 10,000 transit nodes
  7. Example (paper has figure): “Finding the optimal travel time between two points (flags) somewhere between Saarbr¨ucken and Karlsruhe amounts to retrieving the 2 × 4 access nodes (diamonds), performing 16 table lookups between all pairs of access nodes, and checking that the two disks defining the locality filter do not overlap. Transit nodes that are not relevant for the depicted query are drawn as small squares.”
  8. Mentions that there are special classes of algorithms for working on planar graphs, which roads (basically) are
  9. They have a previous paper that hierarchically builds paths based on increasingly long paths <should check out http://algo2.iti.kit.edu/schultes/hwy/esa06HwyHierarchies.pdf>
    1. This paper is by the same authors, and ends up doing something very similar to transit nodes (they are just nodes high in the hierarchy).  In the earlier paper access nodes were computed on the fly whereas here they are precomputed
    2. It also doesn’t deal with how to figure out the distance in the distance table is indeed the shortest distance
  10. A* is actually one of the best methods for real world planning (when done around landmarks and the the triangle inequality).  Here, access nodes can be used as the landmarks
  11. The hierarchy approach also needs to be grounded in actual geographic information to work <at small scales?>
  12. Nodes are labeled either close/not close to individual access nodes
  13. Access nodes are found by backwards search by Dijkstra’s, and do this until all paths converge on another transit node, then take all nodes found through this search that do not reach another transit node
  14. There are a number of ways to implement a locality filter; the one they recommend is to examine everything that is reachable within a cluster in the hierarchy
  15. This information allows for what is effectively constant time queries for shortest paths
  16. There is another way to implement the algorithm that is grid based – uses 2 grids of different resolution
    1. Needs some optimization to function efficiently
    2. Computing access nodes in grid-based implementation is very simple
  17. With the optimizations, long-distance queries via transit nodes is actually faster (1 million x) than local queries.  Although on average local queries that dont go through access nodes are about 1% of queries <not sure how this is measured>  it still dominates the running time
    1. There is a scheme also that allows speedups for local queries as well
  18. In the grid setting, there are tradeoffs between space, number of transit nodes <both increasing with resolution> and #local queries that don’t go through transit nodes <decreasing with resolution>
  19. Ah, to get the best of both worlds they use a hierarchy of grids
  20. The exact algorithm used to compute access nodes and transit nodes is confusing, and their diagram doesn’t help either.  Its written in English but would be much more understandable written in math (or at least pseudocode)
  21. Queries are super-easy to run (if nodes are more than 4 cells apart, path must go through a transit node, which already have pairwise distances computed)
  22. Can do hierarchical <seems to be just multiple griddings of different resolutions but I think there is another paper that may go into it more thoroughly>
  23. Do a greedy set cover to optimize storage for access nodes
  24. Implemented on data with 24m nodes, 60m edges
  25. Run on a ’07 dual-proc machine, 99% of queries have an avg time of 12 microseconds (adding the extra 1% moves it to 63 msec – are those the ones that don’t go through transit nodes?)
  26. Computation times are not “Dijkstra rank dependent”
  27. <Ok that’s not the end of the paper – just for the grid part, now moving on to a more fully hierarchical method>
  28. Highway hierarchies
  29. An edge is a highway edge is an edge that is in the shortest path between two vertices, but not in the local neighborhood of those vertices
  30. They also may contract edges based on a bypassbility criterion, but that isn’t yet discussed at length
  31. The hierarchy is then constructed by recursively building highway networks on top of each preceeding layer
  32. With increasing path distance, paths will go increasingly higher in the highway hierarchy
  33. So the nodes existing at some level in the hierarchy can be used as the transit nodes
  34. #transit nodes controlled by neighborhood size and level of hierarchy
  35. “Note that there is a difference between the level of the highway hierarchy and the layer of transit node search” <on this read through the distinction is actually not clear to me – looks like they are basically equivalent but in reverse order?>
  36. There is a gloss over how description of how access nodes are found – another paper is referenced <which I will try and get to> and point #20 is referenced
  37. Present 2 methods of the algorithm, which have different tradeoffs in terms of space, preprocessing, and query time
  38. Search is done top-down
  39. “For a given node pair (s, t), in order to get a complete description of the shortest s-t-path, we first perform a transit node query and determine the layer i that is used to obtain the shortest path distance. Then, we have to determine the path from s to the forward access node u to layer i, the path from the backward access node v to t, and the path from u to v”
  40. There is still room for improvement for the locality filters
  41. For US data, 99% of cases are handled at the top layer
  42. Search times start low and increase as searches are still local but further away, and then drop extremely low once nodes are far enough that transit nodes can be leveraged
  43. <On to conclusion>
  44. “Building on highway hierarchies, this can be achieved using a moderate amount of additional storage and precomputation but with an extremely low query time. The geometric grid approach on the other hand allows for very low space consumption at the cost of  slightly higher preprocessing and query times.”
  45. “There are many interesting ways to choose transit nodes. For example nodes with high node reach [9, 6] could be a good starting point. Here, we can directly influence |T |, and the resulting reach bound might help defining a simple locality filter. However, it seems thatgeometric reach or travel time reach do not reflect the inhomogeneous density of real world road networks. Hence, it would be interesting if we could efficiently approximate reach based on the Dijkstra rank.
    Another interesting approach might be to start with some locality filter that guarantees uniformly small local searches and to view it as an optimisation problem to choose a small set of transit nodes that cover all the local search spaces.”

Basis Function Construction for Hierarchical Reinforcement Learning. Osentoski, Mahadevan. AAMAS 2010.

  1. For semi-MDPs
  2. “By combining methods for automatic basis construction with HRL methods, such as MAXQ, we show how a hierarchical function approximator can be automatically formed based on a specific task hierarchy.”
  3. Spectral (graph Laplacian) approach that works from trajectory information (nodes are states and edges are actions – assumes deterministic?)
  4. Taxi, VFs can be constructed by conditioning on particular elements of feature vector (like a decision tree)
  5. Task hierarchy composes a SMDP into smaller, simpler subtasks
  6.  “Our approach is divided into three steps.  The first step constructs a graph for each subtask from the agent’s experience.  We show how graph construction can leverage the abstractions provided with the task hierarchy.  The second step constructs a reduced graph based upon the graph structure.  The third step recursively constructs basis functions using basis functions from child subtasks as well as using the eigenvectors of the graph Laplacian from the reduced graph.”
  7. Because spectral methods are undirected, they just keep track of state transitions and basically throw out the actions
  8. There is a simple method of merging vertices (called a reduced graph).
  9. Then eigenvectors of this <?>
  10. Consider 3 types of abstraction:
    1. Subtask irrelevance: eliminate state variables that are irrelevant to subtask.  This occurs from the reduced graph construction: “If the state variables in Y_i have no bearing on the probability transition function, they will be irrelevant in terms of connectivity on the graph and only X_i will be used to represent the state variables.”
    2. Shielding: If there is some state that causes a path to some option to terminate <prematurely?> it wont be represented
    3. Result distribution irrelevance: If the output depends of an option depends only on a particular feature <?>
  11. Empirical results in taxi and a manufacturing domain

Should really look for another description of the algorithm if it exists.  For some reason I’m finding the paper very unclear on my first read through.  Its one of the few papers I’ve read that would be more understandable with more math and less English.

  1. HEXQ attempts to decompose and solve factored MDPs in a model-free manner
  2. Doesn’t deal with solving the decomposed MDP, just doing the decomposition itself
  3. Assumes:
    1. Some elements of the feature vector change less frequently than others
    2. The variables that change more often keep transition properties independently of more slowly changing variables
    3. “Interface between regions can be controlled.  For example, if a robot navigates around four equally sized rooms with interconnecting doorways (…) the state space can be represented by the two variables, room-identifier and position-in-room.  Most representations naturally label repeated sub-structures in this way.” <Seems to mean partitioning the feature vector along any particular dimension solely causes reasonable partitions of the state space>
  4. If these assumptions don’t hold, HEXQ will worst-case simply solve the flat problem
  5. Creates a hierarchy, with the maximum number of levels being the state dimension.  The bottom (level 1) is the variables that change most frequently
    1. “The rationale is that sub-tasks that are used most often appear at the lower levels and need to be learnt first.”
  6. Only the first level (fastest) interacts with the environment via primitive actions
  7. Start by observing the state variable that changes most frequently. “We now partition the states represented by the values of this variable into Markov regions.”  <I take this simply to mean regions of the flat state space corresponding to each possible assignment of that feature to a valid value, although not sure>
  8. “The boundaries between regions are identified by ‘unpredictable’ (see subsection 4.2) transitions which we call region exits.  We then define sub-MDPs over these regions and learn separate policies to leave each region via its various exits.”
  9. Regions are then combined with the next most frequently changing feature to make more abstract states at the next level in the hierarchy
    1. The exit policies then become abstract actions at that next level
    2. This results in a semi-MDP with one less feature in the state dimension and only abstract actions (exits, not primitive)
  10. The top-level has a sub-MDP that is solved by recursively calling lower-level MDP policies (exits) as its actions
  11. The feature ordering heuristic simply involves running a random trajectory and finding the frequency at which each feature changes
  12. Tries to model transitions as a directed graph <DBN?>, and random trajectories are taken.  “Following a set period of exploration in this manner, transitions that are unpredictable (called exits) are eliminated from the graph.”
  13. Transitions are unpredictable if:
    1. T or R is not stationary
    2. Some other variable changes value
    3. The terminal goal state is reached
  14. Entry states are reached after taking an exit (which is a state action pair <s^e, a>, where e is the level of the state)
    1. In Taxi, all states are entries because the agent is reset randomly after the goal state is reached <But doesn’t this criteria not jive with their assumptions that the domains are basically shortest path tasks?>
  15. An abstract state can only be left by an exit
  16. Approach:
    1. Decompose the transition graph into strongly connected components (SCCs)
    2. The SCCs form a DAG
    3. SCCs can then be merged together, potentially in a hierarchy – goal is to do as much of this as possible in order to minimize the # of abstract states
    4. “Following the coalition of SCCs into regions we may still have regions connected by edges from the underlying DAG.  We break these by forming additional exits and entries associated with their respective regions and repeat the entire procedure until no additional regions are formed.”  <Not exactly sure what this means yet>
  17. “A region, therefore, is a combination of SCCs such that any exit state in a region can be reached from any entry with probability 1.  Regions are generally aliased in the environment.”
  18. Regions have sub-goals of reaching exits
  19. “We proceed to construct multiple sub-MDPs one for each unique hierarchical exit state (s1, s2, …, se) in each region. Sub-MDP policies in HEXQ are learnt on-line, but a form of hierarchical dynamic programming could be used directly as the sub-task models have already been uncovered.”
  20. A region may have multiple exits, and each exit may lead to a separate sub-MDP <But those sub-MDPs are on the same level according to the wording which I don’t understand:> “In the taxi example, the above procedure finds one hierarchical level 1 region as reflected in figure 2.  This region has 8 exits… As there are 4 hierarchical exit states we create 4 sub-MDPs at level 1 and solve them.”
  21. Taking an abstract action at level e is equivalent to executing the policy for the appropriate region at level e-1
  22. In taxi, level 1 has just one state.  Level 2 has 5 states <corresponding to the 4 passenger locations plus the 1 state at level 1?>.  The abstract actions involve being at one of the locations where a passenger is and either doing a pickup or a putdown  <It seems like relevant aspects of state are ignored though, so it just has to do with pickup or putdown in a particular location on the map, regardless of whether a passenger is in the car or not, or what the destination is?  Oh, it seems like information at the next level up encodes whether there is a passenger in the car or not>.
  23. <The trick to go from deterministic shortest path problems to stochastic shortest path problems feels a bit hacky.  Basically, they keep long and short term statistics recorded, and see if they match or not>
  24. In taxi, HexQ ends up with a representation not much larger than MaxQ, while learns relevant hierarchical information itself
  25. Exits must work with certainty, how to generalize the algorithm to stochastic case is set for future work
  26. “The two heuristics employed by HEXQ are (1) variable ordering and (2) finding non-stationary transitions.”
  27. A problem is that “An independent slowly changing random variable would be sorted to the top level in the hierarchy by this heuristic and the MDP would fail to decompose as it is necessary to explore the variable’s potential impact.”
  28. HEXQ, like MAXQ is recursively optimal

A Causal Approach to Hierarchical Decomposition in Reinforcement Learning. Anders Jonsson. Dissertation 2006.

  1. Considers planning in domains with factored state
  2. Goal is to hierarchical learning with do state abstraction for each subtask
  3. Because function approximation is problematic, instead focuses on hierarchical decomposition (temporal abstraction) and state abstraction
  4. Three models of “activities” (temporally extended actions) in RL
    1. Hierarchical Abstract Machines / HAMs (Parr, Russell)
    2. Options (Sutton et al)
    3. MAXQ (Dietterich)
  5. Starts with description of the H-Tree algorithm, which is based on McCallum’s U-tree algorithm (which isn’t designed for options, and was instead designed for POMDPs)
    1. Results show that option-specific state abstraction makes learning faster
  6. Next is VISA, which uses causal relationships (DBNs) to find hierarchical decomposition
    1. VISA uses the DBN to create a causal graph that describes how to get features to change
    2. “The VISA algorithm is useful in tasks for which the values of key state variables change relatively infrequently”
  7. HEXQ determines causal relationships by using the heuristic of how often state variables change
  8. It identifies “exits”, which are <s,a> pairs that cause the values of state variables to change.
  9. VISA tries to identify exits as well, but because it is working with more information (stronger assumptions) it is able to make more accurate and effective decomposition than HEXQ
  10. “The VISA algorithm exploits sparse conditional dependence between state v ariables such that the causal graph contains two or more strongly connected components. In addition, the algorithm is more efficient when changes in the values of state variables occur relatively infrequently . How likely is a realistic task to display this type of structure? It is impossible for us to determine the percentage of tasks in which the aforementioned structure is present. However, our intuition tells us that it is not uncommon to encounter tasks that display this structure.”
  11. In the DBN model, there is one DBN for each action
    1. <I would put the action in the same DBN as an additional input; this is more general (works for continuous actions for one) and is more compact in cases where action selection is highly (or completely) constrained>
  12. Its possible to learn DBNs from trajectory data, from an episodic generative model
    1. Their algorithm uses the Bayesian Information Criterion 
  13. Each of the four algorithms can be viewed as part of a larger system, which would:
    1. Learn a DBN
    2. Do hierarchical decomposition from the DBN
    3. Do (and then refine) state abstraction for each option
    4. Construct a compact representation of each option
    5. Learn the policy of each option, and the task option <the high-level policy?>
  14. The first step has the dominating computational cost
  15. Algorithms aren’t presented in this order, but in order they were developed
  16. The U-Tree algorithm is for POMDPs, keeps track of (a finite number) of past observations, and attempts to use that history to predict the next observation
  17. It then treats each leaf in the tree as a state in the MDP, and makes predictions about the probability of transitioning from leaf to leaf, and does Q estimates on leaf-action pairs
  18. Like normal decision trees, starts out as a stump and expands the tree when the data warrants it (with a Kolmogorov-Smirnov test)
  19. Now onto the H(ierarchical)-tree
  20. Its for semi-POMDPs
  21. Requires only minor changes, such as keeping track of option durations in the history
  22. Uses SMDP Q-learning over the leaves
  23. A criticism of U-tree is that you don’t know how to choose the length of the window for the history
    1. In H-Tree, this is less of a problem because options give much richer information, so in many problems only a couple of steps of history is needed
    2. <This doesn’t really solve the problem, it is perhaps a bit of a mitigation.  It seems elegant to me to do the same K-S test that is done on whether to grow the tree to also determine whether the history should be grown or not.>
  24. Does intra-option learning in the sense that an option that is common to two superoptions with be represented once
  25. Says its for POMDPs, but I’ve never seen it used for POMDPS <ah mentions this in the empirical section with TAXI – because its fully observable it just uses the current state with no other history>
  26. Epsilon-softmax exploration <you can’t do directed exploration with options, cause it just ends up slowing you down, although that proof didn’t exist at the time of this paper>
  27. There is faster learning and convergence to a better result when intra-option learning is used as opposed to leaving intra-option learning out <probably because the change in temperature for exploration goes too fast for learning without options>
  28. On the other hand, a plot with options vs flat shows flat learning takes a little longer to get close to convergence (about twice as long), but its also to a better result.
    1. This is blamed on the incorrect abstraction done by the tree
    2.  Also claim that epsilon-greedy exploration is worse for options <don’t but that – its true that the wrong option will push you far away, but the right option gets you very close to the goal, and that is what really matters.  With flat, you have to get extremely lucky with a series of right choices till you get to the goal>
  29. Admittedly, the algorithm has too many parameters
  30. Now onto VISA, which develops options and state abstraction from a DBN
  31. Builds on ideas from HEX-Q.  Whereas HEXQ uses samples to try and figure out dependencies on how to get a particular feature to change with an “exit” (based on orderings of how frequently features change), VISA uses a DBN to do so
  32. In the coffee task, for example, HEX-Q produces a linear hierarchy that is not representative of the actual dynamics  <I’m not sure if it always does this purely linear relationship or if it can represent general DBNs given enough data>
  33. Visa’s structure cares only about changing the value of features

The Role of the Ventromedial Prefrontal Cortex in Abstract State-Based Inference During Decision Making in Humans. Hampton, Bossaerts, O’Doherty. Journal of Neuroscience 2006.

  1. Finding in the paper is that activity in ventromedial prefrontal cortex (vmPFC) is more consistent with hierarchical than flat RL
  2. “These results suggest that brain regions, such as vmPFC, use an abstract model of task structure to guide behavioral choice, computations that may underlie the human capacity for complex social interactions and abstract strategizing.”
  3. In the task studied, the task flips in the middle so “…contingencies will reverse.”
    1. In particular, there are two actions that can be selected from, both with stochastic payoff.  Once the better action is selected 4 times straight, the payoff distributions switch (so the bad arm becomes the good one)
    2. This is a POMDP, where people need to infer the hidden state dynamics in order to play well
  4. PFC is tasked with encoding high-order structure, and is associated with higher cognitive functions such as working memory, planning, and decision making, and also seems to be involved in encoding abstract rules
  5. Goal is to see if PFC activity “… would correlate better with an abstract state-based decision algorithm than with simple RL.” <I don’t yet understand what the distinction between these two are but I’m sure we’ll get there>
  6. Of the RL models tested, Q-learning with softmaxy selection had the best fit to subject data <I’m not sure that means their methodology was bad/lacking RL algos, or if people are just terrible>
  7. The “abstract state-based model” is a Bayesian hidden state Markov model
  8. The way the model is set up is that: “The decision to switch is implemented on the basis of the posterior probability that the last choice was incorrect.”
    1. “The state-based model predicts the subjects’ actual choice behavior (whether to switch or not) with an accuracy of  92 +/- 2%”
  9. Subjects made the right choice 61 +/- 2% of the time <this isn’t so much better than chance>
    1. “This is also close to the optimal performance of the state-based model that was 64% (using the actual task parameters).” <Is it that hard?>
  10. <I don’t really like their terminology (they call the “prior” the posterior before reward signal and the “posterior” the postererior after the reward signal)>
  11. <Anyway,> “The prior correct signal” was found to correlate with medial PFC (mPFC), adjacent OFC, and amygdala
  12. Difference in what they call prior/posterior (reward prediction error) lead to signal in ventral striatum as is found elsewhere
  13. “The prior correct signal from the state-based model is almost identical to the expected reward signal from the RL model.  Nevertheless, our paradigm permits sharp discrimination between the two models.  The predictions of the two models differ immediately after a switch in the subjects’ action choice.”
    1. Basically, the MDP planner (non-pomdp) just uses average rewards of the two actions, whereas the pomdp planner tries to differentiate action quality based on hidden state
  14. At p < 0.01, “…abstract state-based decision making may be especially localized to the vmPFC.”
  15. “The critical distinction between the state-based inference model and standard RL is what happens to the expected value of the newly chosen stimulus after subjects switch.  According to standard RL, the expected value of the new choice should be low, because that was the value it had when the subject had previously stopped selecting it (usually after receiving monetary losses on that stimulus).  In contrast, the state-based algorithm predicts that the expected value for the newly chosen action should be high, because unlike standard RL, it incorporates the knowledge that when one action is low in value, the other is high.”
    1. “… the expected value signal in the vmPFC jumps up even before a reward is delivered on the newly chosen action… updating seems to occur using previous knowledge of the task structure.”
  16. “The final decision whether to switch or stay was associated with activity in the anterior cingulate cortex and anterior insula, consistent with previous reports of a role for these regions in behavioral control (…).  These regions are in close proximity to areas that were significantly correlated with the prior probability that the current choice was incorrect as provided by the decision model.  A plausible interpretation of these findings is that the anterior insular and anterior cingulate cortex may actually be involved in using information about the inferred choice probabilities to compute the decision itself.”

Motivation of Extended Behaviors by Anterior Cingulate Cortex. Holroyd, Yeung. Trends in Cognitive Sciences 2012

  1. Proposes that anterior cingulate cortex (ACC) performs selection and “maintenance” <learning?> of options
    1. This view can be seen as a way of tying together two different believed functions of ACC
    2. One theory is that ACC is seen as being related to motivation and starting behavior (lesions to ACC may lead to what seems to be outwardly catatonic behavior such as not responding to questions or speech even though physically the person is completely capable of doing so)
    3. The other is that it is thought to be related to cognitive control and RL
  2. Prior theories of ACC function actually are related to the deployment of cognitive control, as opposed to cognitive control itself
  3. Argue that differences in findings is due to narrow examination of deficits
  4. “In this Opinion we argue that ACC supports the selection and execution of coherent behaviors over extended periods…” which is hierarchical RL
  5. “In this view, ACC is more concerned with the selection and maintenance of the task itself than with the minutiae of task execution.  Thus, ACC would be responsible for engaging in a psychology experiment until its completion as opposed to implementing subtle behavioral adjustments along the way.”
  6. Previous work by Matt proposes that much of hierarchical control can be done in prefrontal cortex “… a region widely believed to be involved in supporting task sets […].”
    1. In this work they argue for ACC doing this instead of PFC
  7. Based on actor-critic model, but this work specifically proposes that ACC is “at the apex of bot pathways.”
    1. ACC selects and maintains options
    2. dorsolateral prefrontal cortex  and motor structures execute options (the actor)
    3. Orbitofrontal cortex and ventral striatum (the critic) evaluates progress to the terminating state of the option
  8. “… the dorsal striatum implements the policy of the actor, for example, by stopping the car at red lights…”
  9. Ventral striatum is the critic does long term credit assignment for actions selected
  10. “… oribtofrontal cortex provides the ventral striatum with information related to abstract goals, consistent with its role in contextually based action and reward evaluation […]; this information affords the basal ganglia flexibility to learn not only about primary rewards and punishments … but also about goal-related outcomes.”  <not clear to me what they mean exactly by this distinction>
  11. A few problems with current models
    1. Generally models have basal ganglia learning stimulus-response mappings, but this explanation seems to simplistic to describe how we can go about complex real world behaviors, as something else would be needed to get learning to scale.
    2. “… the architecture does not specify what task DLPFC should implement nor what goal orbitofrontal cortex should take as appropriate for the current task context.”
    3. Doesn’t determine “… the degree of vigor with which the task should be executed.”
  12. The ACC (and its proposed function in terms of HRL) removes these problems
  13. “…ACC decides what task to perform and then directs DLPFC to implement that task, which in turn provides top-down biasing signals to the dorsal striatum to facilitate execution of the chosen policy.  Thus whereas ACC selects the option-specific policy, DLPFC and the dorsal striatum together execute that policy… Furthermore, orbitofrontal cortex associates the termination state of each option with pseudo-reward, providing contextually appropriate reward information to the ventral striatum…”
  14. ACC also dictates how much effort should be expended in executing the option, and when to terminate the option
    1. If ACC doesn’t push much, basal ganglia will dominate behavior with actions selected at the low-level it functions on
  15. ACC also works on dopamine
  16. Theories hold that both phasic (short) and tonic (long) dopamine signals are used for shaping behavior
  17. From the HRL model,
    1. Low dopamine levels do option selection: facilitating “… the gating of a high-valued option into working memory…”
    2. High dopamine levels  do option maintenance: maintaining “… information in working memory until the option is completed …” 
  18. Further evidence of ACC for higher level behavior is patients with ACC lesions have mostly unimpared functioning in tasks where behavior is simple immediate stimulus-response tasks
    1. Conversely, lesions in ACC are associated with the lack of initialization of more effortful or spontaneous behavior.  The slowdown in initialization occurs because ACC is most active when starting an option
  19. Consistent with proposal ACC also activates when switching between high-level tasks of a problem

Mechanisms of Hierarchical Reinforcement Learning in Cortico-Striatal Circuits 1: Computational Analysis. Badre, Frank. Cerebral Cortex 2011.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3278315/

  1. “Growing evidence suggests that the prefrontal cortex (PFC) is organized hierarchically, with more anterior regions having increasingly abstract representations. How does this organization support hierarchical cognitive control and the rapid discovery of abstract action rules? We present computational models at different levels of description. “
  2. In the neural model, basal ganglia gate frontal actions, striatal units gate inputs to PFC, and others gate output to response selection
  3. Learning at all levels in the hierarchy is done through dopamine
  4. “This functionality allows the system to exhibit conditional if–then hypothesis testing and to learn rapidly in environments with hierarchical structure.”
  5. Also present a Bayesian RL mixture of experts (MOE) model, which finds the most likely hypothesis state based on their history
    1. It is “… intended to correspond to key computational features of the neural model.”
  6. “This model yields accurate probabilistic estimates about which hypotheses are attended by manipulating attentional states in the generative neural model and recovering them with the MoE model.”
  7. In a followup paper (although seems to be published contemporaneously) they look for neural correlates to this model
  8. There is evidence that frontal cortex manages hierarchical control, monitoring progress in super-and sub-tasks
  9. An earlier fMRI study on hierarchical learning “…suggest that from the outset of learning the search for relationships between context and action may occur at multiple levels of abstraction simultaneously and that this process differentially relies on systematically more rostral portions of frontal cortex for the discovery of more abstract relationships.”
  10. Authors work off a previous model of theirs for RL/working memeory and see if it can be extended for hierarchical use
    1. “In this model, the striatum modulates the selection of frontal cortical actions, including motor actions and working memory updating. It does so by gating the inputs to be maintained in frontal cortex (input gating) and gating which of these maintained representations has an influence on action selection (output gating).”
  11. Based on the extensions the model is more effective at working in hierarchical tasks
  12. Derive a Bayesian RL mixture of experts model to model the gating that is believed to occur in the brain
  13. Also try using the more strict neural model as an expert in the Bayesian mixture of experts <?>
  14. In terms of motor behavior, the premotor cortext selects candidate actions, and the basal ganglia “selectively amplifies representations of one of these candidates.”
  15. “Computational trade-offs indicate that it is adaptive to have separate systems implement gating and maintenance (…)”
  16. This relationship between prefrontal cortex and basal ganglia has been supported through studies on brain damage, as well as parkinsons
  17. “In cognitive tasks, it may be necessary to update and maintain multiple task-relevant items in working memory. In the corticostriatal network models, stimuli are selectively ‘input-gated’ into segregated PFC memory ‘stripes’ (structures of interconnected neurons that are isolated from other adjacent stripes; …). In this manner, motor response selection can then be contextualized by multiple potential working memory representations. However, in some scenarios only a limited subset of currently maintained PFC representations may be relevant for influencing the current motor decision (e.g., in tasks with multiple goals and subgoals, only some maintained representations are relevant for processing during intermediate stages). In such cases, the system can benefit from an additional ‘output-gating’ mechanism…
  18. Other pieces of information aside from working memory may also be relevant to decision making <natrually>
  19. The basic question is how to pick the motor responses and how to select what pieces of information are relevant to decision making
  20. Models are implemented at 2 levels of abstraction:
    1. “The first builds on existing neural models of corticostriatal circuits in reinforcement learning and working memory and extends this framework to accommodate hierarchical structure.
    2. The second is more abstract, but is still based on the neural model.  The idea is that this model can test whether individual behavior is based on behavior consistent with a hierarchical internal model or not
  21. The experiment was set up so a stimulus had 3 dimensions, and 3 actions could be selected.  In one version, the only option was to learn individual mappings from each stimulus to an action.  In the other version, there was structure to the problem so that rules could be composed that defined the optimal policy based on features of the problem set in a hierarchy
  22. The Bayesian model is composed of 3 experts – one for each action <?>
  23. Gating of experts is done according to state and accuracy at that state, as opposed to simple overall accuracy across the entire history
  24. The Bayesian model is tweaked to be mathematically a little suboptimal, in order to be more consistent with the neural model
  25. Weighing of experts and decision making by experts is both according to softmax
  26. The method of combining experts coverges to <soft?> optimal
  27. In the extension to the hierarchical model, each expert is composed of lower level experts
    1. Each lower expert can attend to a particular dimension of the input
  28. Individuals responses are checked for evidence of hierarchical understanding by doing a max likelihood fit to the hierarchical model according to their responses and looking at the weights
  29. They do some stuff to cut down on the number of hierarchical experts needed by parameterizing across all hierarchical experts and then estimating the parameters to decide which hierarchical expert to use (instead of training a polynomial number of them)
  30. As expected, the hierarchical model performs better than flat only in the case where the task actually has hierarchical structure
  31. An experiment where dopamine was turned off in the model lead to behavior that is not adaptive to reward, so in the model it is critical to improve performance
  32. Parts of the model that are related to hierarchy get downweighted with experience when the task is actually flat, because their assumptions dont fit the data – they have fMRI data that also shows this in the brain
  33. “Thus, these simulations show that the network learns hierarchical structure. This structure is abstract as it is not tied to any given feature but rather the general tendency for the higher order dimension (color) to increase propensity for gating the lower level dimension (shape/orientation). Thus, once these weights are learned, it is not clear that the ‘‘color’’ units in prePMD should be labeled as such because they now represent which of the other dimensions is relevant…
  34. “In summary, the neural model supports the notion that multiple BG–PFC circuits interact using standard reinforcement learning principles, modulated by dopaminergic prediction errors, to solve hierarchical tasks. Application of this model to a range of other hierarchical (RL and non-RL) tasks is outside the scope of the current study but is currently being investigated.
  35. Analysis of individual responses shows there is a large variability in how hierarchical their behavior seems to be
  36. Also a problem of overlearning where subjects may learn hierarchical structure in one block and then apply hierarchical rules to a flat problem in the next block of choices which leads to suboptimal behavior
  37. The analysis of the fMRI data is done in the companion paper
  38. “Our model builds on these prior notions by suggesting that—at least when rules have to be learned—the influence of anterior PFC on posterior PFC may be indirect (rather than directly corticocortical) such that action selection at one corticostriatal level is constrained by inputs from more anterior levels. In other words, hierarchical control may emerge, in part, from multiple nested frontostriatal loops.
  39. <Skipping related work section>

Notes from the companion paper on fMRI results:

  1. <I’m planning on skimming this as the primary paper was fairly long>
  2. Basic findings are:
    1. fMRI activation occurred in dorsal premotor cortex (PMd) and more rostral areas of premotor cortex (prePMd).  Learning declined in prePMd “…when no abstract rule [hierarchical?] was available”
    2. Subjects were capable of learning and applying abstract rules when they existed
    3. Variation across subjects of activation of prePMd early (but not late) in learning correlated with successful discovery of abstract rule when one existed in the task
    4. “striatum (caudate and putamen) showed functional connectivity with prePMd and PMd that was consistent across learning at different levels of the hierarchy.”
  3. These findings mean that search for rules may be undertaken based on context (contents of working memory), and may occur at multiple levels of abstraction at the same time
  4. “To summarize, then, based on an individual’s trial-to-trial sequence of responses and rewards, the MoE permits computation of 3 types of variables which can be correlated with the BOLD response:”
    1. The weight for each expert
    2. “whether model estimates of RPE, both generally and specifically associated with expected outcomes given a hierarchical rule, are systematically associated with regions of straiatum and frontal cortex…”
    3. If negative prediction error is associated with reduced activation in prePMd as predicted by the neural model
  5. Based on the model, 3 predictions:
    1. Activation in prePMd is related to more hierarchical understanding of the task
    2. “RPE [reward prediction error] associated with testing a hierarchical rule will be specifically associated with a local circuit between prePMd and striatum
    3. “RPE associated with hierarchical rule will correlated with individual differences in the decline in activation in prePMd during learning of the flat rule set.”
  6. prePMd is believed to be necessary but not sufficient for attention to hierarchy – and estimated weights don’t come directly from prePMd activity but something related to it

Using Relative Novelty to Identify Useful Temporal Abstractions in Reinforcement Learning. Simsek, Barto. ICML 2004.

  1. “… states that allow the agent to transition to a different region of the state space are useful subgoals, and propose a method for identifying them using the concept of relative novelty.
  2. Many earlier papers deal with planning when given a hierarchical abstraction, but the goal here is to find the abstraction
  3. HEXQ uses the heuristic of figuring out the frequency of feature change, with a layer of hierarchy for each variable, and each child operates a smaller sub-MDP
  4. Other methods are finding common subpolicies among collections of tasks
  5. Others are to measure visit count, reward gradient, or Maxflow
  6. The algorithm we propose also identifies subgoals in the state space and creates temporally-extended activities that take the agent efficiently to these subgoals. Our subgoals are states that allow the agent to transition to a part of the state space that is otherwise unavailable or difficult to reach from its current region. We call them access states. Some examples are a doorway between two rooms…”
  7. These subtasks can aid exploration <when it is undirected>, as well as transfer
    1. <The more I think of it, doing an experiment with the combination lock feels like it makes sense – the type of exploration people do has implications to the usefulness of hierarchy>
  8. Relative novelty is the metric used to determine whether to set a state as a subgoal
  9. And therefore the algorithm is called the Relative Novelty algorithm (RN)
  10. The algorithm only considers structure and does not consider reward
  11. Can be used in a sort of anytime mode <?>
  12. Novelty is defined as visitation to a state/region not “visited recently”
  13. Access states are targets, non-access states are non-targets
  14. The novelty of a state is defined as 1/sqrt(n_s)  where n_s is the number of times s has been visited
    1. This can also be defined over a set of states, and in that case, it is simply in terms of the mean visit count
  15. They are measured in a “short-term measure of novelty.”  This requires that computations aren’t based on the entire history, so some form of forgetting is needed.  They mention that decay can be used, but they just do a periodic reset – in episodic domains the end of the episode is a natural place to reset
  16. The relative novelty of a state is the ratio of the novelty of itself and following states to preceding states
    1. The amount of history that is factored in this way is a parameter
  17. Because the novelty is based on the trajectory (and not the actual structure of the domain) this measure will change from visitation to visitation, but targets should most often have high novelty
  18. <They have a figure depicting relative novelty score vs proportion of observed scores with different curves for target vs non-target.  Its very strange because the two curves really sit on each other for almost the whole graph, with a small amount of separation in the middle.  There are no error bars, so especially because of the closeness of the two curves and the fact that they are characteristically similar means the graph is of limited use. I imagine this can’t be the metric they are using to separate them, as they are mostly non-separable by this criteria>
  19. They cast detecting subgoals as classification problem
  20. They just come up with a thresholding rule
  21. Limitations of using the approach in domains where episodic experience is representative of the task <does this mean it must have good coverage?  I’m actually not clear on why this is necessary>
  22. Propose setting the threshold by measuring ROC
  23. When creating options, the initiation set is are the set of states visited “shortly” before novelty measured above threshold.
  24. Options are learned by experience replay with a reward function designed for that option
  25. Experimental results in Rooms and Taxi
  26. The competitor algorithm is Q-learning with e-greedy.  Naturally vanilla QL loses
  27. Subgoals identified by the method are what you would expect
  28. They note it would be nice to compare other methods that develop subgoals
  29. Mentions an algorithm Q-Cut and that that algorithm and this are trying to do the same thing, but that Q-Cut does so globally, and approximates this via a local method
    1. Naturally, as a global method, Q-Cut is expensive, with a O(n^3) computational (and probably memory) cost
    2. <They say this alg is O(1), but I think that is not accurate, as the above does all states for O(n^3), but in this algorithm is O(1) for each transition in each episode, so applied to all states its at least O(n), and then the above doesn’t have to be recomputed whereas this does, so as the number of episodes grows, the amortized cost of Q-Cut is constant, while this is linear>