Intrinsically Motivated Hierarchical Skill Learning in Structured Environments. Vigorito, Barto. IEEE Transactions on Autonomous Mental Development 2010

  1. Covers intrinsic motivation for learning in hierarchical environments
  2. “Using Bayesian network structure-learning techniques and structured dynamic programming algorithms, we show that reinforcement learning agents can learn incrementally and autonomously both the causal structure of their environment and a hierarchy of skills that exploit this structure.”
  3. Motivation is “ensemble” <what I think of as transfer> learning
  4. Mentions structured value iteration for VI in domains with compact representations
  5. Factored representations lead to sparseness
  6. Naturally, mention VISA by Jonsson, Barto
  7. But VISA needs a model of the domain to work from (in the form of DBN).  The goal here is to do model building so that assumption can be removed
  8. Produces results that are recursively optimal but may not be hierarchically optimal – full solution may be suboptimal, even though policy for each option is optimal
  9. Optimize DBN according to Bayes Information Criterion (BIC)
    1. Doing this optimally is intractable, they use a greedy heuristic
    2. Basically involves building decision trees using BIC (use a chi-squared test)
  10. Actions are purely explorative; are selected s.t. leaf (leaves) of <s,a> maximize change in entropy in leaf
  11. In original VISA, algorithm <ironically> doesn’t explore thoroughly enough, because action selection is myopic in terms of improving knowledge of DBN only as far as current state is concerned, so there is no real directed exploration in the way that RMAX does, mention a few possible fixes <although its not clear if any of them are “smart” in an RMAX sense>
    1. Mention work by Schmidhuber for intrinsic motivation (and its problems) <but why not RMAX as it does things correctly?>
    2. Then mention stuff by Barto that is better, but isn’t designed for factored domains
    3. <Oh, they will go into KWIK, Carlos’ stuff later in related work>
  12. To do learning, maintain a set C of what are called controllable variables, which are variables which the agent knows how to set to any possible value that feature can take
  13. When choosing an action, look for features that may be changed if more information is achieved.  Then make sure that features ancestors in the DBN are controllable.  If so, set up a plan to try and change the feature
  14. So how do you know when to stop refining your model for a feature in stochastic environments?
  15. <Its a little unclear to me how this happens.  There is some expectation for the probability that the agent will be able to change a feature to a particular value.  If it fails to do so with a ratio a certain value less than that, it is abandoned?  This doesn’t make sense so I must not understand it correctly>
  16. Looks like they do the RMAX trick, but the feature they pick to assign value to is the one that has controllable sources and the highest potential change in entropy <I thought Bayes info – is it the same?>
  17. “When options happen to be created prematurely and are malformed, their lack of utility is discovered fairly quickly by the agent when it attempts to use those  options in its experimental plans and they fail repeatedly. These options will be removed from the agent’s skill set until the agent performs more experiments relevant to discovering their structure, at which point they will be re-created and tested in further experiments. Once a correct option is learned, its empirical success rate will on average match its expected success rate, and the option will remain in the agent’s skill set to be used in
    all further experiments.”
  18. Experimental results
  19. The light box domain has 20 “lights” or binary variables: ~1 million raw states, 20 million <s,a> pairs
    1. Lights are separated into categories
      1. Circular, which are controlled directly by their switch
      2. Triangular, which are turned on if a particular set of circular lights are on (with own respective switch)
      3. Rectangular which depend on triangular (with own respective switch)
      4. Diamond which depend on rectangular (with own respective switch)
    2. Stochastic; actions “work” with p=0.9
    3. “However, if an action is taken to toggle a light whose dependencies are not currently satisfied, the entire domain is reset to all lights being off.”
      1. No way random actions can get anywhere
    4. Learning must be active so that exploration can be done in an intelligent directed manner
  20. <Seems to learn effective, but not “correct” policies, dependencies that are flat (1,2,3 needed on to turn on 4, but 1,2,3, independent of each other) end up being constructed serially so it looks like they are dependent (1, then 2, then 3 – even though they could be done in any order)>
  21. Not surprisingly, global (directed) exploration is the only thing that works in this domain
  22. Planning times for planning with options+primitives vs just primitives is options is flat with increasing problem complexity (level in the hierarchy — circular, triangular, rectangular, diamond) while primitives only has exponentially increasing planning cost with increasing complexity
  23. Mention KWIK algorithms for DBNs, knock them for having limited (and exponential cost in) in-degree, but you must have that in general problems to get an optimal solution – the greedy approach here only works in submodular domains

In Transit to Constant Time Shortest-Path Queries in Road Networks. Bast, Funke, Matijevic, Sanders, Schultes. Algorithm Engineering and Experiments (ALENEX)

  1. Beats previous best results by two orders of magnitude <was also published in Science, but the article is so short I’m not taking notes on it>
  2. In worst-case Dijkstras cannot be below linear because it may have to examine all nodes and vertices
    1. The only way to beat this is by preprocessing
  3. Constant query time can be achieved by superlinear storage <Upon looking up that reference, it really looks linear.  The response may be slightly suboptimal (but this amount is constant in the size of the graph)>
  4. The improvements in quality and speed here exploit the fact that roads have particular properties, like low branching factors, fact that highways are designed to be used for long distance travel…
  5. Idea of transit node is “that for every pair of nodes that are ‘not too close’ to each other, the shortest path between them passes through at least one of these transit nodes.”
    1. Additionally, even when travelling very far, the number of transit nodes that must be used is very small (like 10)
  6. In the US road data they consider, 24 million nodes, and 10,000 transit nodes
  7. Example (paper has figure): “Finding the optimal travel time between two points (flags) somewhere between Saarbr¨ucken and Karlsruhe amounts to retrieving the 2 × 4 access nodes (diamonds), performing 16 table lookups between all pairs of access nodes, and checking that the two disks defining the locality filter do not overlap. Transit nodes that are not relevant for the depicted query are drawn as small squares.”
  8. Mentions that there are special classes of algorithms for working on planar graphs, which roads (basically) are
  9. They have a previous paper that hierarchically builds paths based on increasingly long paths <should check out>
    1. This paper is by the same authors, and ends up doing something very similar to transit nodes (they are just nodes high in the hierarchy).  In the earlier paper access nodes were computed on the fly whereas here they are precomputed
    2. It also doesn’t deal with how to figure out the distance in the distance table is indeed the shortest distance
  10. A* is actually one of the best methods for real world planning (when done around landmarks and the the triangle inequality).  Here, access nodes can be used as the landmarks
  11. The hierarchy approach also needs to be grounded in actual geographic information to work <at small scales?>
  12. Nodes are labeled either close/not close to individual access nodes
  13. Access nodes are found by backwards search by Dijkstra’s, and do this until all paths converge on another transit node, then take all nodes found through this search that do not reach another transit node
  14. There are a number of ways to implement a locality filter; the one they recommend is to examine everything that is reachable within a cluster in the hierarchy
  15. This information allows for what is effectively constant time queries for shortest paths
  16. There is another way to implement the algorithm that is grid based – uses 2 grids of different resolution
    1. Needs some optimization to function efficiently
    2. Computing access nodes in grid-based implementation is very simple
  17. With the optimizations, long-distance queries via transit nodes is actually faster (1 million x) than local queries.  Although on average local queries that dont go through access nodes are about 1% of queries <not sure how this is measured>  it still dominates the running time
    1. There is a scheme also that allows speedups for local queries as well
  18. In the grid setting, there are tradeoffs between space, number of transit nodes <both increasing with resolution> and #local queries that don’t go through transit nodes <decreasing with resolution>
  19. Ah, to get the best of both worlds they use a hierarchy of grids
  20. The exact algorithm used to compute access nodes and transit nodes is confusing, and their diagram doesn’t help either.  Its written in English but would be much more understandable written in math (or at least pseudocode)
  21. Queries are super-easy to run (if nodes are more than 4 cells apart, path must go through a transit node, which already have pairwise distances computed)
  22. Can do hierarchical <seems to be just multiple griddings of different resolutions but I think there is another paper that may go into it more thoroughly>
  23. Do a greedy set cover to optimize storage for access nodes
  24. Implemented on data with 24m nodes, 60m edges
  25. Run on a ’07 dual-proc machine, 99% of queries have an avg time of 12 microseconds (adding the extra 1% moves it to 63 msec – are those the ones that don’t go through transit nodes?)
  26. Computation times are not “Dijkstra rank dependent”
  27. <Ok that’s not the end of the paper – just for the grid part, now moving on to a more fully hierarchical method>
  28. Highway hierarchies
  29. An edge is a highway edge is an edge that is in the shortest path between two vertices, but not in the local neighborhood of those vertices
  30. They also may contract edges based on a bypassbility criterion, but that isn’t yet discussed at length
  31. The hierarchy is then constructed by recursively building highway networks on top of each preceeding layer
  32. With increasing path distance, paths will go increasingly higher in the highway hierarchy
  33. So the nodes existing at some level in the hierarchy can be used as the transit nodes
  34. #transit nodes controlled by neighborhood size and level of hierarchy
  35. “Note that there is a difference between the level of the highway hierarchy and the layer of transit node search” <on this read through the distinction is actually not clear to me – looks like they are basically equivalent but in reverse order?>
  36. There is a gloss over how description of how access nodes are found – another paper is referenced <which I will try and get to> and point #20 is referenced
  37. Present 2 methods of the algorithm, which have different tradeoffs in terms of space, preprocessing, and query time
  38. Search is done top-down
  39. “For a given node pair (s, t), in order to get a complete description of the shortest s-t-path, we first perform a transit node query and determine the layer i that is used to obtain the shortest path distance. Then, we have to determine the path from s to the forward access node u to layer i, the path from the backward access node v to t, and the path from u to v”
  40. There is still room for improvement for the locality filters
  41. For US data, 99% of cases are handled at the top layer
  42. Search times start low and increase as searches are still local but further away, and then drop extremely low once nodes are far enough that transit nodes can be leveraged
  43. <On to conclusion>
  44. “Building on highway hierarchies, this can be achieved using a moderate amount of additional storage and precomputation but with an extremely low query time. The geometric grid approach on the other hand allows for very low space consumption at the cost of  slightly higher preprocessing and query times.”
  45. “There are many interesting ways to choose transit nodes. For example nodes with high node reach [9, 6] could be a good starting point. Here, we can directly influence |T |, and the resulting reach bound might help defining a simple locality filter. However, it seems thatgeometric reach or travel time reach do not reflect the inhomogeneous density of real world road networks. Hence, it would be interesting if we could efficiently approximate reach based on the Dijkstra rank.
    Another interesting approach might be to start with some locality filter that guarantees uniformly small local searches and to view it as an optimisation problem to choose a small set of transit nodes that cover all the local search spaces.”

How to Use Spanning Trees to Navigate in Graphs. Dragan, Xiang. FOCS 2009.

  1. Investigates how to navigate graphs efficiently using the spanning tree T of a graph G
  2. Navigation is done through knowledge of a nodes neighbors (or k-neighborhood) and T
  3. In wireless networks, a common heuristic is to simply move packets to the neighbor that is geographically closest to the destination.  Of course this scheme can run into problems so there are various methods to try and escape local optima
    1. If geographic distances arent applicable or unkown, other similar methods measure location based on proximity to a number of “beacon” nodes.  This usually works well in practice, but does not have theoretic guarantees
  4. If however, an embedding of the spanning tree is used instead of beacons, delivery is guaranteed
  5. Its possible to encode node labels compactly s.t. node distances can be found simply by comparing labels (of size log^2 n)
    1. So minimum spanning tree routing can be done just with these particularly encoded labels and an adjacency list for each node
  6. In some classes of graphs routing according to the MST will be optimal
  7. There are two additional tree routing strategies (aside from the obvious one) that need a small amount of information for each node (logarithmic) that encodes how far up the spanning tree it is
    1. These two strategies are “simpler and more compact” than the vanilla method, and in some cases the paths created will be shorter
  8. The results are all for particular classes of graphs which I’m not going to investigate in particular

How Optimized Environmental Sensing helps address Information Overload. Guestrin. Talk 2009

  1. When working in non-submodular functions there is <usually?> a trick that allows the problem to be solved.
    1. That is SATURATE
    2. Basically its as good as you can do
    3. Works by artificially allowing more sensors than allowed and then truncating the actual selection
    4. Of course this isn’t for all problems, but a subset of problems that are not natively submodular

Adaptive Submodularity: A New Approach to Active Learning and Stochastic Optimization. Glovin, Krause, Ray. Talk (NIPS 2010)


  1. Basic idea is consider submodular functions, they they are stochastic and unfold over time.  So instead of making all decisions in one shot (before execution begins), better to wait to see how the first decision unfolds before making the second, etc.
    1. Gives the example of influence in social networks – if you want to get people to plug a product you give to a few people for free, see how influence is spreading from the first person before deciding the second
  2. The setting described here is a generalization of standard suboptimality
  3. Analysis here allows many results from standard submodular optimization to the adaptive setting
  4. Instead of maximizing margin, maximizes expected margin based on current state, but does this only for the first selection, and the results of the first selection on state then become conditions for the next choice
  5. Gets standard bounds of 1-1/e
  6. Because adaptive, can now do things like “take the minimal number of actions needed to achieve X”, which you couldn’t do with standard submodularity (there, you can just say “maximize the benefit of N choices”
    1. This is a log-approximation ln(n)+1 for stochastic set cover
  7. Another application optimal decision trees
    1. “Generalized binary search” which is equivalent to max info gain
    2. This problem is adaptive submodular <I thought things like xor in decision trees arent submodular, but maybe because we are talking w.r.t. information gain it is?>
    3. Give bounds for this approximation
    4. “Results require that tests are exact (no noise)!”
  8. If trying to do decision trees in noisy setting, can do a Bayesian version
    1. But generalized binary search, maxing info gain, and maxing value of info aren’t adaptive submodular in noisy setting.  In noisy settings they require exponentially more tests than optimum <Should really read this paper>
  9. Trick is to transform noisy problem into normal problem by making outcome part of the hypothesis
    1. Only need to distinguish between noisy hypotheses that lead to different decisions
  10. Gives bounds for this Bayesian active learning for noisy domains
  11. Give an example of testing different psyhological/economic models of how people perform decision making
    1. For this setting, turns out many common approaches actually do much worse than random

Near-optimal Nonmyopic Value of Information in Graphical Models. Krause, Guestrin. UAI 2005.

<Working off the arxiv version, although it looks the same as the UAI one>

  1. For submodular functions, gives proof of a tractable approach for solving combinatoric problems that is within a factor of (1-1/e-ε) for ε > 0.
    1. Shows this is pretty tight, and that nothing can be better than (1-1/e) unless P=NP
  2. In terms of sensor networks, the goal is to maximize information gain, or equivalently, “decrease uncertainty about unobserved variables”
  3. A common criterion for node selection in sensor placement is to pick the location that has maximal entropy, given the previously selected locations.  This method has a fault, however, in that it will push sensed locations to the edges of the space.  The problem with this, in particular, is that sensors then have about half of their possible observation area (if its spherical) clipped by the boundaries
    1. Thats why the approach here, which is to minimize entropy of unsensed locations is better; it ensures more even coverage by sensors
  4. A set function is submodular if <among other possible things> it satisfies the diminishing returns property, that is, if a new piece of information is added to a set, or its superset, the gain in F will be larger with a set than its superset
    1. The marginal increase of X to A is the increase in F between A and X
  5. Exact maximization of submodular functions is NP hard (by reduction from max-cover)
  6. F also must be non-decreasing (like information “the information never hurts principle”; adding more sensors can never reduce the amount of information you have)
    1. So dealing with entropy is submodular.  Information gain is as well
  7. Basic theoretical results are from G. Nemhauser, L. Wolsey, and M. Fisher. An analysis of the approximations for maximizing submodular set functions. Mathematical Programming, 1978.
  8. Information gain is not always submodular <xor pattern in decision trees>, but given independence assumptions it is
  9. “A convenient property of submodular functions is that they are closed under positive linear combinations. This allows us, for example, to have temperature models for different times of the day, and select sensors that are most informative on average for all models.”
  10. Versions with budgets, individual unique costs per item
  11. Unit-cost case is squared cost, budgetedis O^5, but if willing to take another hit on accuracy this can drop down to cubic
  12. Has links to simulated annealing
  13. <Skipping the proof that the greedy bound is tight with any non-exponential solution>
  14. The ε in the proof comes from the fact that accurately computing conditional entropies as necessary is #P <I’m actually not familiar with what that means offhand, but based on context it must mean more than poly>, so a method of estimating entropies is used instead.
  15. <on to experimental results>
  16. In the first experimental result they place temperature sensors around their lab.  The quality of predictions with their method using 5 sensors is better than 20 sensors placed the standard entropy approach (that pushes sensors to boundaries and wastes part of their field)
  17. They had to lay down tiles to segment the continuous space of the lab <although in some segments there is more than one sensor, so I’m not sure exactly what they are doing.  Oh it looks there are candidate positions pre-selected, which have line-of-sight connectivity, so the hierarchy doesn’t impact that graph but sits on it.  Still not totally sure whats going on>
  18. Even when the assumptions of conditional independence are violated, the results are still good, worse than standard entropy with less than 5 sensors, but better with more than that
  19. Graph that shows linear run time <I thought it was poly, but they mention a simple and complex model, so one may be linear and the other poly?>
  20. Next results are on highway data.  Like in the temp results used a hierarchical model <although I’m still not sure what that means exactly>
  21. Here, they compared placement based on information gain, entropy, and random choice.  Random actually yielded more information gain than the entropy-based approach: “This indicates that maximizing entropy and information gain appear to be conflicting goals, resulting in qualitatively different behaviors.”

Transit Node Routing Reconsidered. Arz, Luxen, Sanders. Experimental Algorithms 2013.

<These notes are from the arxiv version>

  1. “Transit Node Routing (TNR) is a fast and exact distance oracle for road networks.”
  2. This paper
    1. Provides a means of speeding up preprocessing by an order of magnitude
  3. Basic idea in contraction hierarchies is to do some preprocessing to allow for speedups of Dijkstras
  4. TNR is one of the best methods, and can get queries down to almost constant time (a small number of table lookups)
  5. The basic assumption behind TNR is that queries that span long geographical distances almost always go through centralized connections <highways> called transit nodes

    1. Generally, there are only a few entrances to each transit node, so distances to these entrances as well as pairwise between transit nodes is stored
    2. This information allows for a locality filter, which “… indicates whether the shortest path
      might not cross any transit nodes, requiring an additional local path search.”
  6. A couple of issues with TNR traditionally is very expensive preprocessing, as well as the need for geographic information of the nodes (which a little bit of a poor fit when dealing theoretically with graphs; the hope is that would be abstracted away)
  7. On a rough level, contraction mappings order nodes by importance, and contracting them — removing them temporarily from the graph, and inserting new shortcut edges between that removed node and its neighbors only when necessary to maintain shortest path distances
  8. When doing search, all original edges are considered, as well as the additionally processed shortcut edges (although only those that lead to more important nodes).  <Its not clear if also the original edges are filtered this way, but it seems like it is the case, as the resulting graph forms a DAG.>
    1. Leads to a graph with particular structure (all shortest paths now have a midpoint), which allows for a Dijkstra’s with a minor extension to be used
  9. Basically, these tricks allow for linear as opposed to quadratic all-pair shortest paths to be computed
  10. “TNR in itself is not a complete algorithm, but a framework. A concrete instantiation has to fi nd solutions to the following degrees of freedom: It has to identify a set of transit nodes. It has to fi nd access node for all nodes.”
  11. “CHs order the nodes in such a way that nodes occurring in many shortest paths are moved to the upper part of the hierarchy. Hence, CH is a natural choice to identify a small node set which covers many shortest paths in the road network.”
    1. <This is a loose definition of course, and for a real-world mapping problem you can’t hope to tackle it with an exact solution, so the trick must be what sort of heuristic to use>
  12. The algorithm presented here does not require geometric information about the nodes, resolving that wrinkle
  13. <skimming some parts of the papers, like the lemmas>
  14. <the remainder of the paper is empirical results – actually on a DIMACS dataset of Europe with about 20M nodes/edges>
  15. <discussion of application of other common graph-search extensions/heuristics>
  16. <discussion>
  17. “In particular, at the price of twice the (quite fast) preprocessing time of contraction hierarchies, we get two orders of magnitude faster query time. Our purely graph theoretical locality filter outperforms previously used geometric filters. To the best of our knowledge, this eliminates the last remnant of geometric techniques in competitive speedup techniques for route planning.”
Tagged ,

A Local Algorithm for Finding Well-Connected Clusters. Lattanzi. Mirrokni, Zhu. ICML 2013 Talk.


  1. A good cluster should have few edges going relative to the size of the size of the cluster
  2. Called the conductance (clustering by cutting according to that criteria)
  3. Google research, so google scale, doesn’t want even linear data cost
  4. Algorithm is initialized on some vertex, moves around locally, and then finally produces some set S, hopefully with a low conductance
  5. Has theoretical bounds, cant do better than that due to Cheeger’s inequality
  6. Algs have runtime linear in size of cluster made from starting point
  7. Algorithm is quite simple
  8. Proposes a different metric than standard conductance; connectivity
  9. Want small conductance and high connectivity (imposed by a relative combination); want better connection inside than outside
  10. This metric improves theoretical bounds, this beats other random walk based spectral algorithms
  11. Use of this extra data allows for tighter bounds
  12. <Proofs>
  13. Claim a followup work that is local but is big-O optimal based on flow analysis

Learning Policies for Contextual Submodular Prediction. Ross, Zhou, Yue, Dey, Bagnell. (ICML 2013?) Talk.


  1. Related to RL but not a RL talk – has applications to lots of domains
  2. For robot motion planning, a common method is to start with some start state, produce a simple seed trajectory (perhaps based on features of start state), and then iteratively correct the trajectory so that it is out of collision through some local optimization.  If you cant do this, get a new seed trajectory and start again.
  3. Problem is due to time constraints, can only work from a small number of seeds, want to find any collision free path
  4. Related tasks: predicting news articles to read based on search history
  5. Both seed trajectories and recommended articles should have some sort of diversity
  6. These tasks are monotone submodular; adding seeds or recommended articles can only help, but benefit diminishes
  7. Greedy approach its within 63% of optimal
  8. Extends standard machine learning to be about making lists of predictions instead of single predictions
  9. Here, you dont know what the evaluation function is, but can try and learn what greedy approach works well on training set (based on features) and apply the same rules to the test set
    1. More specifically, make guesses based on environmental features as well as features of the current list of seeds
  10. SCP: Submodular Contextual Policy
  11. Iterative approach:
    1. Start with some policy for constructing lists (may start greedy) – proposes seeds
    2. Based on the set of seeds, compute new data set which is environmental features, as well as features of current items in list
    3. Trying to predict the benefit of adding a particular seed based on both sets of above features
    4. (Have to upweigh later seeds)
    5. Then this is run through standard RL (cost sensitive learner), which gives a new policy for selecting seeds
    6. Return to step 1
  12. ~63% of optimal, but extra error term from the supervised learning step <mentions that if its a no-regret learner that goes away, but what does that mean in this context?>
    1. Treats list generation as a bandit task basically
    2. Competes with result from oracle <pretty amazing>
Tagged ,

Behavioral and neurophysiological correlates of regret in rat decision-making on a neuroeconomic task. Steiner, Redish. Nature Neuroscience.

  1. Deals with regret as described in the RL literature as rats make decisions (they make a choice they can’t go back on, but may have gotten something better with a different choice)
  2. “In humans, the orbitofrontal cortex is active during expressions of regret, and humans with damage to the orbitofrontal cortex do not express regret.  In rats and nonhuman primates, both the orbitofrontal cortex and the ventral striatum have been implicated in reward computations.”
  3. In situations of high regret <presumably, the reward of the other option is presented, but not accessible> “… rats looked backwards toward the lost option, cells within orbitofrontal cortex and ventral striatum represented the missed action, rats were more likely to wait for the long delay, and rats rushed through eating the food after that delay.
  4. “Disappointment is the realization that a realized outcome is worse than expected; regret is the realization that the worse than expected outcome is due to one’s own mistaken action… that the option taken resulted in a worse outcome than an alternative option or action would have.”
  5. “Orbitofrontal cortical neurons represent the chosen value of an expected future reward… [and] has been hypothesized to be critical for learning and decision making, particularly in the evaluation of expected outcomes.”
  6. Ventral striatum is also implicated in evaluation of outcomes
  7. In rats, neural recordings show VS and OFC deal with reward, reward/value predictions.  “In the rat, lesion studies suggest that orbitofrontal cortex is necessary for recognition of reward-related changes that require inference, such as flavor and kind, while vStr is necessary for recognition of any changes that affect value.  In rats deliberating at choice points, vStr reward representations are transiently active before and during the reorientation process, but reward representations in OFC are only active after the reorientation process is complete.”
  8. The experiments are somewhat along the lines of the secretary problem – rats run around a loop and can stop at one of a number of places to get food.  When entering an area, a stochastic wait time (until food was released) was introduced.  A tone after entering indicated how long the wait would be.  Rats could either decide to wait for the food or proceed on.  Delays were IID and uniformly distributed
  9. Experiment run on 4 different rats – they all basically took a threshold approach where if the wait was less than a certain value they would wait, and otherwise they would move on.
    1. If they decided to skip the reward and move on, the delay on this was independent of delay.  Data indicates they made a decision upfront and did not simply wait for a period of time to move on (either they left after a short period of time, or waited around completely until food was delivered)
    2. Threshold between waiting and skipping tended to be related to each of the 1 of 4 flavors of possible food, one for each quadrant
    3. Upon delivery of reward, rats usually waited 20-25s to leave to get next rewards
  10. In variation of the task where one zone provided 3x amount of food, rats chose to wait longer for that larger amount
  11. 81, 86% of OFC, VS neurons responded to reward, respectively.  “Responses in bot OFC and vStr often differentiated among the four reward sites (…).”
    1. Used Bayesian classification to determine food quadrant from activity pattern (they included an extra zone from the 4 food quadrants to correspond to locations in the track between quadrants where food could not be obtained
    2. Results of classification between OFC and VS are overall qualitatively quite similar
  12. Both OFC and VS signals distinguish between zones both at time of entering zones, as well as at time reward was produced
  13. Responses in both areas when food delay was below threshold, but not so in cases when delay was above threshold
    1. “This suggests that these structures were indicating expected value, and predicting future actions.”
    2. This was tested specifically when presenting delays right around threshold (so the difference in possible delays was small, but rats would either choose to stay or go).  The same results were found, that activity was related to the decision and not the environment
  14. Now moving on to regret specifically
  15. For regret to be induced, the agent needs to know what outcome occurred, as well as what the expected outcome of all actions are – these conditions exist in this domain
  16. “Because the rats were time-limited on the Restaurant Row task, encountering a high-cost delay after not waiting through a low-cost delay means that skipping the low-cost delay was a particularly expensive missed opportunity.”
    1. These conditions did occur in the experiment, in some cases the rats would skip a below-threshold reward and then be faced with a high delay
  17. “Theoretically, the key to regret is a representation of the action not taken.”
    1. <Their interpretation of this is that is “… that there should be representations of the previous choice during the regret-inducing situations, particularly in contrast to control conditions that are merely disappointing.”  Usually in the RL literature it is done w.r.t. the expectation, but I guess this is reasonable here because the rat knows exactly what it passed up on – the noise is presented to the rat in the form of an audio cue.>
  18. To tease apart disappointment from regret they took sequences where the waits were the same, but the rat behavior differed.
    1. In this case, the rat acted optimally (by taking the short wait) but may be disappointed by the long wait following, as it may want to eat another time
    2. The second control is when cue 1 + 2 are above threshold (as opposed to just #2) and where the rat skipped both options.  Again, in this case, the options presented weren’t optimal, but the rat behaved optimally given the circumstances
    3. Experimental set-up had regret and control instances evenly distributed
  19. “Behaviorally, rats paused and looked backwards toward the previous option upon encountering a potentially regret-inducing sequence, but they did not do so in either control condition (…).”  In both the control instances, (where the rat acted correctly), the rat did not look back.
  20. “During potential regret instances, individual reward-responsive neurons in OFC and vStr showed activity patterns more consistent with the previous reward than the current one (…). Neural activity peaked immediately after the start of the look back toward the previously skipped, low-cost reward.”
    1. That is for individual neurons.  For the population as a whole, representation of the previous reward was weak, whereas representation related to the previous zone was stronger
    2. The representation of the previous zone (where the regret-inducing decision was made), did not occur in non-regret situations
  21. Regret didn’t only manifest through immediate behavioral <looking back> and neural responses but also in terms of future decision making.
    1. This is the case – rats tended to take the subsequent long delay (that they normally would reject) after rejecting the previous short delay (that they normally accept, and should do so optimally)
    2. They also ate *much* more quickly in the regret case where they accepted the bad offer than normal – the otherwise average case and both other controls are basically the same and are markedly different
  22. In regret cases, increased representation of the previous zone was correlated to accepting the bad offer, this wasn’t the case in controls, which both had a high-cost second choice
  23. There was a clear representation of the previous zone, but not other zones
  24. Earlier work implicates OFC and VS in calculating expectation of reward.
    1. “Our data indicate that violation of an expectation initiates a retrospective calculation of expectation, this retrospective calculation of expectation influences future behavior: rats are more willing to wait for reward after a regret instance.”
  25. “While some evidence suggests that OFC represents economic value, the representation of regret is more consistent with the hypothesis that OFC encodes the outcome parameters of the current, expected, or imagined state.  The data presented here are also consistent with the essential role of OFC in proper credit assignment.  Previous studies have identified potential representations of the counterfactual could-have-been-chosen option in rats, monkeys, and humans.”
  26. “The connectivity between OFC and vStr remains highly controversial, with some evidence pointing to connectivity and other analyses suggesting a lack of connectivity.”

Get every new post delivered to your Inbox.