Goals and Habits in the Brain. Dolan & Dayan, Neuron, 2013.

  1. Deals with “reflective” and “reflexive” decision making
  2. Reviews 5 generations of studies (some steps aggregated)
    1. Studies from the early-to-mid 20th century were the first to look into goal-directed and habitual behavior, then moving to
    2. Human neuroimaging studies, based on results from rodents
    3. Next 2 generations in terms of model-free and model-based RL
  3. Generation 0: Cognitive Maps
    1. Stimulus-response (S-R) theory was based on the concept of a stimulus becoming increasingly strongly associated with a behavioral response
    2. Others, however, argued that in maze tasks, for example, rats develop “a field map of the environment”
    3. Seems the first would be a model-free argument and the second model-based
    4. Experiments considered what learning occurs in the absence of reinforcement, which is called latent learning.
      1. Ex/ allowing rats to explore a maze with no reward present
      2. These experiments showed that animals pre-exposed to the environment learned a subsequent rewarded task more quickly than animals that were not
      3. Evidence of model-building
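As a toy illustration of latent learning (my own sketch; the 5-state corridor and all numbers are made up, not from the experiments): an agent that merely records transitions during unrewarded exploration can plan a policy the moment a reward is introduced.

```python
import random

random.seed(0)
N = 5  # states 0..4 of a hypothetical corridor "maze"

# Phase 1: unrewarded exploration -- only the transition model is recorded.
model = {}  # (state, action) -> next state
state = 0
for _ in range(500):
    action = random.choice([-1, +1])
    nxt = min(max(state + action, 0), N - 1)
    model[(state, action)] = nxt
    state = nxt

# Phase 2: a reward appears at state N-1; value iteration over the learned
# model yields a goal-directed policy with no further trial and error.
V = [0.0] * N

def qval(s, a):
    nxt = model[(s, a)]
    return 1.0 if nxt == N - 1 else 0.9 * V[nxt]  # reward 1 on reaching goal

for _ in range(50):
    for s in range(N):
        V[s] = max(qval(s, a) for a in (-1, +1))

policy = [max((-1, +1), key=lambda a: qval(s, a)) for s in range(N - 1)]
print(policy)  # every non-goal state steps rightward, toward the reward
```

A purely model-free learner would instead need repeated rewarded trials before its cached values pointed anywhere.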
    5. All this stuff predated concepts of dynamic programming and RL
    6. There were also experiments that covered “vicarious trial and error”, behavior expressed as motor hesitation and repeated examination of the environment.
      1. Animals that exhibited this behavior more frequently turned out to be better learners
      2. Reduction of this activity during learning was taken as evidence that the cognitive map was being consulted less and that more automatic control was taking over
    7. At the time, the idea of a cognitive map was new; now we know this does occur and is localized primarily in the hippocampus
      1. Likewise, hippocampal lesions in rats abolish the above “vicarious trial and error” behavior
  4. Generation 1: Goal-Directed Actions and Habits
    1. Although generation 0 gave some evidence of goal-directed behavior becoming habitual, there was the potential that this occurs only in spatial navigation tasks, or that it reflects some other (Pavlovian) relationship
    2. Therefore the first studies here had to do with a cognitive map in a nonspatial goal-directed domain (how do they define map then?), which was contrasted with the idea of habit
    3. Behavior is considered goal-directed if:
      1. The policy reflects knowledge of actions and their consequences.  This is called response-outcome (R-O) control. <“Consequences” is a loose term; in a sense a model-free approach also treats an action as having the consequence of moving toward states of highest expected reward, it just never represents that consequence explicitly>
      2. The outcome should be desirable at the moment of choice
    4. On the other hand, habitual behavior is considered to be shaped by past reinforcement, so it is not tightly connected to the immediate context
    5. “characteristics of habitual instrumental control include automaticity, computational efficiency, and inflexibility, while characteristics of goal-directed control include active deliberation, high computational cost, and an adaptive flexibility to changing environmental contingencies (Dayan, 2009)”
    6. P. 314 has a short outline of brain regions believed to be associated with different aspects of planning/behavior
    7. “This double dissociation makes a strong case that prelimbic regions are crucial for goal-directed performance, while infralimbic lesions prevent the emergence of habitual responding that overrides an initial dominance in goal-directed responding. However, it is likely that in the intact animal, there is a dynamic interdependency between goal-directed and habitual systems and that control is likely to emerge simultaneously and competitively”

    8. If habitual and goal-directed planning are happening simultaneously, it is important to understand the integration and competition of these systems
  5. Generation 2: Actions and Habits in the Human Brain
    1. Many studies on humans rely on fMRI studies of experiments formerly conducted on animals
    2. In a task where action selection produced food outcomes, satiety was induced on one food to devalue it, and actions leading to the devalued food were extinguished.
      1. “Of note was the observation that the BOLD signal in a ventral sector of orbitofrontal cortex [vmPFC] decreased for a devalued compared to a nondevalued action, leading the authors to conclude that this region plays a role in goal-directed choice.”
    3. “… suggestion that orbital prefrontal cortex implements encoding of stimulus value with dorsal cingulate cortex implementing encoding of action value”
  6. Generation 3: Model-Based and Model-free Analyses
    1. Computational models of decision making allowed for making exact predictions about differences between habitual and goal-directed behavior
    2. <Here too, forward search is described as a tree, but more generally it is a DAG; when branches merge in a DAG is where you can get large savings from earlier estimates>
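The savings the note above alludes to can be made concrete (a toy example of mine, not the paper's): on a grid where many action sequences merge into the same state, caching state values turns an exponential tree search into one visit per state.

```python
# Count evaluations in forward search over a move-right-or-up grid, with and
# without caching merged states. Costs (-1 per step) are arbitrary choices.
from functools import lru_cache

calls = {"naive": 0, "memo": 0}

def value_naive(x, y):
    calls["naive"] += 1
    if x == N and y == N:
        return 0.0
    options = []
    if x < N:
        options.append(-1.0 + value_naive(x + 1, y))  # unit step cost
    if y < N:
        options.append(-1.0 + value_naive(x, y + 1))
    return max(options)

@lru_cache(maxsize=None)
def value_memo(x, y):
    calls["memo"] += 1
    if x == N and y == N:
        return 0.0
    options = []
    if x < N:
        options.append(-1.0 + value_memo(x + 1, y))
    if y < N:
        options.append(-1.0 + value_memo(x, y + 1))
    return max(options)

N = 8
v1, v2 = value_naive(0, 0), value_memo(0, 0)
assert v1 == v2  # same answer; vastly different work
print(calls)     # naive expands every path; memo visits (N+1)**2 = 81 states
```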
    3. <The Tolman detour task isn’t a good model-based example.  You can set up the transition function so that, when walking along the path, a rock appears or doesn’t and then remains there forever, but it seems more reasonable to say either that the model is simply incorrect, or to state the task as a POMDP.>
    4. The search DAG can become huge quickly, and limitations on mental faculties prevent exhaustive search in any interesting setting
    5. Model-free and TD errors
      1. Dopamine and predictive error
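A minimal TD(0) sketch of the prediction-error idea (my illustration; the two-state cue-then-outcome episode and the learning rate are invented): early in learning the error δ = r + γV(s′) − V(s) occurs at the outcome, and with training the outcome becomes fully predicted while the expectation migrates back to the cue, mirroring the classic dopamine recordings.

```python
# Two-state episode: "cue" -> "outcome" (reward 1.0) -> end of trial.
gamma, alpha = 1.0, 0.5
V = {"cue": 0.0, "outcome": 0.0}
history = []
for _ in range(30):
    d_cue = 0.0 + gamma * V["outcome"] - V["cue"]   # TD error at cue time
    V["cue"] += alpha * d_cue
    d_out = 1.0 + 0.0 - V["outcome"]                # TD error at reward time
    V["outcome"] += alpha * d_out
    history.append((d_cue, d_out))

print(history[0])   # (0.0, 1.0): initially, all of the error sits at the reward
print(history[-1])  # near (0, 0): the reward is now fully predicted...
print(V["cue"])     # ...and the cue carries the expectation (~1.0)
```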
    6. <Says that model-free has minimal computation but large memory requirements.  This isn’t accurate, as representing the Q function is less expensive than trying to represent the transition function.  I suppose in terms of working memory it is cheaper.>
    7. Model-free methods are statistically inefficient
      1. <Generally yes, but Delayed Q-learning is as statistically efficient as anything from the model-based camp; I guess when they say model-based they again mean goal-directed behavior>
    8. Model-free control has no immediate sensitivity to devaluation
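This insensitivity can be shown in a few lines (my own toy lever-pressing setup, not the paper's task): after outcome devaluation, a cached model-free Q-value still recommends the old action, while evaluating outcomes through a simple model adjusts immediately; the cache only catches up through re-experience.

```python
# Toy devaluation experiment; all values and names are illustrative.
reward_of = {"food": 1.0, "nothing": 0.0}
outcome_of = {"press_lever": "food", "do_nothing": "nothing"}  # the "model"

# Model-free: Q-values cached from earlier training, when food was valuable.
Q = {"press_lever": 1.0, "do_nothing": 0.0}

# Devaluation: the animal is sated; food is now mildly aversive.
reward_of["food"] = -0.5

model_free_choice = max(Q, key=Q.get)  # cached value is blind to devaluation
model_based_choice = max(outcome_of, key=lambda a: reward_of[outcome_of[a]])
print(model_free_choice, model_based_choice)  # press_lever do_nothing

# Only by re-experiencing the devalued outcome does the cache catch up.
alpha = 0.5
for _ in range(10):
    Q["press_lever"] += alpha * (reward_of["food"] - Q["press_lever"])
print(max(Q, key=Q.get))  # do_nothing
```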
    9. <I think another big issue here is that almost all of the theoretical work assumes stationarity.  In the real world, however, nonstationarity does exist. In this paper one of the big distinctions is drawn in the case where the reward is nonstationary and something that was desirable becomes undesirable or vice versa>
    10. “These [initial imaging studies on human model-free control] showed that the BOLD signal in regions of dorsal and ventral striatum correlated with a model-free temporal difference prediction error, the exact type of signal thought to be at the heart of reinforcement learning. A huge wealth of subsequent studies have confirmed and elaborated this picture.”
    11. <Need to read the experiments they discuss (Daw et al. 2011, Glascher et al. 2010).  At least the way those studies are summarized here does not accurately reflect the math going on in model-free and model-based RL>
    12. There is also evidence of simultaneous control by a mixture of habitual and goal-based behavior
    13. <Daw et al. 2011 is mentioned again and the description is too brief for me to actually process. Will need to read that paper.>
    14. Evidence that the planning done during goal-based control is used to train the model-free system, as a part of the brain believed to deal with TD-like errors is also active during some goal-based planning.
    15. At least in spatial navigation tasks, there is evidence of “rollouts” happening, as hippocampal regions activate starting with the region that represents the current location of the subject
    16. There is a region of the hippocampus that seems to be connected to reward processing and is used to define success (or failure) in goal-directed search
    17. Even while sleeping, animals seem to replay trajectories of experience (analogous to Dyna); backward replay seems to happen too, which in models such as Dyna-Q helps propagate reward signals from the goal to other states more rapidly <Been a long time since I’ve read about that, don’t quite recall>
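The effect of backward replay can be sketched in a few lines (my own toy corridor, not from the paper): one reverse sweep over a remembered trajectory propagates the reward all the way back to the start, where a forward sweep barely moves it.

```python
gamma = 0.9
N = 6
# One remembered run 0 -> 1 -> ... -> 5, with reward 1.0 on the final step.
episode = [(s, s + 1, 1.0 if s + 1 == N - 1 else 0.0) for s in range(N - 1)]

def replay(transitions):
    V = [0.0] * N
    for s, s2, r in transitions:       # one sweep of remembered transitions
        V[s] = r + gamma * V[s2]
    return V

forward = replay(episode)                    # reward reaches only state 4
backward = replay(list(reversed(episode)))   # reward reaches state 0 at once
print(forward)
print(backward)
```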
  7. Generation 4: Elaborations on Model-Based and Model-free Control
    1. Model-free vs model-based control, as well as model-based in isolation
    2. “white matter connections between premotor cortex and posterior putamen is reported to predict vulnerability to ‘slips of action'” – density in the putamen is also linked to such behavior
    3. Dopamine is believed to play a role in both systems; its been directly linked to TD-Errors, but it projects to other regions implicated in both model-based and model-free control
    4. Administering L-DOPA (boosts influence of dopamine) resulted in actions that were more model-based
    5. Evidence that the anterior caudate nucleus keeps track of progress through a planning decision tree, while the putamen tracks habitual responding
    6. The vmPFC (ventromedial prefrontal cortex) seems to receive the selected action from whichever of those two regions wins control
    7. There is evidence, however, that what gets activated depends on the type of task.  For example, navigational tasks rely on the hippocampus, in some tasks the vmPFC doesn’t relate to value, etc
    8. Evidence that a model-based strategy is used to generalize learning
    9. Explicit instruction boosts the use of model-based control (makes sense, as that is where it is immediately useful)
    10. Dyna-Q analogues
  8. Generation 5: The Future
    1. How does model-based control actually happen (is it more complex than model-free)?
    2. How is disagreement between the model-based and model-free parts mediated?
    3. How is this relevant / how does it interact with other forms of conditioning?
    4. How do we prune when searching?  Dealing with the whole tree doesn’t seem possible
    5. Stuff like Dyna-2 (Silver), maintaining value functions and using them to guide forward search
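A minimal sketch of that idea (the state names, rewards, and cached values are all my inventions): search forward only a few steps, then bottom out at cached value estimates instead of expanding the full tree.

```python
def search(state, depth, model, Vhat, gamma=0.9):
    """Depth-limited forward search with cached leaf values (toy sketch)."""
    if depth == 0:
        return Vhat[state]  # cached estimate stands in for the unexpanded subtree
    return max(r + gamma * search(nxt, depth - 1, model, Vhat, gamma)
               for nxt, r in model[state].values())

# Hypothetical three-state world: from A, "right" leads toward C, which the
# cached values already know to be good.
model = {
    "A": {"left": ("B", 0.0), "right": ("C", 0.0)},
    "B": {"stay": ("B", 0.0)},
    "C": {"stay": ("C", 1.0)},
}
Vhat = {"A": 0.0, "B": 0.0, "C": 5.0}
result = search("A", depth=2, model=model, Vhat=Vhat)
print(result)  # 4.95: one searched reward plus the discounted cached estimate
```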
    6. Is it possible problems are being turned into the sort of probabilistic inference problems that are tackled in the cortex to interpret input?
    7. There are two models of the interplay between model-based (MB) and model-free (MF) control.
      1. One is that MB is used to train MF, helping boost MF control
      2. The other is that they compete
      3. Looks like both are going on to some degree, and the relative amounts of the two varies between individuals
    8. Theory of arbitration between the two is currently underdeveloped
      1. Want to trade the inexactness of MF against the time requirements of MB; perhaps MB is used until the error of MF becomes low.  Some evidence that in tasks where the TD error can’t get low (because of noise in the problem) MB is used more extensively
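One way to make such an arbitration concrete (a sketch in the spirit of uncertainty-based accounts; the threshold, learning rates, and noise levels are all my inventions): track a running average of the magnitude of the model-free prediction error and cede control to the cheap model-free system only once that error is reliably small. With an irreducibly noisy reward the error never settles, so control stays model-based.

```python
import random

random.seed(1)

def controller_share(reward_noise, trials=500, alpha=0.1, threshold=0.2):
    """Fraction of trials on which control stays model-based (toy model)."""
    V, err_bar, mb_trials = 0.0, 1.0, 0
    for _ in range(trials):
        r = 1.0 + random.gauss(0.0, reward_noise)
        delta = r - V                              # model-free prediction error
        V += alpha * delta
        err_bar += alpha * (abs(delta) - err_bar)  # running mean of |error|
        if err_bar > threshold:                    # cache still unreliable:
            mb_trials += 1                         # rely on model-based control
    return mb_trials / trials

print(controller_share(0.0))  # low-noise task: model-free soon takes over
print(controller_share(0.8))  # irreducibly noisy task: stays model-based
```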
    9. Some evidence of actor-critic style models
    10. Relations to Pavlovian conditioning and psychopathology, the latter is outside of the realm of what I’m considering now though.
      1. OCD as overuse of habitual system?  Show lack of sensitivity to outcome devaluation
    11. Overdominance of the model-based system perhaps related to paranoia, delusions, and hallucinations?
      1. Increasing dopamine function in Parkinson’s patients can lead to psychotic phenomena
