Habits, action sequences, and Reinforcement Learning. Dezfouli, Balleine.

  1. “…reconceptualizing habits as action sequences allows model-based RL to be applied to both goal-directed and habitual actions in a manner consistent with what real animals do.”
  2. Goal-directed and habit learning processes seem to be localized in different, parallel parts of the brain.
  3. “We develop a model in which we essentially preserve model-based RL and propose a new theoretical approach that accommodates both goal-directed and habitual action control within it… it also provides a model that uses the prediction error signal to construct habits in a manner that accords with the insensitivity of real animals to both reinforcer devaluation and contingency degradation.”
  4. When exploring, animals tend to resist re-executing sequences of actions.
  5. “…with persistence, actions change their form, often quite rapidly. Errors in execution and the inter-response time are both reduced and, as a consequence, actions previously separated by extraneous movements or by long temporal intervals are more often performed together and with greater invariance.”
  6. Practice turns variable, flexible MB behavior into rapidly deployable, largely invariant action sequences.
  7. “…neural evidence suggests that habit learning and action sequence learning involve similar neural circuits…” [PFC and associative striatum]
    1. As sequences become more routine, they shift to the sensorimotor striatum.
  8. The dorsolateral striatum is implicated in habit learning (lesions there make habitual behavior easier to stop).
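
The devaluation insensitivity mentioned in the notes above can be made concrete with a toy contrast. This is a minimal sketch under my own assumptions, not the model from the paper: the lever and outcome names and their values are hypothetical. The goal-directed choice looks up each outcome’s current value through an action-outcome mapping every time it decides, while the habit is a cached action sequence replayed open-loop, so devaluing its outcome leaves it unchanged.

```python
# Toy contrast (hypothetical setup, not the paper's model): two levers,
# each leading to a different food outcome.
ACTION_OUTCOME = {"left_lever": "pellet", "right_lever": "sucrose"}


def goal_directed_choice(outcome_values: dict[str, float]) -> str:
    """Consult the action-outcome model and current values at decision time."""
    return max(ACTION_OUTCOME, key=lambda a: outcome_values[ACTION_OUTCOME[a]])


def run_sequence(sequence: list[str]) -> list[str]:
    """Replay a cached sequence open-loop; outcome values are never consulted."""
    return list(sequence)


values = {"pellet": 1.0, "sucrose": 0.5}
# Sequence consolidated through practice while pellets were worth more.
habit = ["approach_panel", "left_lever", "go_to_magazine"]

before = goal_directed_choice(values)  # "left_lever"
values["pellet"] = 0.0                 # devalue pellets (e.g., specific satiety)
after = goal_directed_choice(values)   # "right_lever"
replayed = run_sequence(habit)         # still presses "left_lever"
```

The point of the design is that `run_sequence` takes no values as input, so nothing about the devaluation can reach it until control returns to the goal-directed system.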

Actions, Action Sequences and Decision Making: Evidence that Goal-Directed and Habitual Action Control are Hierarchically Organized.   

[This paper isn’t written in a way that is easy to understand, so I didn’t make it through.]

  1. One perspective on MB vs. MF control is that the two systems compete against each other, with an arbitration mechanism deciding which one controls behavior.
  2. Another perspective is that “…the interaction between these systems has been recently argued to be hierarchical such that the formation of action sequences underlies habit learning and a goal-directed process selects between goal-directed actions and habitual sequence of actions to reach the goal.”
  3. Based on their experiments, they argue that action sequences are what is actually going on.
  4. Their Bayesian model fits the data better than a flat model.
  5. “Although these findings do not rule out all possible model-free accounts of instrumental conditioning, they do show such accounts are not necessary to explain habitual actions…”
  6. “Although these features of goal-directed and habitual action are reasonably well accepted, the structure of habitual control, and the way it interacts with the goal-directed process in exerting that control, is not well understood.”
  7. Habits are treated as the execution of open-loop sequences of behavior.
  8. “On this hierarchical view, such action sequences are utilized by a global goal-directed system in order to effectively reach its goals. This is achieved by learning the contingencies between action sequences and goals and assessing at each decision point whether there is a habit that can achieve that goal. If there is, it executes that habit after which control returns to the goal-directed system.”
    1. Their example, in which walking across the street is a single action sequence, doesn’t fit the system they describe: crossing the street is exactly where you need conditional behavior. If there is a car, wait; if not, walk.
    2. If there were a mechanism for interrupting the sequence, or if the sequence only went as far as the point where you check for traffic, I suppose it would be all right, but overall this view seems too simplistic (dangerous) to be viable.
  9. They argue that the results of Daw et al. (2011) aren’t due to mixed MB+MF learning but to action sequences instead.
  10. Basically, it looks like whenever a second-level action is rewarded, the first- and second-level actions are repeated together, even if the state reached at the second level differs and should be responded to with a different action.
  11. I’m not following an argument they make which seems to be important:
    1. “Previously, we focused on trials with a different slot machine to the one in the previous trial. This was because, in this condition, flat and hierarchical accounts provide different predictions. When the slot machine is the same, both accounts (flat and hierarchical) predict that being rewarded in the previous trial increases the probability of staying on the same second stage action. In addition to this prediction, the hierarchical account predicts that when the slot machine is the same as the one on the previous trial, this increase should be higher than the increase when the slot machine is different. This is because, when the slot machine is different, staying on the same second stage action is driven by execution of the previous action sequence whereas, when the slot machine is the same, executing either the previous action sequence or a goal-directed decision at the second stage can result in staying on the same second stage action.”

    1. Oh, I think this is so confusing because when they say “hierarchy” they mean “action sequence”? Isn’t that exactly what hierarchy isn’t? They should introduce a different term instead of redefining an existing one.
    2. Also, there are two bandits, but they are unclear about which one is being discussed.
    3. I think this is critical, and I just can’t make sense of it. The moral of the story is that sometimes you need to write actual math to make things understandable. Skipping the remainder.
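
For what it’s worth, here is the quoted argument written out as an analysis. This is a minimal Python sketch under my own reading: “slot machine” is the second-stage state and “second stage action” is the key pressed there, so the same key can be pressed on a different machine. The agent is a hypothetical caricature, not Dezfouli and Balleine’s model: after a rewarded trial it replays the previous (first-stage, second-stage) sequence open-loop with probability `P_SEQUENCE`; otherwise its first-stage choice is random and its second-stage choice is a softmax over the last outcome seen for each key on the current machine. Rewards are coin flips, and every parameter name and value is made up.

```python
import math
import random
from collections import defaultdict

# Hypothetical toy parameters (my assumptions, not values from the paper).
P_COMMON = 0.7    # first-stage action leads to "its" slot machine this often
P_SEQUENCE = 0.5  # after a reward, chance of replaying the last (a1, a2) sequence
BETA = 5.0        # softmax inverse temperature for goal-directed key choice
P_REWARD = 0.5    # every key pays off with the same probability (toy)


def softmax_choice(q0: float, q1: float, rng: random.Random) -> int:
    """Choose key 0 or 1 with softmax probabilities over two values."""
    p1 = 1.0 / (1.0 + math.exp(-BETA * (q1 - q0)))
    return 1 if rng.random() < p1 else 0


def simulate(n_trials: int, seed: int = 0) -> list[tuple[int, int, int, int]]:
    """Simulate the toy agent; each trial is (a1, machine, a2, reward)."""
    rng = random.Random(seed)
    q = [[0.0, 0.0], [0.0, 0.0]]  # q[machine][key] = last outcome (learning rate 1)
    trials: list[tuple[int, int, int, int]] = []
    for _ in range(n_trials):
        prev = trials[-1] if trials else None
        # Open-loop habit: after a reward, sometimes replay the whole sequence.
        replay = prev is not None and prev[3] == 1 and rng.random() < P_SEQUENCE
        a1 = prev[0] if replay else rng.randrange(2)
        machine = a1 if rng.random() < P_COMMON else 1 - a1
        if replay:
            a2 = prev[2]  # same key regardless of which machine was reached
        else:
            # Goal-directed choice using the values of the current machine.
            a2 = softmax_choice(q[machine][0], q[machine][1], rng)
        reward = 1 if rng.random() < P_REWARD else 0
        q[machine][a2] = float(reward)
        trials.append((a1, machine, a2, reward))
    return trials


def stay_probabilities(trials):
    """P(repeat the previous key), split by same/different machine and by
    whether the previous trial was rewarded."""
    counts = defaultdict(lambda: [0, 0])  # condition -> [stays, total]
    for prev, cur in zip(trials, trials[1:]):
        condition = ("same" if cur[1] == prev[1] else "different",
                     "rewarded" if prev[3] else "unrewarded")
        counts[condition][0] += int(cur[2] == prev[2])
        counts[condition][1] += 1
    return {c: stays / total for c, (stays, total) in counts.items()}


probs = stay_probabilities(simulate(20_000, seed=1))
same_effect = probs[("same", "rewarded")] - probs[("same", "unrewarded")]
diff_effect = probs[("different", "rewarded")] - probs[("different", "unrewarded")]
```

With these made-up settings, reward raises the stay probability in both conditions, but more when the slot machine is the same: there, both the replayed sequence and the goal-directed choice favor the previous key, whereas with a different slot machine it is mostly the replayed sequence that does. A flat learner that stores values per slot machine gets no direct boost from reward in the different-machine case, which is why those trials are the diagnostic ones in the quoted passage.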
