How much of reinforcement learning is working memory, not reinforcement learning? A behavioral, computational, and neurogenetic analysis. Collins, Frank. European Journal of Neuroscience 2012.

  1. Uses an experiment specifically designed to tease apart contributions of working memory (WM) and the parts of the brain more traditionally associated with RL (“…corticostriatal circuitry and the dopaminergic system.”)
    1. “By systematically varying the size of the learning problem and delay between stimulus repetitions, we separately extracted WM-specific effects of load and delay on learning. “
  2. Propose a new model for the interaction of RL and WM
  3. “Incorporating capacity-limited WM into the model allowed us to capture behavioral variance that could not be captured in a pure RL framework even if we (implausibly) allowed separate RL systems for each set size.  The WM component also allowed for a more reasonable estimation of a single RL process.”
  4. Also some genetics work <although I will probably go light on it in this reading>
  5. “Activity and plasticity in striatal neurons, a major target of dopaminergic efferents, are dynamically sensitive to these dopaminergic prediction error signals, which enable the striatum to represent RL values (O’Doherty et al., 2004; Frank, 2005; Daw & Doya, 2006)”
  6. “Thus, although this remains an active research area and aficionados continue to debate some of the details, there is widespread general agreement that the basal ganglia (BG) and dopamine are critically involved in the implementation of RL.”
  7. Humans, at least, probably rely on more than simple prediction errors; we have the ability to do forward search, for example
  8. Genes controlling dopaminergic function may cause changes in behavior either after initial learning has occurred or during learning.  “Similarly, functional imaging studies have shown that dopaminergic drugs modulate striatal reward prediction error signals during learning, but that these striatal signals do not influence learning rates during acquisition itself; nevertheless, they are strongly predictive of subsequent choice indices measuring the extent to which learning was sensitive to probabilistic reward contingencies (Jocham et al., 2011).”
  9. Their experiments involve binary payoffs, document interaction of “…higher order WM and lower level RL components…” and show that accounting for WM can explain “…crucial aspects of behavioral variance…”
  10. “We further show that a genetic marker of prefrontal cortex function is associated with the WM capacity estimate of the model, whereas a genetic marker specific to BG function relates to the RL learning rate.”
  11. In the experiment rewards are binary, with a presented stimulus and 3 possible responses (a 3-armed contextual bandit)
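As a concrete sketch of that task structure (my own toy parameters, not the paper's): each stimulus has one hidden correct response out of three, reward is binary, and a plain delta-rule RL learner can already acquire the mappings:

```python
import random

# Hypothetical sketch of the task: each stimulus maps to one correct
# response out of 3; payoff is binary (1 if correct, 0 otherwise).
def run_block(set_size, n_trials, alpha=0.1, eps=0.05):
    correct = {s: random.randrange(3) for s in range(set_size)}   # hidden mapping
    q = {(s, a): 0.0 for s in range(set_size) for a in range(3)}  # RL values
    hits = 0
    for _ in range(n_trials):
        s = random.randrange(set_size)                  # present a stimulus
        if random.random() < eps:                       # epsilon-greedy choice
            a = random.randrange(3)
        else:
            a = max(range(3), key=lambda a: q[(s, a)])
        r = 1 if a == correct[s] else 0                 # binary payoff
        q[(s, a)] += alpha * (r - q[(s, a)])            # delta-rule update
        hits += r
    return hits / n_trials
```

With enough trials this learner ends up well above the 1/3 chance level, which is the baseline against which the WM contributions below are measured.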
  12. They did genetic evaluation of the subjects
  13. Subjects did well in the task, with the last couple of trials/block being at >94% accuracy, and learning generally stabilized in 10 or fewer samples
  14. Different learning episodes varied in the size of the set that needed to be learned.  They considered the impact of working memory in terms of load and delay: in terms of load, there may be a limit to the number of stimulus-response mappings that can be remembered; in terms of delay, they consider the case where information may be cycling through, so they look at the temporal spacing between repetitions of a stimulus
    1. Considered how behavior diverged from optimal <regret!> in terms of delay.  This means they consider only cases where the correct response had already been given for a given stimulus
  15. Turns out that people were more likely to respond in error if the same stimulus was presented twice in a row than if there was a short delay between them <seems like the opposite of a switch cost?>
    1. “This indicated that, when the same stimulus was presented twice in a row, subjects were more likely to make an error in the second trial after having just responded correctly to that stimulus for lower set sizes. This finding may reflect a lower degree of task engagement for easier blocks, leading to a slightly higher likelihood of attentional lapses.”
  16. Logistic regression done on the following variables: set size, delay since correct response to particular stimulus, total number of correct responses to stimulus
    1. Main effect of set size and correct repetitions
    2. Effect of delay was actually not a main effect, but it did interact with # correct repetitions
    3. “These results support the notion that, with higher set sizes as WM capacity was exceeded, subjects relied on more incremental RL, and less on delay-sensitive memory.”
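The regression above can be sketched as follows; the coefficient values are illustrative ones I made up to show the direction of the reported effects, not the paper's fitted estimates:

```python
import math

# Illustrative sketch (coefficients invented, not from the paper):
# P(correct) as a logistic function of the three predictors used in the
# paper's regression -- set size, delay since the last correct response
# to this stimulus, and number of prior correct repetitions -- plus the
# reported delay x repetitions interaction.
def p_correct(set_size, delay, n_correct,
              b0=1.0, b_set=-0.4, b_delay=-0.1, b_rep=0.6, b_inter=0.02):
    z = (b0 + b_set * set_size + b_delay * delay
         + b_rep * n_correct + b_inter * delay * n_correct)
    return 1.0 / (1.0 + math.exp(-z))    # logistic link
```

Under this sketch, accuracy falls as set size grows and rises with correct repetitions, matching the main effects they report.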
  17. The logistic regression captured performance pretty accurately (it mostly seems to have smoothed out the actual results in a good way), so it gives a reasonable means to determine the contributions of the various components to the final result
  18. Penalized increasingly complex models with Akaike’s information criterion
  19. The RL+WM model had the best fit: “Thus, accounting for both capacity-limited WM and RL provides a better fit of the data than either process on its own (i.e. pure WM or pure RL models). Importantly, there was no trade-off in estimated parameters between the two main parameters of interest: capacity and RL learning rate, as would be revealed by a negative correlation between them”
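A minimal sketch of the two ingredients here, assuming a simple mixture formulation (the parameter names and the load-scaling rule are mine, not the paper's exact model):

```python
import math

# 1) Hybrid choice rule: mix a WM policy with an RL policy, with the WM
#    weight scaled by how much of the set size fits within capacity.
def hybrid_choice_prob(p_rl, p_wm, capacity, set_size, rho=0.9):
    """Probability of an action under an RL+WM mixture (toy form)."""
    w = rho * min(1.0, capacity / set_size)  # WM contributes less at high load
    return w * p_wm + (1.0 - w) * p_rl

# 2) Model comparison: Akaike's information criterion (lower is better),
#    which penalizes each extra free parameter.
def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood
```

The point of the mixture is that at set sizes beyond capacity the choice probability collapses toward the pure RL policy, which is exactly the pattern the regression results above suggested.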
  20. The capacity of WM estimated by the models was 3.7 +/- 0.14, which <kind of> agrees with standard estimates
  21. <Basically skipping genetics part>
  22. ” The vast majority of neuroscientific studies of RL have focused on mechanisms underlying the class of ‘model-free’ RL algorithms; they capture the incremental learning of values associated with states and actions, without considering the extent to which the subject … can explicitly plan which actions to make based on their knowledge about the structure of the environment.”
  23. “It is clear that one such computational cost [of doing forward-search] is the capacity limitation of WM, which would be required to maintain a set of if⁄then relationships in mind in order to plan effectively … This secondary process is often attributed to the prefrontal cortex and its involvement in WM.”
  24. Here they show that WM has implications for even the simplest sorts of behavioral tasks, which models almost never account for
  25. When standard model-free RL algorithms are used to model behavior, their parameters naturally need to be searched over to fit the actual data.  Results here, however, show that the fitted parameters are really (partially) adjusting for capacity limits in WM, as opposed to capturing what they actually mean in the algorithm.   They are “… no longer estimates of the intended RL processes and are therefore misleading. Even when separate RL parameters were estimated for each set size in our experiment (an implausible, non-parsimonious model with 10 parameters), it did not provide as good a fit to the data as did our simpler hybrid model estimating WM contributions together with a simple process.”
  26. “The experimental protocol allowed us to determine that variance in behavior was explained separately by two characteristics of WM: its capacity and its stability. Indeed, behaviorally, learning was slower in problems with greater load, but there were minimal differences in asymptotic performance. Furthermore, although performance was initially highly subject to degradation due to the delay since the presented stimulus was last observed, this delay effect disappeared over learning.”
    1. Eventually the RL system supersedes the WM system
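A toy illustration of how the delay effect can disappear over learning under a decaying-WM + RL mixture (my own parameterization, not the paper's equations):

```python
import math

# Toy model: choice mixes a delay-sensitive WM trace with a
# delay-insensitive RL value, and reliance on WM fades as correct
# repetitions accumulate (all constants are invented for illustration).
def mixture_p_correct(delay, n_correct, decay=0.4, w0=0.8, k=0.6, q_final=0.95):
    w = w0 * math.exp(-k * n_correct)              # WM weight shrinks with learning
    wm = math.exp(-decay * delay)                  # WM trace decays with delay
    q = q_final * (1 - math.exp(-k * n_correct))   # RL value grows with learning
    return w * wm + (1 - w) * q

# Delay sensitivity = accuracy gap between short (1) and long (5) delays.
early = mixture_p_correct(1, 0) - mixture_p_correct(5, 0)   # big gap early
late = mixture_p_correct(1, 8) - mixture_p_correct(5, 8)    # tiny gap late
```

Early in learning the gap between short and long delays is large; after many correct repetitions the RL term dominates and the gap nearly vanishes, matching the reported disappearance of the delay effect.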
  27. Results here show that some components of commonly conducted RL experiments may be producing unanticipated influences on the results; there are also implications for whether the policy is studied during learning or only after learning has occurred
