Reinforcement Learning on Slow Features of High-Dimensional Input Streams. Legenstein, Wilbert, Wiskott. PLoS Computational Biology 2010.

  1. It has been shown that the activity of dopaminergic neurons in the ventral tegmental area is related to the reward-prediction error… These neurons in turn have dense diffuse projections to several important areas including the striatum. In the striatum it was shown that dopamine influences synaptic plasticity
  2. Due to the curse of dimensionality, doing RL natively in the full sensory space we work in is probably not a reasonable explanation for how we are able to learn
  3. “The autonomous extraction of relevant features in the nervous system is commonly attributed to neocortex. The way how neocortex extracts features from the sensory input is still unknown and a matter of debate.”
    1. Some ideas are PCA, ICA, or slow feature analysis (SFA)
  4. SFA is attractive because it can be stacked hierarchically
    1. “(we note however that the characteristic recurrent organization of cortex where multiple loops provide feedback from higher-level to lower-level processing [in the visual system] is not yet exploited in hierarchical SFA architectures)”
  5. “Furthermore, the features that emerge from SFA have been shown to resemble the stimulus tunings of neurons both at low and high levels of sensory representation such as various types of complex cells in the visual system […] as well as hippocampal place cells, head-direction cells, and spatial-view cells […].”
  6. “Unsupervised learning based on the slowness principle (i.e., learning that exploits temporal continuity of real-world stimuli) has recently attracted the attention of experimentalists […]. It was shown in monkey experiments, that features in monkey inferotemporal cortex are adapted in a way that is consistent with the slowness principle […].”
  7. In the experiments here, they take a 24,025-dimensional input and reduce it to 64 dimensions (or fewer)
  8. Two experiments, both involving a fish moving in 2D
  9. In the first phase, training occurs on <passive?> dynamics
  10. Use hierarchical SFA in order to deal with the high-dimensionality of the data
  11. In the “three variables task,” there is a fish, a goal (a shaded circle), and a distractor (a d-pad shape). In the other experiment, the type of fish changes, and the distractor (now the circle) and the goal (now the d-pad) are switched
  12. Noise is added to prevent singularities in the SFA step
  13. Basis functions are degree-2 polynomials
  14. <Oddly> the size of each successive stage in the hierarchy goes from 32 to 42 to 52
  15. Ultimately, in the RL task the 32 slowest remaining features were output
  16. “… the hierarchical organization of the model captures two important aspects of cortical visual processing: increasing receptive field sizes and accumulating computational power at higher layers. The latter is due to the quadratic expansion in each layer, so that each layer computes a subset of higher polynomials than its predecessor.”
  17. “The network layers were trained sequentially from bottom to top. We used 50,000 time points for the training of the two lower layers and 200,000 for the two top layers. These training sequences were generated with a random walk procedure…”
  18. For the RL part, they implemented Q-learning and a policy-gradient method
  19. “We implemented the neural version of Q-learning from [35] where the Q-function is represented by a small ensemble of neurons and parametrized by the connection weights from the inputs to these neurons. The system learns by adaptation of the Q-function via the network weights. In the implementation used in this article, this is achieved by a local synaptic learning rule at the synapses of the neurons in the neuron ensemble.”
    1. <Guarantee hacks galore are needed to get this to work in this experiment>
  20. The Q-function in the ANN is represented by 360 linear neurons, each geared toward one degree of orientation.
    1. <Exactly how they work this out with the ANN is a little unclear>
  21. Most theoretical studies of such biologically plausible policy-gradient learning algorithms are based on point-neuron models where synaptic inputs are weighted by the synaptic efficacies to obtain the membrane voltage. The output yi(t) of the neuron i is then essentially obtained by the application of a nonlinear function to the membrane voltage.
  22. <skipping most details of the policy gradient implementation>
  23. The RL algorithms controlled speed and angular velocity
  24. ANN QL learns the task (without the obstacle/distractor) in about 40 episodes (not sure how long that is)
  25. With the distractor, 32 slow features are used
  26. The performance of slow features vs the natural encoding you would do by hand is pretty close
  27. “Since the outputs of the SFA network are naturally ordered by their slowness one can pick only the first n outputs and train the reinforcement learning network on those.”
  28. When dimension reduction was done via PCA instead of SFA, performance suffered greatly
  29. Relationship of policy gradient to Hebbian learning rule<?>
  30. Given the high-dimensional visual encoding of the state-space accessible to the learning system, it is practically impossible that any direct reinforcement learning approach is able to solve the variable-targets task directly on the visually-induced state-space.
    1. <It's been done: “Stable Function Approximation in Dynamic Programming” by Gordon.>
  31. “Another interesting possibility not pursued in this paper would be to sparsify the SFA output by ICA. This has led to place-cell like behavior in […] and might be beneficial for subsequent reward-based learning.”
  32. Also mentions information bottleneck
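
For reference, a minimal sketch of single-layer SFA with a degree-2 polynomial expansion (items 12, 13, and 27 above). It is not the authors' hierarchical network and does not use the fish images: the toy two-channel signal, the noise level, and the names `quadratic_expand` and `sfa` are all assumptions made for illustration. The idea is to expand the input, whiten it, and then keep the directions whose temporal differences have the smallest variance, which come out ordered from slowest to fastest.

```python
import numpy as np

def quadratic_expand(x):
    """Return the linear terms plus all degree-2 monomials x_i * x_j (i <= j)."""
    n = x.shape[1]
    i, j = np.triu_indices(n)
    return np.hstack([x, x[:, i] * x[:, j]])

def sfa(x, n_out, noise_std=1e-6, seed=0):
    """Illustrative quadratic SFA: returns the n_out slowest features and their slowness values."""
    rng = np.random.default_rng(seed)
    z = quadratic_expand(x)
    # Small additive noise (assumed level) keeps the covariance from being singular.
    z = z + noise_std * rng.standard_normal(z.shape)
    z = z - z.mean(axis=0)

    # Whiten: rotate and scale so the expanded signal has identity covariance.
    cov = z.T @ z / len(z)
    evals, evecs = np.linalg.eigh(cov)
    keep = evals > 1e-10 * evals.max()
    zw = z @ (evecs[:, keep] / np.sqrt(evals[keep]))

    # Eigen-decompose the covariance of the temporal differences.
    # eigh returns eigenvalues in ascending order, so the slowest direction is first.
    dz = np.diff(zw, axis=0)
    d_evals, d_evecs = np.linalg.eigh(dz.T @ dz / len(dz))
    return zw @ d_evecs[:, :n_out], d_evals[:n_out]

# Toy input: a slow sine hidden nonlinearly in two fast-varying channels.
t = np.linspace(0, 2 * np.pi, 5000)
x = np.column_stack([np.sin(t) + np.cos(11 * t) ** 2, np.cos(11 * t)])

# Keep only the first n outputs, ordered by slowness.
slow, deltas = sfa(x, n_out=2)
corr = np.corrcoef(slow[:, 0], np.sin(t))[0, 1]
print("correlation of slowest feature with sin(t):", abs(corr))
```

Here the slow sine equals the first channel minus the square of the second, so it lies inside the quadratic function space, and the slowest output recovers it up to sign and scale. Replacing the difference covariance with the plain covariance and keeping the largest-variance directions would give PCA instead, which has no reason to single out the slow variable.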
