A Theoretical Basis for Emergent Pattern Discrimination in Neural Systems Through Slow Feature Extraction. Klampfl, Maass. Neural Computation 2010.

  1. Shows equivalence of slow feature analysis to Fisher linear discriminant
  2. “We demonstrate the power of this principle by showing that it enables readout neurons from simulated cortical microcircuits to learn without any supervision to discriminate between spoken digits and to detect repeated firing patterns that are embedded into a stream of noise spike trains with the same firing statistics.  Both these computer simulations and our theoretical analysis show that slow feature extraction enables neurons to extract and collect information that is spread out over a trajectory of firing states that last several hundred ms.”
  3. Also can train neurons to keep track of time themselves
  4. Much of the learning the brain does is unsupervised.  SFA works in that scenario
    1. Intuition is that things that occur nearby in time are probably driven by the same underlying cause
  5. Perceptual studies show that slowness can form “…position- and view-invariant representations of visual objects in higher cortical areas.”
    1. Images were altered during blindness caused by a saccade, and the perception was that the two merged
    2. The authors of a related work <Li and DiCarlo – DiCarlo seems to be the common element of that work> say “unsupervised temporal slowness learning may reflect the mechanism by which the visual stream builds and maintains object representations.”
    3. Studies of IT in monkeys show similar evidence
  6. This work builds on the theoretical basis of how slowness can be responsible for such effects
  7. Work by Sprekeler, Michaelis, Wiskott (07) also shows SFA to be equivalent to PCA of a low-pass filtered signal
    1. “In addition, they have shown that an experimentally supported synaptic plasticity rule, spike-timing-dependent plasticity (STDP), could in principle enable spiking neurons to learn this processing step without supervision, provided that the presynaptic inputs are processed in a suitable manner.”
  8. “… we show that SFA approximates the discrimination capability of the FLD in the sense that both methods yield the same projection direction, which can be interpreted as a separating hyperplane in the input space.”
  9. Intuitions drawn from SFA may help explain some phenomena that are sometimes not so well covered by “… the classical view of coding and computation in the brain, which was based on the assumption that external stimuli and internal memory items are encoded by firing states of neurons, which assign a certain firing rate to a number of neurons maintained for some time interval.”
    1. In particular, stimuli and memory items appear instead to be encoded by trajectories of firing states lasting several hundred ms.
    2. Such neural trajectories have been found in response to static stimuli (such as odors and tastes) as well as stimuli that are naturally time-varying (such as auditory or visual stimuli)
    3. Sequential firing of neurons also shows up in traces of episodic memory in hippocampus and cortex
  10. Given that neural activity is temporally dispersed in this way, perhaps SFA has something to say about how the brain actually uses that activity
  11. <There are also simulated neural results, but details are too complex to outline here so will address those points when I get to that section fully>
  12. A nice result of the neural experiments, and general properties of SFA is that when used over temporal data it behaves in an anytime sense (always producing whatever slow signals are most appropriate), so items that depend on its outputs may begin to formulate responses quickly
    1. This is especially important when set up in a hierarchical manner
  13. For the FLD part, consider linear basis functions only
  14. For classification, SFA has an additional parameter p, which dictates the probability of showing items of different classes back-to-back (needs probability of seeing in-class temporal pairs to be better than chance, but seems to be surprisingly robust to values anywhere in that range)
  15. In case where eigenvalues are nonzero <which should be the case?> the SFA objective and FLD are equivalent
  16. SFA/FLD can also be extended to the multiclass setting
  17. “The last line of equation 2.19 is just the formulation of FLD as a generalized eigenvalue problem… More precisely, the eigenvectors of the SFA problem are also eigenvectors of the FLD problem; the C-1 [where C is the number of classes] slowest features extracted by SFA applied to the time series x_t span the subspace that optimizes separability in terms of the FLD… The slowest feature… is the weight vector that achieves maximal separation…”
  18. When used as a classifier, the optimal responses should look like a step function
  19. They then move on to the case where the trajectory provided to SFA is not made of isolated points, but rather sub-trajectories assembled into a larger trajectory
  20. The idea is to model what happens when sequences of neural firing occur, as is the case when episodic experience is reviewed mentally
  21. <Still a little unsure of the exact formulation of the input in this section>
  22. In an example where class means are similar, FLD produces the “right” linear separator, while SFA chooses one that is basically degenerate (each half-space consists about 50-50 of points from both classes, even though it is linearly separable)
  23. In this case, SFA and FLD produce different results, both analytically as well as empirically
    1. This is because “… the temporal correlations induced by the use of trajectories have an effect on the covariance of temporal differences in equation 2.24 compared to equation 2.12”
  24. <Again for trajectory of trajectories case> “… even for a small value of p, the objective of SFA cannot be solely reduced to the FLD objective, but rather there is a trade-off between the tendency to separate trajectories of different classes… and the tendency to produce smooth responses during individual trajectories…”
  25. In the vanilla SFA classification construction, SFA should really see transition between all pairs of points in the same class equally often; when trajectories are used this starts to fall apart because most of the time-pairs are within a single instance as opposed to between two items in the same class; as the sub-trajectories become long they dominate the features constructed so the ability to do classification is lost
  26. Linear separators aren’t strong, but as dimension increases linear separation becomes increasingly feasible <although I would also argue that as dimension increases risk of overfitting comes along with it>
    1. Not only does separability increase, margin does as well
    2. “In other words, a linear readout neuron with d presynaptic units can separate almost any pair of trajectories, each defined by connecting fewer than d randomly drawn points.”
  27. Now moves onto “… SFA as a possible mechanism for training readouts of a biological microcircuit.”
  28. Based on previous discussion, when data points themselves are trajectories, each slow feature by itself won’t classify data (as it does in the vanilla case): “… we predict that the class information will be distributed over multiple slow features.”
  29. A high-order slow feature is one that is fast
  30. Now getting on to their simulated neural results.  Not sure how SFA ties in exactly yet
  31. They run SFA on the output of the neural circuit
  32. “We then trained linear SFA readouts on the 560-dimensional circuit trajectories, defined as the low-pass filtered spike trains of the spike response of all 560 neurons of the circuit…”
    1. The signal passed in consists of a sequence of static input, with Poisson noise in between
    2. At first glance, the slow features don’t seem to respond much to the distinction between signal/noise
    3. But on average, the first 2 slow features don’t respond at all to the noise, while responding with a noticeable pattern to the signal.
  33. “One interesting property of this setup is that if we apply SFA directly on the stimulus trajectories, we basically achieve the same result [as they do when running on the neural responses].  In fact, the application to the circuit is the harder task because of the variability of the response to repeated presentations of the same pattern and because of temporal integration: the circuit integrates input over time, making the response during a pattern dependent on the noise input immediately before the start of the pattern.”  Noise in the circuit takes a little while to die down after the static signal is introduced, which makes it harder to pick up the static signal
    1. <I don’t really get this, because earlier in the paper they said that SFA is bad for classifying items that manifest as a time-series.  I thought the point of using a neural circuit was to try and circumvent that issue, but here they are saying it works better without the neural circuit?>
  34. Now moving on to recognition of spoken digits
  35. “We preprocessed the raw audio files with a model of the cochlea (Lyon, 1982) and converted the resulting analog cochleagrams into spike trains that serve as input to our microcircuit model (see section B.3.2 for details).”
  36. Classification between digit utterances of words “one” and “two”
  37. <Not grokking their training methodology for this section – not so clearly written and I probably need to eat lunch.  They mention training two different times but I’m not clear on what the purpose of this distinction is>
  38. “We found that the two slowest features, y_1 and y_2, responded with shapes similar to half sine waves during the presence of a trajectory… which is in fact the slowest possible response under the unit variance constraint.  Higher-order features partly consisted of full sine wave responses, which are the slowest possible responses under the additional constraint to be decorrelated to previous slow features.”
  39. <Immediately following in next P> “In this example, the slowest feature y_1 already extracts the class of the input patterns almost perfectly: it responds with positive values for trajectories in response to utterances of digit 2 and with negative values for trajectories of digit 1 and generalizes this behavior to unseen test examples.”
  40. The first slow feature is closest to the FLD
  41. The first slow feature encodes what (digit), while the second slow feature encodes where (corresponding to a position in the trajectory identified by SF1).  This has been found with other applications of SFA (Wiskott Sejnowski 02).  Other faster features encode a mixture of the two
  42. A linear classifier on the SFAs (itself with a linear kernel) is very effective (98%)
  43. Training FLDs and SVMs on the same data as SFA results in poorer performance <Hm.  SVMs are pretty powerful – wonder why this result is what they have>
  44. They then go on to the same classification problem but with more data: “Due to the increased number of different samples for each class (for each speaker, there are now 10 different digits), this task is more difficult than the speaker-independent digit recognition.”
    1. “No single slow feature extracts What-information alone; the closest feature to the FLD is feature y_3.  To some extent also, y_4 extracts discriminative information about the stimulus.”
    2. “In such a situation where the distance between the class means is very small, the tendency to extract the trajectory class itself as a slow feature becomes negligible. In that case, the theory predicts that SFA tries to distinguish each individual trajectory due to the decorrelation …”
  45. In this situation, SFA can be used as a preprocessor, but is not so useful as a classifier itself
  46. In some sense SFA is poorly suited to direct application on neural bursting information because such activity is inherently non-slow <Here maybe they low-pass filter it first?>
  47. When trying to do classification of time-series (sequences), “… the optimization problem of SFA can be viewed as a composition of two effects: the tendency to extract the trajectory class as a slow feature and the tendency to produce a smooth response during individual trajectories.”
  48. “In the context of biologically realistic neural circuits, this ability of an unsupervised learning mechanism is of particular interest because it could enable readout neurons, which typically receive inputs from a large number of presynaptic neurons of the circuit, to extract from the trajectory of network states information about the stimulus that has caused this particular sequence of states–without any ‘teacher’ or reward.”
  49. Earlier work by Berkes showed similar results between SFA and FLD for handwritten digit recognition, but the two were not formally linked in that work
  50. <An excellent paper with tons of good references – worth rereading.>
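
The SFA/FLD relationship in the notes above (isolated points, switching class with probability p) can be sketched numerically. This is only a toy illustration under my own assumptions, not the paper's setup: the data are two synthetic Gaussian classes, p = 0.2 is an arbitrary choice, and the helper names (linear_sfa, fld_direction, class_time_series) are hypothetical. It builds a time series that switches class with probability p at each step, runs linear SFA on it, and compares the slowest SFA weight vector with the two-class FLD direction.

```python
import numpy as np

def linear_sfa(x, n_features=1):
    """Linear SFA: unit-variance projections whose temporal differences have minimal variance."""
    x = x - x.mean(axis=0)
    cov = x.T @ x / len(x)                  # covariance of the signal
    dx = np.diff(x, axis=0)
    dcov = dx.T @ dx / len(dx)              # covariance of temporal differences
    evals, evecs = np.linalg.eigh(cov)
    whiten = evecs / np.sqrt(evals)         # scale each eigenvector column to whiten
    slow_vals, slow_vecs = np.linalg.eigh(whiten.T @ dcov @ whiten)  # ascending order
    return whiten @ slow_vecs[:, :n_features], slow_vals[:n_features]

def fld_direction(x, labels):
    """Two-class Fisher discriminant: w proportional to S_W^{-1} (mu_1 - mu_0)."""
    s_w = np.zeros((x.shape[1], x.shape[1]))
    for c in (0, 1):
        centered = x[labels == c] - x[labels == c].mean(axis=0)
        s_w += centered.T @ centered        # within-class scatter
    diff = x[labels == 1].mean(axis=0) - x[labels == 0].mean(axis=0)
    return np.linalg.solve(s_w, diff)

def class_time_series(x, labels, p, length, rng):
    """Switch class with probability p at each step, then draw a random point of the current class."""
    idx = [np.flatnonzero(labels == c) for c in (0, 1)]
    out = np.empty((length, x.shape[1]))
    c = 0
    for t in range(length):
        if rng.random() < p:
            c = 1 - c
        out[t] = x[idx[c][rng.integers(len(idx[c]))]]
    return out

# Toy data (an assumption for illustration): two Gaussian classes with
# unequal per-axis variances, so the FLD direction differs from the mean difference.
rng = np.random.default_rng(0)
scales = np.array([1.0, 2.0, 0.5, 1.5, 1.0])
shift = np.array([2.0, 2.0, 0.0, 0.0, 0.0])
points = np.vstack([rng.normal(size=(200, 5)) * scales,
                    rng.normal(size=(200, 5)) * scales + shift])
labels = np.repeat([0, 1], 200)

series = class_time_series(points, labels, p=0.2, length=40000, rng=rng)
w_sfa, slowness = linear_sfa(series)
w_fld = fld_direction(points, labels)

# The sign of an SFA feature is arbitrary, so compare directions by absolute cosine.
cosine = abs(w_sfa[:, 0] @ w_fld) / (np.linalg.norm(w_sfa[:, 0]) * np.linalg.norm(w_fld))
print(f"|cos(SFA, FLD)| = {cosine:.3f}, slowest-feature slowness = {slowness[0]:.3f}")
```

If the equivalence holds, the printed cosine should be close to 1. Raising p toward 0.5 in this sketch should weaken the agreement, in line with the note that in-class temporal pairs need to be more likely than chance. Replacing the isolated points with sub-trajectories would be the natural next experiment for the case where SFA and FLD diverge.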
