Read this because the paper is often pointed to as evidence of slow features doing object recognition, but there is actually very little mention of SFA.
- Considers object recognition, and “…show that the primate ventral visual processing stream achieves a particularly effective solution in which single-neuron invariance is not the goal.”
- “Because brains compute with neurons, the subject must have neurons somewhere in its nervous system — ‘read-out’ neurons — that can successfully report if object A was present […].”
- “The central issues are: what is the format of the representation used to support the decision (the substrate on which the decision functions operate); and what kinds of decision functions (i.e. read-out tools) are applied to that representation?”
- “… we treat object recognition fundamentally as a problem of data representation and re-representation, and we use simple decision functions (linear classifiers) to examine those representations.”
- Object recognition is fundamentally difficult, because “… each image projected into the eye is one point in an ~1 million dimensional retinal ganglion cell representation (…).”
- Manifolds: “… all the possible retinal images that face could ever produce… form a continuous, low-dimensional, curved surface inside the retinal image space called an object ‘manifold’…”
- Consider two potential image manifolds that “… do not cross or superimpose — they are like two sheets of paper crumpled together… We argue that this describes the computational crux of ‘everyday’ recognition: the problem is typically not a lack of information or noisy information, but that the information is badly formatted in the retinal representation — it is tangled (…).”
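The tangling idea can be illustrated with a toy example in two dimensions (my own sketch, not from the paper): two ring-shaped “object manifolds” that no linear readout of the raw coordinates can separate, while a simple re-representation (here, radius) makes separation trivial. All names and the choice of rings are hypothetical illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)

# Two toy "object manifolds": identity-preserving variation (here, angle)
# traces out a closed curve in the 2-D "retinal" space for each object.
obj_a = np.c_[1.0 * np.cos(theta), 1.0 * np.sin(theta)]   # radius-1 ring
obj_b = np.c_[2.0 * np.cos(theta), 2.0 * np.sin(theta)]   # radius-2 ring

def threshold_accuracy(scores_a, scores_b):
    """Best accuracy of any single threshold on a 1-D projection."""
    scores = np.r_[scores_a, scores_b]
    labels = np.r_[np.zeros(len(scores_a)), np.ones(len(scores_b))]
    best = 0.0
    for t in scores:
        pred = scores > t
        best = max(best, np.mean(pred == labels), np.mean(pred != labels))
    return best

# A linear readout of the raw coordinates cannot untangle the rings:
# every projection direction mixes the two classes.
w = rng.normal(size=2)
acc_raw = threshold_accuracy(obj_a @ w, obj_b @ w)

# A re-representation (radius) flattens the manifolds: trivially separable.
acc_rerep = threshold_accuracy(np.linalg.norm(obj_a, axis=1),
                               np.linalg.norm(obj_b, axis=1))
print(acc_raw, acc_rerep)   # acc_rerep reaches 1.0; acc_raw stays well below
```

The rings are the “two sheets of paper crumpled together”: neither is noisy or information-poor, the information is just badly formatted for a linear decision function.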
- “One way of viewing the overarching goal of the brain’s object recognition machinery, then, is as a transformation from visual representations that are easy to build (e.g. center-surround filters in the retina), but are not easily decoded … into representations that we do not yet know how to build (e.g. representations in IT), but are easily decoded…”
- “… single neurons at the highest level of the monkey ventral visual stream — the IT cortex — display spiking responses that are probably useful for object recognition. Specifically, many individual IT neurons respond selectively to particular classes of objects, such as faces or other complex shapes, yet show some tolerance to changes in object position, size, pose, and illumination, and low-level shape cues.”
- Using <what I suppose are actual> neural responses from 200 neurons in IT (the highest stage of the ventral visual stream), simple linear classifiers on population activity were robust to variation in object position and size. More sophisticated classifiers didn’t improve performance much, and classifiers were ineffective when applied directly to V1 activity.
- V1 doesn’t allow for simple separation and identification of objects but IT does.
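The readout result can be sketched with synthetic data (my stand-in, not the paper’s recordings): a hypothetical “IT-like” population whose units carry some tolerant object selectivity versus a “V1-like” population whose responses here carry no identity information, both read out by the same simple least-squares linear classifier.

```python
import numpy as np

rng = np.random.default_rng(1)
n_trials, n_neurons = 400, 50
labels = rng.integers(0, 2, n_trials)            # object A vs object B

# Hypothetical "IT-like" population: each neuron's rate carries some object
# selectivity that survives the simulated identity-preserving noise.
selectivity = rng.normal(size=n_neurons)
it_resp = (np.outer(labels - 0.5, selectivity)
           + 0.5 * rng.normal(size=(n_trials, n_neurons)))

# Hypothetical "V1-like" population: responses reflect nuisance variation
# only, so no linear readout can recover object identity.
v1_resp = rng.normal(size=(n_trials, n_neurons))

def linear_readout_accuracy(X, y):
    """Least-squares linear classifier, evaluated on held-out trials."""
    Xb = np.c_[X, np.ones(len(X))]               # add a bias column
    train, test = slice(0, 200), slice(200, None)
    w, *_ = np.linalg.lstsq(Xb[train], 2 * y[train] - 1, rcond=None)
    return np.mean((Xb[test] @ w > 0) == y[test])

print(linear_readout_accuracy(it_resp, labels))  # near 1.0
print(linear_readout_accuracy(v1_resp, labels))  # near chance (0.5)
```

The point mirrors the paper’s finding: the same simple decision function succeeds or fails depending entirely on the format of the representation it is applied to.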
- The classical approach to understanding visual processing is to consider the action of individual neurons; here, they instead consider the activity of entire populations in representing visual images
- According to this perspective, the best way to build object representations is to find a way to separate manifolds
- Because populations are considered, it is not the case that a single neuron must represent things (although if a single neuron can split the manifolds, that is great). Because “… real IT neurons are not, for example, position and size invariant, in that they have limited spatial receptive fields <as opposed to the wide spatial receptive fields that would be required if single neurons were actually responsible for representing invariances> […]. It is now easy to see that this ‘limitation’ <of relying on populations of neurons as opposed to single neurons> is an advantage.”
- Manifold flattening may be hard wired, but it may also be the case that the means to do this is learned from the statistics of natural images. <The latter must play at least part of the role>
- Progressive flattening by increasing depth in visual system
- “… we believe that the most fruitful computational algorithms will be those that a visual system (natural or artificial) could apply locally and iteratively at each cortical processing stage (…) in a largely unsupervised manner (…) and that achieve some local object manifold flattening.”
- In the actual visual system, each layer projects information into a space of increasing dimension (“… e.g. ~100 times more V1 neurons than retinal ganglion neurons …”). If the projection is good, projecting data into higher dimensions makes simpler methods of analysis and classification more effective
- Additionally, at each stage the distribution of activity is “… allocated in a way that matches the distribution of visual information encountered in the real world …”, which means the projection into higher dimensions actually makes reasonable use of the additional space
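The dimensionality-expansion point can be sketched with a classic toy (my analogy, not the paper’s): an XOR-style problem where a linear readout fails in the raw 2-D space, but after a random nonlinear fan-out to many units, loosely analogous to the retina-to-V1 expansion, the same simple linear readout succeeds. The random-ReLU features are an assumption for illustration, not a model of V1.

```python
import numpy as np

rng = np.random.default_rng(2)

# XOR-style problem: class = sign of x1*x2, not linearly separable in 2-D.
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)

def lstsq_accuracy(Z, y):
    """Least-squares linear readout, accuracy on held-out points."""
    Zb = np.c_[Z, np.ones(len(Z))]
    w, *_ = np.linalg.lstsq(Zb[:300], 2 * y[:300] - 1, rcond=None)
    return np.mean((Zb[300:] @ w > 0) == y[300:])

# Random nonlinear projection into a much higher-dimensional space,
# then the SAME simple linear readout as before.
W = rng.normal(size=(2, 200))
Z = np.maximum(X @ W + rng.normal(size=200), 0)   # random ReLU features

print(lstsq_accuracy(X, y))   # near chance in the raw 2-D space
print(lstsq_accuracy(Z, y))   # much better after the expansion
```

This is the sense in which a good high-dimensional projection “makes simpler methods of analysis and classification more effective”: the decision function stays linear; only the representation changes.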
- The addition that temporal information helps untangle manifolds: “In the language of object tangling, this is equivalent to saying that temporal image evolution spells out the degrees of freedom of object manifolds. The ventral stream might use this temporal evolution to achieve progressive flattening of object manifolds across neuronal processing stages. Indeed, recent studies in our laboratory … have begun to connect this computational idea with biological vision, showing that invariant object recognition can be predictably manipulated by the temporal statistics of the environment.”
- This is where SFA is cited
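Since SFA is what these notes were chasing, here is a minimal linear-SFA sketch on synthetic data (my own, following the standard whiten-then-minimize-derivative-variance recipe; it is not code from the paper, and the signals are invented): temporal slowness alone recovers a slow latent from a linear mixture.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0, 10 * np.pi, 2000)

# Toy observation: a slow latent (think: object identity/pose drifting over
# time) mixed with a fast nuisance via a random linear mixing.
slow = np.sin(0.1 * t)
fast = np.sin(11.0 * t + 0.5)
X = np.c_[slow, fast] @ rng.normal(size=(2, 2))

# Linear SFA: whiten the data, then take the direction whose temporal
# derivative has the smallest variance.
Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(Xc.T @ Xc / len(Xc))
white = Xc @ evecs / np.sqrt(evals)        # unit-variance, decorrelated
d = np.diff(white, axis=0)                 # temporal derivatives
_, dvecs = np.linalg.eigh(d.T @ d / len(d))
slow_feature = white @ dvecs[:, 0]         # slowest direction (eigh sorts
                                           # eigenvalues ascending)

# The extracted feature should track the slow latent up to sign/scale.
corr = abs(np.corrcoef(slow_feature, slow)[0, 1])
print(corr)   # close to 1
```

This is the computational content behind “temporal image evolution spells out the degrees of freedom of object manifolds”: features that change slowly in time tend to be the identity-preserving directions, which is exactly what a flattening stage should extract.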