- “Here, we use the information bottleneck method to state an information-theoretic objective function for temporally local predictive coding. We then show that the linear case of SFA can be interpreted as a variant of predictive coding that maximizes the mutual information between the current output of the system and the input signal at the next time step.”
- “One approach for the self-organized formation of invariant representation is based on the observation that objects are unlikely to change or disappear completely from one moment to the next.”
- SFA reproduces some invariances in the visual system as well as “… properties of complex cells in primary visual cortex (…).”
- Including place cells

- Predictive coding essentially means building an internal model of the environment
- A big part of this biologically is “redundancy reduction”

- This paper sets up an information-theoretic framework for predictive coding
- “We focus on gaussian input signals and linear mapping. In this case, the optimization problem underlying the information bottleneck can be reduced to an eigenvalue problem (…). We show that the solution to this problem is similar to linear slow feature analysis, thereby providing a link between the learning principles of slowness and predictive coding.”
- In the information bottleneck, “One seeks to capture those components of a random variable *X* that can explain observed values of another variable *R*. This task is achieved by compressing the variable *X* into its compressed representation *Y* while preserving as much information as possible about *R*. The trade-off between these two targets is controlled by the trade-off parameter β.”
- The problem is solved by minimizing a Lagrangian, where the first term minimizes the complexity of the mapping and the second maximizes the accuracy of the representation
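Written out in the standard IB form (my notation, following the usual information-bottleneck formulation; not quoted from the paper), the Lagrangian is:

```latex
\min_{p(y \mid x)} \; \mathcal{L}\left[p(y \mid x)\right] \;=\; I(X;Y) \;-\; \beta \, I(Y;R)
```

The first term is the complexity of the mapping (how much of *X* the compression *Y* retains); the second, scaled by β, is the accuracy (how much *Y* says about the relevance variable *R*).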
- “From the point of view of clustering, the information bottleneck method finds a quantization, or partition, of *X* that preserves as much mutual information as possible about *R*.”
- IB has been used for document clustering, neural code analysis, gene expression analysis, and extraction of speech features.
- “In particular, in case of a linear mapping between gaussian variables, the optimal functions are the solution of an eigenvalue problem (…).”
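The linear-Gaussian eigenvalue problem can be sketched numerically. This follows the Gaussian-IB result of Chechik et al. (2005); the toy data-generating setup and all variable names below are my own illustration, not the paper's.

```python
# Numerical sketch of the linear-Gaussian IB eigenvalue problem
# (after Chechik et al., 2005). Toy setup; names are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Jointly Gaussian input X and relevance variable R: R depends on X linearly
n, d_x, d_r = 100_000, 4, 2
X = rng.standard_normal((n, d_x))
W = rng.standard_normal((d_r, d_x))
R = X @ W.T + 0.5 * rng.standard_normal((n, d_r))

Sxx = np.cov(X, rowvar=False)
Srr = np.cov(R, rowvar=False)
Sxr = (X - X.mean(0)).T @ (R - R.mean(0)) / (n - 1)

# Covariance of X conditioned on R
Sx_given_r = Sxx - Sxr @ np.linalg.solve(Srr, Sxr.T)

# The optimal linear IB projections come from the eigen-decomposition of
# Sx_given_r @ inv(Sxx); SMALL eigenvalues mark the components most
# informative about R, and (as I recall the result) each component switches
# on as beta crosses a critical value 1 / (1 - lambda).
lam = np.sort(np.linalg.eigvals(Sx_given_r @ np.linalg.inv(Sxx)).real)
print(lam)  # all eigenvalues lie in [0, 1]
```

With `d_r < d_x`, some directions of *X* are unconstrained by *R* and get eigenvalue ≈ 1 (uninformative), while the directions *R* actually depends on get small eigenvalues.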
- “SFA is guaranteed to find the slowest features first, whereas TLPC [temporally local predictive coding] finds the most predictive components first. For example, a very fast component can be very predictive, for example, if the value at *t* + 1 is the negative of the current value (…). Hence, from the TLPC point of view, the absolute deviation from random fluctuations rather than slowness is relevant.” Although this distinction is only really relevant for discrete data with fine time resolution; in continuous spaces the two should be much more closely related.
- The relationship between the eigenvalues is λ_i^{TLPC} = λ_i^{SFA} − ¼ (λ_i^{SFA})²
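The “fast but predictive” example can be checked numerically. The simulation below is my own sketch: for a zero-mean, unit-variance discrete signal with one-step autocorrelation r, the slowness value is E[(x(t+1) − x(t))²] = 2(1 − r), so the eigenvalue relation above reduces to λ^{TLPC} = 1 − r², which is small whenever |r| is close to 1, regardless of sign.

```python
# A fast yet highly predictive signal: x(t+1) is (nearly) the negative of
# x(t). SFA's slowness value is near its maximum of 4, while the TLPC value
# from the stated relation is near zero (= maximally predictive). Own sketch.
import numpy as np

rng = np.random.default_rng(1)
T = 200_000
a = -0.99                                   # x(t+1) ≈ -x(t)
x = np.empty(T)
x[0] = rng.standard_normal()
for t in range(1, T):
    x[t] = a * x[t - 1] + np.sqrt(1 - a**2) * rng.standard_normal()
x = (x - x.mean()) / x.std()                # SFA convention: unit variance

lam_sfa = np.mean(np.diff(x) ** 2)          # slowness; ~3.98, near the max
lam_tlpc = lam_sfa - 0.25 * lam_sfa ** 2    # relation quoted above
r = np.corrcoef(x[:-1], x[1:])[0, 1]        # one-step autocorrelation ~ -0.99

print(lam_sfa, lam_tlpc, r)                 # lam_tlpc ≈ 1 - r**2, near zero
```

SFA rates this signal maximally fast, while the TLPC eigenvalue rates it highly predictive — exactly the distinction in the quote.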

- **TLPC weighs each feature while SFA does not.**
- TLPC “… and SFA find the same components in the same order <when assumptions hold?>. The difference is that TLPC allows quantifying the components in terms of predictive information… slow feature analysis accredits the same amplitude to all components, while TLPC gives higher weights to slower components according to their predictive power.”
- <But in SFA you do get the eigenvalues of the eigenvectors, which could be used for weighting? The goal is to find the eigenvectors with the lowest eigenvalues. Why can’t those eigenvalues serve as weights, or do they simply have a less direct interpretation?>
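The weighting difference could be pictured as a toy bandwidth-allocation rule. Everything below — the example weights and the `allocate` helper — is my own illustration, not from the paper; the paper's actual weights are derived from the IB eigenvalues.

```python
# Illustrative only: splitting a fixed neuron budget across features.
# TLPC-style: proportional to each feature's weight; SFA-style: equal shares.
# The weights here are made-up numbers for illustration.
import numpy as np

def allocate(weights, budget):
    """Round a proportional allocation so the shares sum to the budget."""
    w = np.asarray(weights, dtype=float)
    raw = budget * w / w.sum()
    alloc = np.floor(raw).astype(int)
    # hand the remaining units to the largest fractional remainders
    for i in np.argsort(raw - alloc)[::-1][: budget - alloc.sum()]:
        alloc[i] += 1
    return alloc

tlpc_weights = [0.9, 0.5, 0.2, 0.1]   # hypothetical per-feature weights
print(allocate(tlpc_weights, 100))    # TLPC: unequal shares
print(allocate([1, 1, 1, 1], 100))    # SFA: equal bandwidth per feature
```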

- “The relationship between predictive coding and temporal invariance learning has also been suggested in other work, for example, by Shaw (2006), who argued that temporal invariance learning is equivalent to predictive coding if the input signals are generated from Ornstein-Uhlenbeck processes.”
- “In one regard, temporally local predictive coding differs from slow feature analysis. The information bottleneck approach is continuous in terms of the trade-off parameter β, and new eigenvectors appear as second-order phase transitions. The weighting of the eigenvectors is different in that it depends on their eigenvalue (see Figure 3). This can be important when analyzing or modeling sensory systems where available bandwidth and, hence, resulting signal-to-noise ratio, is a limiting factor. For temporally local predictive coding, available bandwidth, such as number of neurons, should be attributed according to relative amplitude, whereas slow feature analysis accredits the same bandwidth to all features.”
- <Immediately following> “We emphasize that our approach is not directly applicable to many real-world problems. Our derivation is restricted to gaussian variables and linear mappings.”
- “The restriction on the immediate past implies that SFA does not maximize predictive information for non-Markovian processes.” <They claim follow-up work on this was in progress at time of publication>