- Pick out features that vary slowly from time-series data
- “It is based on a nonlinear expansion of the input signal and application of PCA to this expanded signal and its time derivative.”
- Sounds a little like the first part is SVMish

- It can be applied hierarchically to extract features
- It is used to model the visual system
- Can learn many different things (translation, rotation, contrast, etc…) based on the training set
- Doesn’t need a large training corpus
- “Performance degrades if the network is trained to learn multiple invariances simultaneously.”
- It has been common in ANN systems to build representations that have invariances in them
- Another method is to map different representations to each other in an invariant way (such as translation or size), but this needs to be set up in advance
- The approach here is different than the previous 2 approaches listed because it is based on learning invariances from temporal inputs
- The idea is that the perception of our environment varies slowly, but at a low-level changes occur quickly; this attempts to capture that phenomenon
- “…a slowly-varying representation can be considered to be of higher abstraction level than a quickly varying one.”

- “It is important to note here that the input-output function computes the output signal instantaneously, only on the basis of the current input.”
- <**This seems like it must then throw out a great deal of useful information – basically it’s saying context doesn’t matter**>

- When dealing with a moving object, being able to identify the object and its location allows motion to be described in a way that changes slowly (certainly w.r.t. individual pixels, which change drastically as an object moves across them)
- Commonly, object identity changes slowly, but its position may change much more quickly
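
A toy illustration of the point above, as a minimal sketch. The retina size, bump width, and sinusoidal path are all made-up values, and `delta` is a hypothetical helper that measures the mean squared finite difference of a signal after normalizing it to zero mean and unit variance.

```python
import numpy as np

def delta(y):
    """Mean squared finite difference of a signal normalized to zero mean, unit variance."""
    y = (y - y.mean()) / y.std()
    return np.mean(np.diff(y) ** 2)

T = 2000
t = np.arange(T)
# Hypothetical object position: drifts slowly back and forth across a 50-pixel 1D "retina".
position = 25 + 20 * np.sin(2 * np.pi * t / T)
pixels = np.arange(50)
# Each pixel responds with a narrow Gaussian bump centred on the object.
frames = np.exp(-0.5 * ((pixels[None, :] - position[:, None]) / 1.5) ** 2)

delta_position = delta(position)     # high-level description: where the object is
delta_pixel = delta(frames[:, 25])   # low-level description: one pixel on the path
print(delta_position, delta_pixel)
```

Under these assumptions the single-pixel signal varies far more quickly than the position signal, even though both describe the same motion.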

- The formal problem:
  - Given *I*-dimensional vector inputs *x*(*t*)
  - Find a function *g*(*x*) -> *y*(*t*) of dimension *J* <shouldn’t the left side of the equation be *g*(*x*(*t*))?>
  - Such that Δ(*y_j*) is minimal
  - Under the additional constraints that:
    - *y_j* has 0-mean
    - <*y_j*^2> = 1 (unit variance)
    - <*y_j*, *y_k*> = 0 (decorrelation), where <*f*> denotes the time average 1/(*t*_1 – *t*_0) \int_{t_0}^{t_1} *f*(*t*) *dt*

- 13.3 (the minimization objective) means output variation should be minimal
- 13.4.1 and 2 (zero mean and unit variance) add constraints so output cannot be constant
- 13.4.3 (decorrelation) ensures each signal component does not reproduce another
- It says there is an order induced such that Δ(*y_j*) <= Δ(*y_k*) if *j* < *k*. <how does this fall out of the constraints?>
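
For reference, the problem sketched in these bullets can be written compactly as below. This is a reconstruction, not a quotation: the definition of Δ as the time-averaged squared derivative and the restriction of the decorrelation constraint to *k* < *j* come from the usual statement of SFA rather than from these notes, so treat both as assumptions.

```latex
\begin{aligned}
&\text{minimize}   && \Delta(y_j) := \langle \dot{y}_j^{\,2} \rangle,
                      \qquad y_j(t) = g_j(\mathbf{x}(t)), \\
&\text{subject to} && \langle y_j \rangle = 0, \\
&                  && \langle y_j^{2} \rangle = 1, \\
&                  && \langle y_k\, y_j \rangle = 0 \quad \forall\, k < j, \\
&\text{where}      && \langle f \rangle := \frac{1}{t_1 - t_0} \int_{t_0}^{t_1} f(t)\, dt .
\end{aligned}
```

Read this way, one plausible answer to the ordering question is that *y_1* is minimized under the first two constraints only, while each later *y_j* is minimized over a smaller feasible set (the same constraints plus decorrelation from all earlier components), so the attainable minimum of Δ cannot decrease as *j* grows.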

- “This learning problem is an optimization problem of variational calculus and in general is difficult to solve.” If, however, each *g_j* is a linear combination of a fixed set of nonlinear functions, the problem is easier to solve
- The algorithm to do the optimization is here
- All output components *y_j* are linear combinations of the same nonlinear components; only the weights change
- Ah, there is a brief mention of the similarity to SVM here.
- Ah, they mention that if the nonlinear basis functions applied to *x* have zero mean and unit covariance, then the constraints are satisfied iff the weight vectors are orthonormal
- The solution to this problem is an eigendecomposition problem
- The algorithm is then refined to normalize the input signals
- They propose using degree 1 or 2 basis functions (linear or quadratic)
- And then normalize the expanded signal so the results are zero-mean and have identity covariance
- They call this *sphering* or *whitening*, and the matrix that accomplishes this can be arrived at by PCA

- Then do PCA *again* on a matrix that is computed from the time derivative of the whitened outputs
- The *J* eigenvectors with the lowest eigenvalues give the normalized weight vectors
- This produces the output function
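
The steps in the bullets above can be sketched in NumPy roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation (which was in Mathematica): `quadratic_expand`, `fit_sfa`, and `apply_sfa` are hypothetical names, the time derivative is approximated by finite differences, the cutoff `eps` is an arbitrary choice, and the two-channel test signal is a made-up toy whose slow component `sin(t)` is hidden behind a fast oscillation.

```python
import numpy as np

def quadratic_expand(x):
    """Degree-2 expansion: every x_i plus every product x_i * x_j with i <= j."""
    rows, cols = np.triu_indices(x.shape[1])
    return np.hstack([x, x[:, rows] * x[:, cols]])

def fit_sfa(x, n_out, eps=1e-10):
    """Sketch of SFA training on a (T, I) signal; returns the learned parameters."""
    in_mean, in_std = x.mean(axis=0), x.std(axis=0)
    h = quadratic_expand((x - in_mean) / in_std)    # normalize inputs, then expand
    h_mean = h.mean(axis=0)
    h = h - h_mean
    # First PCA: whitening (sphering) of the expanded signal.
    evals, evecs = np.linalg.eigh(np.cov(h, rowvar=False))
    keep = evals > eps * evals.max()                # drop near-degenerate directions
    whiten = evecs[:, keep] / np.sqrt(evals[keep])
    z = h @ whiten
    # Second PCA: on the second-moment matrix of the finite-difference derivative.
    z_dot = np.diff(z, axis=0)
    d_evals, d_evecs = np.linalg.eigh(z_dot.T @ z_dot / len(z_dot))
    # eigh returns eigenvalues in ascending order, so the first columns are slowest.
    return {
        "in_mean": in_mean, "in_std": in_std, "h_mean": h_mean,
        "whiten": whiten, "weights": d_evecs[:, :n_out], "deltas": d_evals[:n_out],
    }

def apply_sfa(model, x):
    """Apply the learned function, reusing the training-set normalization."""
    h = quadratic_expand((x - model["in_mean"]) / model["in_std"]) - model["h_mean"]
    return h @ model["whiten"] @ model["weights"]

# Toy input: the slow signal sin(t) equals x1 - x2**2, a degree-2 polynomial of the inputs.
t = np.linspace(0, 2 * np.pi, 5000)
x = np.column_stack([np.sin(t) + np.cos(11 * t) ** 2, np.cos(11 * t)])
model = fit_sfa(x, n_out=2)
y = apply_sfa(model, x)
print(abs(np.corrcoef(y[:, 0], np.sin(t))[0, 1]))
```

Because the orthonormal weight vectors act on whitened data, the outputs come out zero-mean, unit-variance, and decorrelated on the training set without any further work. `apply_sfa` reuses the stored means, scales, and whitening matrix, which is the point about test-time normalization.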

- When testing, inputs must be normalized in the same manner as the data in the training set
- “For practical reasons, SVD is used in steps 4 and 5 [22 and 23] instead of PCA. SVD is preferable for analyzing degenerate data in which some eigenvalues are very close to zero, which are then discarded in step 4 [22]. The nonlinear expansion sometimes leads to degenerate data, since it produces a highly redundant representation where some components may have a linear relationship… [more stuff of the like, and numeric weirdness ].”
- <I was wondering where the ANN part of this comes up, now it is explained> They consider the basis functions to make up the hidden layer, and the weights on the basis functions to be the weights between the hidden and output layers.
- <They actually propose two different ANNs that would do the job. This explanation is a bit interesting as it unifies a couple of approaches, but on the other hand, the ANN treatment is inelegant, I don’t think it conveys what is going on well, and it has nothing to do with the actual implementation.>
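
Returning to the SVD remark quoted above, here is a minimal sketch of how whitening with SVD can discard degenerate directions. The function name `whiten_svd`, the relative tolerance, and the toy data (where one column is an exact linear combination of the other two) are assumptions for illustration; the notes do not record the threshold the paper uses.

```python
import numpy as np

def whiten_svd(h, rel_tol=1e-7):
    """Whiten a (T, N) signal with SVD, discarding directions of negligible variance."""
    h = h - h.mean(axis=0)
    # h = U @ diag(s) @ Vt; the variance along row k of Vt is s[k]**2 / (T - 1).
    _, s, vt = np.linalg.svd(h, full_matrices=False)
    keep = s > rel_tol * s[0]                 # singular values arrive in descending order
    scale = np.sqrt(len(h) - 1) / s[keep]
    whitening = vt[keep].T * scale            # (N, K) matrix, K = number of kept directions
    return h @ whitening, whitening

rng = np.random.default_rng(0)
a = rng.standard_normal((500, 2))
# Hypothetical degenerate expansion: the third column is redundant.
h = np.column_stack([a[:, 0], a[:, 1], 2 * a[:, 0] - a[:, 1]])
z, whitening = whiten_svd(h)
print(z.shape)
```

An eigendecomposition of the covariance matrix would have to divide by the square root of an eigenvalue that is zero up to rounding error; dropping that direction first avoids amplifying numerical noise.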

- “Its useful to measure the invariance of signals not by the value of Δ directly but by a measure that has a more intuitive interpretation.” They then define the measure they propose
- <**How do they define Δ exactly though? Is it based on integration? I hope not… Oh, I suppose we are probably working in discrete time, so we don’t need/can’t do more sophisticated methods of integration**>
- Going back to the point about different possible ANNs that do this (27.1), they say the type of network depends on what basis functions are used <So again, nice to see the connection to ANNs, but the connection isn’t really useful in any manner more than conceptual and shouldn’t be stressed too much>
- One example of a particular case of SFA is about two things:
- “… learning response behavior of complex cells based on simple cell responses…”
- “…estimation of disparity and motion.”
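
Going back to the question of how Δ is computed in discrete time and what the "more intuitive" measure might look like, here is a minimal sketch. It assumes Δ is the mean squared finite-difference derivative of the normalized signal, and it assumes the measure has the form η = T/(2π)·sqrt(Δ); that formula is not written down in these notes, so check it against the paper before relying on it.

```python
import numpy as np

def delta(y, dt):
    """Discrete-time Δ: mean squared finite-difference derivative of the normalized signal."""
    y = (y - y.mean()) / y.std()
    return np.mean((np.diff(y) / dt) ** 2)

def eta(y, dt):
    """Assumed form of the intuitive measure: eta = T / (2 * pi) * sqrt(delta)."""
    total_time = dt * len(y)
    return total_time / (2 * np.pi) * np.sqrt(delta(y, dt))

n_samples, n_cycles = 4096, 7
dt = 1.0 / n_samples
t = np.arange(n_samples) * dt
y = np.sin(2 * np.pi * n_cycles * t)   # a pure sine with 7 full oscillations
print(eta(y, dt))
```

Under that assumed definition, η of a pure sine wave comes out close to its number of oscillations in the time window, which is easier to interpret than the raw value of Δ.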

- Then there will be a more sophisticated example that requires chaining of 3 SFAs
- This results in translation invariance

- These are related to problems our visual system may deal with, but no claims here are made about biological plausibility
- Implementation done in Mathematica <Says something interesting about the authors.>
- The first 2 examples model 5 monocular simple cells that are modeled by Gabors.
- Data size 512 points
- <The input dimension and size of corpus seem pretty small, but the paper is from 2002 so I should probably give them a break.>
- Δ*t* is a fixed amount
- “The amplitude and phase modulation signals are low-pass filtered Gaussian white noise normalized to zero mean and unit variance”
- <Not sure what this means in the context of the experiment>
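
One plausible reading of the quoted sentence, as a minimal sketch: draw Gaussian white noise, smooth it so it varies slowly, then rescale it. The Gaussian smoothing kernel and its width are assumptions for illustration; the notes do not say which filter the paper uses. Only the 512-point length comes from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 512                       # data size quoted in the notes
white = rng.standard_normal(n_samples)

# Low-pass filter: convolve with a Gaussian kernel (the width is a made-up value).
width = 10.0
k = np.arange(-40, 41)
kernel = np.exp(-0.5 * (k / width) ** 2)
kernel /= kernel.sum()
smooth = np.convolve(white, kernel, mode="same")

# Normalize to zero mean and unit variance.
signal = (smooth - smooth.mean()) / smooth.std()
print(signal.mean(), signal.std())
```

The result is a random but smoothly drifting signal, which would make it a convenient stand-in for a slowly changing amplitude or phase that the slow outputs are then supposed to recover.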

- The experiment is set up so that 1 of the 5 simple cells has a different orientation and phase than the others so it is independent, and is designed to be a distractor which should be ignored by SFA
- <I can’t make any sense out of the graphs they have>
- <I also have to admit to not caring so much about this particular application>
- In this experiment, the results are said to be good because a degree-2 poly captures the slow features well. The third example is supposed to be harder.
- In example 3 they have a few polynomial layers, leading to a sparse higher-dimensional polynomial <Can’t you just do this with one layer and sparsify it in the same way there?>
- “The algorithm can extract not only slowly varying features but also rarely varying ones.”
- Then they move on to a model of a 1D retina with 65 sensors and 2 SFA layers
- <why is everything low-pass filtered?>
- They had to clip values of the outputs for each SFA layer, which was needed to prevent significant errors in extrapolation. Aside from that overfitting wasn’t a problem
- <Really, no idea what the graphs are depicting>
- Different parts can do either: translation invariance, what, or where information
- <I’m mostly skipping over about half of this paper – mostly in-depth discussion of results>
- <Ok, picking back up at the conclusion>
- “SFA is somewhat unusual in that directions of minimal variance rather than maximal variance are extracted.” They argue that without normalization this might be expected to pick up on noise, but because of normalization the signal actually causes less variation than actual noise <?>, so noise is actually discarded.
- Slowly varying noise, though, is susceptible to getting picked up; in this case low-pass filtering may help

- <It would be interesting instead to try and extract maximal information (in the information-theoretic sense) instead of minimizing variance. Apparently some connections like this have been made with the algorithm (Shaw, 2003; Creutzig and Sprekeler, 2008), but using an information-theoretic criterion seems like it makes the most sense. I think that any approach that focuses simply on variation and not information (such as PCA) has some potential pitfalls – variance (small or large) doesn’t matter, it’s information. Sometimes variation contains information, but sometimes it does not.>
- Although many invariances are found, they say they failed to find a similarity measure in one of the experiments
- Actually here the point is made that doing an information-theoretic optimization does not really change the algorithm (the objectives are left almost the same)
- The information theoretic approach is difficult in the case of continuous inputs/outputs
- <Ok this makes me feel better>