Pattern Recognition with Slow Feature Analysis. Berkes. CogPrints 2005.

1. “We prove that, given input patterns belonging to C non-overlapping classes and a large enough function space, the optimal solution consists of C-1 output signals that are constant for each individual class.  As a consequence, their output provides a feature space suitable to perform classification with simple methods, such as Gaussian classifiers.” <Not sure what this means yet>
2. Then the approach is applied to the handwritten digit database, and results are consistent with other prominent methods
3. SFA was originally designed for work on time series data, but idea here is that it can be used for pattern recognition on non-time series data <I guess this makes it supervised as opposed to unsupervised>  by simply presenting data of the same class one after the other in pairs.  In this paper the application is for handwritten digit recognition, so they would show one example of an “8” followed by another “8”, and then a “5” followed by a “5” and so on.
1. <I imagine there has to be a way to delineate between each pair because I guess you wouldnt want the “8” to spill into the “5”, although maybe it would work anyway because its trying to extract only the slowly changing features, which would exist more when looking at two presentations of the same digit.>
4.  The time-variance, Δ(y_j) <not sure what the j index is for, but i think its the j-th dimension of the output? anyway, Δ(y_j)>  = w_j^T A w_j
5. And <y_i y_j>_w_j^T B w_j  (this should = 0 which means outputs are decorrelated)
6. Based on points 4 and 5 and the requirement for unit variance,
1. Δ(y_j) = (w_j^T A w_j)/(w_j^T B w_j)
2. The weight vectors w_j  that minimize this correspond to the eigenvectors of AW=BWΛ
1. Where W = (w_1, …, w_N) and Λ  is the diagonal matrix of generalized Eigenvalues (λ_1, …, λ_N)
3. w_j can be normalized s.t. the remaining constraints are met:
1. w_j^T B w_j = 0
2. w_i^T B w_j = 1
7. Based on all this, Δ(y_j) = λ_j, and then sorting the Eigenvectors by increasing Eigenvalues the ordering has the property that the most slowly varying signals have the lowest indices
8. Every function g in a function space F can be expressed as a combination of bases h_1, … , h_over F
1. So, g(x) = \sum_k=1^M w_k h_k (x) = w^T h(x)
9. The algorithm:
1. z = h(x) – h_0 (“nonlinear expansion”), for
1. h(x) = (h_1(x), …, h_M(x))
2. h_0 = <h(x)>_t, where <> is the dot product
2. AW = BWΛ (“slow feature extraction”), where
1. = <\dot{z}\dot{z}^T>_t
2. = <zz^T>_t
3. “The K eigenvectors w_1, … , w_k (K <= M) corresponding to the generalized eigenvalues λ_1 <= λ_2 <= , … , <= λ_K define the nonlinear input-output functions g_1(x) , … , g_K(x) \in F: g_j(x) = w_j^T (h(x) – h_0)” which does the minimization and satisfies constraints
4. <Immediately next> “In other words to solve the optimization problem (1) [minimizing change over time] it is sufficient to compute the covariance matrix of the signals and that of their derivatives in the expanded space and then solve the generalized eigenvalue problem (citation).”
5. So there is the same basis functions for each element of the output, but each one has a different set of weights (as determined by the Eigenvector), where the weights (K of them) are the top in terms of Eigenvalue of all M of them
10. Right so the tweak to get it to do classification is just to present pairwise elements from the same class as if one followed the other in a time series
11. <I actually think that that this use of SFA makes much more sense than the original desgin which learns on time series but then evaluates not on a time series but static inputs.>
12. One problem of the method is that the matrix A has size polynomial in the input (all pairs), so if costs get too high may want to do random subsampling.  <Haven’t thought this through, but if you have to do a matrix inversion on this then there is a O(n^6) ((n^2)^3) memory cost which will obviously be problematic for even medium-sized data sets>
13. “For a given input and a fixed function space, the approach proposed above has no parameters (with the exception of the number of derivatives used to estimate A if the number of patterns per class is too high [which in terms of the pure algorithm doesn’t matter anyway as you should use everything] ), which is an important advantage with respect to other algorithms that need to be fine-tuned to the considered problem.  Moreover, since SFA is based on an eigenvector problem, it finds the global solution in a single iteration has no convergence problems…  On the other hand, SFA suffers from the curse of dimensionality, since the size of the covariance matrices A and B that have to be stored during training grows rapidly with increasing dimensionality of the considered function space, which limits the number of input components (see also Sect. 4.3).”
14. Onto experimental results – mentions work by LeCun
15. Because of computational issues related to A, (which also has a D^2 component related to the order of the polynomial) they use PCA to project the problem to a smaller space where it can be worked on
1. <It is somewhat ironic to project the problem down to a lower dimensional space only to project it into a higher dimensional space later with the basis functions>
16. Because there are only 10 digit classes, only the 9 most slowly changing features are really useful
17. To actually do classification they basically make a prediction of probability of belonging to a class based on normalized predictions of classifiers for each class
1. Have to train classifiers separately for each class <this is a weakness>
18. The best results are with a degree-3 poly and 35 input dimensions (error rate of 1.5%, which seems quite good).  Some errors seems to be on instances that are actually tough for a person, but others may be due to information that is discarded during PCA
1. KNN gets error rate of 5%
2. SVNs get 1%
3. LeCun’s stuff gets 1.7 or 0.95 depending on which version
19. They say A and B grow exponentially but I thought it was poly?
20. “The learned feature space has a very small dimensionality, such that classification can be performed efficiently in terms of speed and memory requirements.”