- “We prove that, given input patterns belonging to *C* non-overlapping classes and a large enough function space, the optimal solution consists of *C*−1 output signals that are constant for each individual class. As a consequence, their output provides a feature space suitable to perform classification with simple methods, such as Gaussian classifiers.” <Not sure what this means yet>
- Then the approach is applied to the handwritten digit database, and results are consistent with other prominent methods
- SFA was originally designed to work on time series data, but the idea here is that it can be used for pattern recognition on non-time-series data <I guess this makes it supervised as opposed to unsupervised> by simply presenting data of the same class one after the other in pairs. In this paper the application is handwritten digit recognition, so they would show one example of an “8” followed by another “8”, then a “5” followed by a “5”, and so on.
- <I imagine there has to be a way to delineate between each pair, because you wouldn't want the “8” to spill into the “5”, although maybe it would work anyway, because it's trying to extract only the slowly changing features, which would exist more when looking at two presentations of the same digit.>
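A quick sketch of how this pairing could be set up (the function name, sampling scheme, and toy arrays are mine, not from the paper):

```python
import numpy as np

def make_pair_sequence(patterns, labels, seed=0):
    """Arrange patterns so that consecutive items come from the same class.

    For each class we emit randomly chosen same-class pairs back to back,
    so the only abrupt change in the sequence happens at pair boundaries.
    A toy sketch of the trick, not the paper's exact pairing scheme.
    """
    rng = np.random.default_rng(seed)
    seq, seq_labels = [], []
    for c in np.unique(labels):
        members = patterns[labels == c]
        for _ in range(len(members)):  # a few pairs per class, not all O(n^2)
            i, j = rng.choice(len(members), size=2, replace=False)
            seq.extend([members[i], members[j]])
            seq_labels.extend([c, c])
    return np.array(seq), np.array(seq_labels)
```

Consecutive entries in `seq` then always share a class, so the slowest-varying features should be roughly constant within a class.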

- The time-variance Δ(*y_j*) = **w**_j^T **A** **w**_j <not sure what the *j* index is for, but I think it's the *j*-th dimension of the output>
- And <*y_i y_j*>_*t* = **w**_i^T **B** **w**_j (this should = 0 for *i* ≠ *j*, which means the outputs are decorrelated)
- Based on points 4 and 5 and the requirement for unit variance, Δ(*y_j*) = (**w**_j^T **A** **w**_j)/(**w**_j^T **B** **w**_j)
- The weight vectors that minimize this correspond to the eigenvectors of the generalized eigenvalue problem **AW** = **BWΛ**
- Where **W** = (**w**_1, …, **w**_N) and **Λ** is the diagonal matrix of the eigenvalues λ_1, …, λ_N

- Where **w**_*j* can be normalized s.t. the remaining constraints are met: **w**_j^T **B** **w**_j = 1 (unit variance) and **w**_i^T **B** **w**_j = 0 for *i* ≠ *j* (decorrelation)

- Based on all this, Δ(*y_j*) = λ_*j*; sorting the eigenvectors by increasing eigenvalue, the ordering has the property that the most slowly varying signals have the lowest indices
- Every function *g* in a function space *F* can be expressed as a combination of bases *h*_1, …, *h*_*M* of *F*
- So, g(**x**) = \sum_{k=1}^M *w_k* *h_k*(**x**) = **w**^T **h**(**x**)
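The minimization above is a generalized eigenvalue problem; a minimal numpy sketch of solving **AW** = **BWΛ** by Cholesky reduction (my own helper, assuming symmetric **A** and positive-definite **B**):

```python
import numpy as np

def slow_directions(A, B):
    """Solve A W = B W Lambda for symmetric A and positive-definite B.

    Cholesky reduction: with B = L L^T and v = L^T w, the problem becomes
    the ordinary symmetric problem (L^-1 A L^-T) v = lambda v, and eigh
    returns eigenvalues in ascending order -- slowest directions first.
    The resulting columns satisfy w_j^T B w_j = 1 and w_i^T B w_j = 0.
    """
    L = np.linalg.cholesky(B)
    Linv = np.linalg.inv(L)
    lam, V = np.linalg.eigh(Linv @ A @ Linv.T)  # ascending eigenvalues
    W = Linv.T @ V                              # map back from v to w
    return lam, W
```

Because `eigh` sorts eigenvalues in ascending order, the first columns of `W` are automatically the most slowly varying directions.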
- The algorithm:
    - **z** = **h**(**x**) − **h**_0 (“nonlinear expansion”), for **h**(**x**) = (*h_1*(**x**), …, *h_M*(**x**)) and **h**_0 = <**h**(**x**)>_*t*, where < >_*t* is the time average
    - **AW** = **BWΛ** (“slow feature extraction”), where **A** = <\dot{**z**}\dot{**z**}^T>_*t* and **B** = <**z** **z**^T>_*t*
- “The *K* eigenvectors **w**_1, …, **w**_*K* (*K* <= *M*) corresponding to the generalized eigenvalues λ_1 <= λ_2 <= … <= λ_*K* define the nonlinear input-output functions *g*_1(**x**), …, *g*_*K*(**x**) \in *F*: *g*_j(**x**) = **w**_j^T (**h**(**x**) − **h**_0)”, which does the minimization and satisfies the constraints
- <Immediately next> “In other words, to solve the optimization problem (1) [minimizing change over time] it is sufficient to compute the covariance matrix of the signals and that of their derivatives in the expanded space and then solve the generalized eigenvalue problem (citation).”
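Putting the two steps together, a toy end-to-end sketch (the quadratic expansion is a small stand-in for the paper's polynomial space; estimating the derivative only within pairs and the tiny ridge term are my simplifications):

```python
import numpy as np

def quadratic_expansion(X):
    """h(x): degree-1 and degree-2 monomials of the input components."""
    n, d = X.shape
    quads = [X[:, i] * X[:, j] for i in range(d) for j in range(i, d)]
    return np.hstack([X, np.stack(quads, axis=1)])

def sfa_fit(seq, K):
    """Slow feature extraction on a sequence arranged as same-class pairs.

    B = <z z^T>_t is the covariance of the expanded, centered signal;
    A = <z' z'^T>_t is estimated from differences *within* each pair
    (so the derivative never crosses a class boundary).  The K generalized
    eigenvectors with the smallest eigenvalues give g_j(x) = w_j^T z.
    """
    H = quadratic_expansion(seq)
    h0 = H.mean(axis=0)
    Z = H - h0                          # "nonlinear expansion", zero mean
    dZ = Z[1::2] - Z[0::2]              # derivative estimate within pairs
    A = dZ.T @ dZ / len(dZ)
    B = Z.T @ Z / len(Z)
    B += 1e-8 * np.eye(len(B))          # tiny ridge so Cholesky succeeds
    L = np.linalg.cholesky(B)
    Linv = np.linalg.inv(L)
    lam, V = np.linalg.eigh(Linv @ A @ Linv.T)
    W = Linv.T @ V[:, :K]               # K slowest directions
    return lambda X: (quadratic_expansion(X) - h0) @ W
```

The returned function maps any new input through the same expansion and the learned weights, so the slow features can be evaluated on single static patterns at test time.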
- So the same basis functions are used for every element of the output, but each output has its own set of weights (given by its eigenvector); the *K* eigenvectors kept out of all *M* are those with the smallest eigenvalues, i.e. the slowest

- Right so the tweak to get it to do classification is just to present pairwise elements from the same class as if one followed the other in a time series
- <**I actually think that this use of SFA makes much more sense than the original design, which learns on time series but then evaluates not on a time series but on static inputs.**>
- One problem of the method is that the matrix **A** grows with the dimensionality of the function space:
    - “For a given input and a fixed function space, the approach proposed above has no parameters (with the exception of the number of derivatives used to estimate **A** if the number of patterns per class is too high [which in terms of the pure algorithm doesn’t matter anyway, as you should use everything]), which is an important advantage with respect to other algorithms that need to be fine-tuned to the considered problem. Moreover, since SFA is based on an eigenvector problem, it finds the global solution in a single iteration and has no convergence problems… On the other hand, SFA suffers from the curse of dimensionality, since the size of the covariance matrices **A** and **B** that have to be stored during training grows rapidly with increasing dimensionality of the considered function space, which limits the number of input components (see also Sect. 4.3).”
- Onto experimental results – mentions work by LeCun
- Because of computational issues related to **A** (which also has a D^2 component related to the order of the polynomial), they use PCA to project the problem to a smaller space where it can be worked on
- <**It is somewhat ironic to project the problem down to a lower-dimensional space only to project it into a higher-dimensional space later with the basis functions**>
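A minimal sketch of that PCA step (my own helper, plain SVD rather than whatever implementation they used):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project data onto its top principal components via SVD.

    Used only to shrink the input dimensionality before the polynomial
    expansion blows it back up again (the D^2 growth mentioned above).
    """
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    P = Vt[:n_components].T  # columns = principal directions
    return (X - mu) @ P, mu, P
```

Returning `mu` and `P` lets test patterns be projected with exactly the same transform as the training data.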

- <Because there are only 10 digit classes, only the 9 most slowly changing features are really useful>
- To actually do classification, they basically predict the probability of belonging to each class from the normalized outputs of per-class classifiers
- Have to train a separate classifier for each class <this is a weakness>
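A hedged sketch of what that per-class Gaussian scheme might look like on the slow-feature outputs (function names and the regularization term are mine; the paper's exact estimator may differ):

```python
import numpy as np

def fit_gaussians(F, y):
    """Fit one Gaussian per class to the slow-feature outputs F."""
    models = {}
    for c in np.unique(y):
        Fc = F[y == c]
        mu = Fc.mean(axis=0)
        cov = np.cov(Fc.T) + 1e-6 * np.eye(F.shape[1])  # keep invertible
        models[c] = (mu, np.linalg.inv(cov), np.linalg.slogdet(cov)[1])
    return models

def predict_proba(F, models):
    """Class probabilities from normalized Gaussian log-likelihoods."""
    classes = sorted(models)
    logp = np.empty((len(F), len(classes)))
    for k, c in enumerate(classes):
        mu, prec, logdet = models[c]
        d = F - mu
        # per-row quadratic form d^T prec d, plus the log-determinant term
        logp[:, k] = -0.5 * (np.einsum('ij,jk,ik->i', d, prec, d) + logdet)
    p = np.exp(logp - logp.max(axis=1, keepdims=True))  # stable softmax
    return p / p.sum(axis=1, keepdims=True), classes
```

Each class gets its own mean and covariance, matching the note above that a separate model is trained per class.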

- The best results are with a degree-3 polynomial and 35 input dimensions (error rate of 1.5%, which seems quite good). Some errors seem to be on instances that are actually tough for a person, but others may be due to information that is discarded during PCA
- KNN gets an error rate of 5%
- SVMs get 1%
- LeCun’s stuff gets 1.7% or 0.95% depending on which version

- They say: “The learned feature space has a very small dimensionality, such that classification can be performed efficiently in terms of speed and memory requirements.”