- Their extension for **supervised SFA** is called graph-based SFA (GSFA)
- “The algorithm extracts a label-predictive low-dimensional set of features that can be post processed by typical supervised algorithms to generate the final label or class estimation.”
- Trained with a graph where edge weights represent similarities
- The modification to **SFA** made here is that it **accepts weights**
- There are different ways of building these graphs; a very simple method generates results equivalent to the Fisher linear discriminant
- Claim is that supervised learning on high-dimensional data is tough, so often a dimension reduction step is taken (perhaps unsupervised).
- Here, a supervised dimension reduction step is proposed

- “GSFA and LE [Laplacian Eigenmaps] have the same objective function, but in general GSFA uses different edge-weight (adjacency) matrices, has different normalization constraints, supports node-weights, and uses function spaces.”
- GSFA can be used for both regression and classification; many approaches only work for one of the two
- “The central idea behind GSFA is to encode the label information implicitly in the structure of the input data, as some type of similarity matrix called edge-weight matrix, to indirectly solve the supervised learning problem, rather than performing an explicit fit to the labels.”
- In the graph, there are edge weights along with node weights, which specify a-priori sample properties
- “… hierarchical processing can also be seen as a regularization method because the number of parameters to be learned is typically smaller than if a single SFA node with a huge input is used, leading to better generalization.”
- Another advantage is that if non-linear bases are used, the nonlinearity can allow for increasingly more complex functions per layer
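
The role of the edge weights and node weights mentioned above can be made concrete by writing out a weighted slowness objective. The following is a sketch in notation assumed for these notes, not copied from the paper: $\gamma_{n,n'}$ are edge weights, $v_n$ are node weights, $y_j(n)$ is the $j$-th output feature for sample $n$, and $R$ and $Q$ are normalizers.

```latex
\begin{align*}
\text{minimize}\quad
  & \Delta_j = \frac{1}{R} \sum_{n,n'} \gamma_{n,n'} \bigl( y_j(n') - y_j(n) \bigr)^2 \\
\text{subject to}\quad
  & \frac{1}{Q} \sum_{n} v_n \, y_j(n) = 0
    && \text{(weighted zero mean)} \\
  & \frac{1}{Q} \sum_{n} v_n \, y_j(n)^2 = 1
    && \text{(weighted unit variance)} \\
  & \frac{1}{Q} \sum_{n} v_n \, y_i(n) \, y_j(n) = 0 \quad (i < j)
    && \text{(weighted decorrelation)} \\
\text{with}\quad
  & R = \sum_{n,n'} \gamma_{n,n'}, \qquad Q = \sum_{n} v_n .
\end{align*}
```

Under this reading, standard SFA is the special case where only temporally consecutive samples share an edge and all node weights are equal.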

- In the graph, edges are undirected and weighted, although it seems that the approach trivially generalizes to the directed case
- **Basically they rewrite the original constraints of SFA with added weights**
- Non-existing edges are given 0-weight
- Seems like they just end up using the graph to exactly calculate what the dynamics would be based on initialization probabilities (vertex weights) and transition probabilities (edge weights)
- How to construct the graph for either classification or regression is then discussed
- For classification, they simply generate a separate graph for each class, with the items in each graph fully connected and no connections to items in a different class, so basically there are independent fully connected components for each class
- There are some tricks that can be used due to the symmetry in each class cluster to make processing cheaper
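
To make the per-class graph and the "constraints with added weights" idea concrete, here is a minimal linear sketch using NumPy. It is not the authors' implementation: the function names are hypothetical, the choice of weight 1/N_s for edges inside a class of size N_s (with all node weights set to 1) is an assumption, and none of the symmetry tricks for cheaper processing are used.

```python
import numpy as np

def clustered_graph(labels):
    """One fully connected component per class, no edges between classes.

    Assumption: edges inside a class of size N_s get weight 1/N_s and all
    node weights are 1. Other weightings are possible.
    """
    labels = np.asarray(labels)
    same_class = labels[:, None] == labels[None, :]
    class_size = same_class.sum(axis=1)        # N_s of each sample's class
    G = same_class / class_size[:, None]       # 0 between different classes
    np.fill_diagonal(G, 0.0)                   # self-edges contribute nothing
    v = np.ones(len(labels))
    return v, G

def linear_gsfa(X, v, G, n_features):
    """Linear features minimizing the edge-weighted squared differences
    under node-weighted zero mean, unit variance, and decorrelation."""
    X, v, G = (np.asarray(a, dtype=float) for a in (X, v, G))
    Q, R = v.sum(), G.sum()
    mean = (v[:, None] * X).sum(axis=0) / Q
    Xc = X - mean
    C = (Xc * v[:, None]).T @ Xc / Q           # node-weighted covariance
    # sum_{n,n'} g[n,n'] (x_n' - x_n)(x_n' - x_n)^T = Xc^T L Xc
    L = np.diag(G.sum(axis=1) + G.sum(axis=0)) - G - G.T
    C_dot = Xc.T @ L @ Xc / R
    # Whiten with respect to C, then take the directions of smallest C_dot.
    evals, evecs = np.linalg.eigh(C)
    keep = evals > 1e-10 * evals.max()
    S = evecs[:, keep] / np.sqrt(evals[keep])
    deltas, rot = np.linalg.eigh(S.T @ C_dot @ S)   # ascending eigenvalues
    W = S @ rot[:, :n_features]
    return mean, W, deltas[:n_features]

# Toy data: two classes that differ only in the first coordinate.
X = np.array([[0., 0.], [0., 1.], [0., 2.], [1., 0.], [1., 1.], [1., 2.]])
labels = [0, 0, 0, 1, 1, 1]
v, G = clustered_graph(labels)
mean, W, deltas = linear_gsfa(X, v, G, n_features=1)
y = (X - mean) @ W    # slow feature for each sample
```

On this toy data the single extracted feature is constant within each class and has opposite signs for the two classes, which is the step-function behavior described in the next set of notes.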

- What is found by the classifier in this construction is equivalent to that of the Fisher linear discriminant
- “Consistent with FDA, the theory of SFA using unrestricted function space (optimal free responses) predicts that, for this type of problem, the first S – 1 slow features extracted are orthogonal step functions, and are piece-wise constant for samples from the same identity (…). This closely approximates what has been observed empirically, which can be informally described as features that are approximately constant for sample of the same identity, with moderate noise.”
- <Immediately next paragraph> “When the features extracted are close to the theoretical predictions (e.g., their Δ-values are small), their structure is simple enough that one can use even a modest supervised step after SFA, such as a nearest centroid or a Gaussian classifier (in which a Gaussian distribution is fitted to each class) on S-1 slow features or less.”
- Using SVMs over Gaussians doesn’t make performance that much better, while being computationally more expensive
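
Since the slow features are expected to be roughly constant within a class, the "modest supervised step" can be very small. Below is a sketch of a nearest centroid classifier in plain Python. The function names and the feature values are made up for illustration; the features stand in for the output of whatever SFA/GSFA implementation is used.

```python
def fit_centroids(features, labels):
    """Mean slow-feature vector for each class."""
    sums, counts = {}, {}
    for f, c in zip(features, labels):
        acc = sums.setdefault(c, [0.0] * len(f))
        for k, value in enumerate(f):
            acc[k] += value
        counts[c] = counts.get(c, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict(centroids, f):
    """Class whose centroid is closest in squared Euclidean distance."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(f, centroids[c]))
    return min(centroids, key=dist2)

# Hypothetical slow features: nearly constant per class, with mild noise.
train_features = [[-1.0], [-0.9], [-1.1], [1.0], [1.1], [0.9]]
train_labels = ["a", "a", "a", "b", "b", "b"]
centroids = fit_centroids(train_features, train_labels)
guess = predict(centroids, [0.8])
```

A Gaussian classifier would additionally fit a variance per class; with well-separated, near-constant features the two tend to agree.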

- Now on to regression
- For regression “The fundamental idea is to treat labels as the value of a hidden slow parameter that we want to learn. In general, SFA will not extract the label values exactly. However, optimization for slowness implies that samples with similar label values are typically mapped to similar output values. After SFA reduces the dimensionality of the data, a complementary explicit regression step on a few features solves the original regression problem.”
- They discuss 4 ways of doing the regression for SFA; the first one actually doesn’t even leverage graphs
- In the version that doesn’t leverage graphs, simply sort data and then pass into SFA. “Due to limitations of the feature space considered, insufficient data, noise, etc., one typically obtains noisy and distorted versions of the predicted signals.”
- On the other hand, it’s the easiest to implement (partially because vanilla SFA can be used) so “… we recommend its use for first experiments.” If that doesn’t work well, use the GSFA approaches
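
The reordering approach is easy to sketch: sort the samples by label and hand the resulting sequence to any standard SFA implementation. The helper names below are hypothetical. The second function only spells out which pairs of samples plain SFA ends up comparing, namely consecutive ones, which is one way to see why the graph versions have more connections to work with.

```python
def reorder_by_label(samples, labels):
    """Sort samples by increasing label so the label acts as 'time'."""
    order = sorted(range(len(labels)), key=lambda n: labels[n])
    return [samples[n] for n in order], [labels[n] for n in order]

def chain_edges(n_samples):
    """Pairs compared by plain SFA on a sequence: consecutive samples only."""
    return [(n, n + 1) for n in range(n_samples - 1)]

samples = ["img_c", "img_a", "img_b"]   # placeholders for input vectors
labels = [3.0, 1.0, 2.0]
sorted_samples, sorted_labels = reorder_by_label(samples, labels)
edges = chain_edges(len(sorted_samples))
```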

- In the “sliding window training graph” items are sorted as above, but each vertex is connected to the *d* closest left and right items
- They recommend not using just 0 and 1 weights as it leads to “pathological solutions” (this may be what we’re picking up in ToH), and they discuss why that happens. <This is worth investigating further.>
- In the “serial training graph,” data points are binned together; points are then all connected to points in adjacent bins, but points in the same bin are not connected to each other <why?>
- As is the case in other particular structures, can set up GSFA to be more efficient for this particular case
- Naturally, there is tuning required to see that the binning was done correctly
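
The two graph constructions above can be sketched as edge dictionaries over label-sorted sample indices. This is an illustration under stated assumptions, not the paper's construction: function names are hypothetical, the unit edge weights are placeholders (the notes above warn that plain 0/1 weights are not recommended for the sliding window, and the recommended weights are not reproduced here, hence the `weight` hook), and node weights are left out. The `connect_within_bin` flag is a guess at how intra-bin connections would be added on top of the serial layout.

```python
def sliding_window_graph(n_samples, d, weight=lambda n, m: 1.0):
    """Link each label-sorted sample to its d nearest neighbours on each side.

    Returns {(n, m): weight} with n < m (edges are undirected).
    """
    edges = {}
    for n in range(n_samples):
        for m in range(n + 1, min(n + d, n_samples - 1) + 1):
            edges[(n, m)] = weight(n, m)
    return edges

def serial_graph(n_samples, bin_size, connect_within_bin=False):
    """Split label-sorted samples into consecutive bins and link every
    sample to every sample in the next bin."""
    bins = [list(range(s, min(s + bin_size, n_samples)))
            for s in range(0, n_samples, bin_size)]
    edges = {}
    for left, right in zip(bins, bins[1:]):
        for n in left:
            for m in right:
                edges[(n, m)] = 1.0            # placeholder weight
    if connect_within_bin:
        for group in bins:
            for i, n in enumerate(group):
                for m in group[i + 1:]:
                    edges[(n, m)] = 1.0        # placeholder weight
    return edges

window = sliding_window_graph(5, d=2)
serial = serial_graph(6, bin_size=2)
with_intra = serial_graph(6, bin_size=2, connect_within_bin=True)
```

Either dictionary can be turned into a symmetric edge-weight matrix by writing each weight to both `[n][m]` and `[m][n]`.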

- The “mixed training graph” adds connections within a bin
- Then there is a supervised step on top of this stuff <am I missing something – I thought there were 4 in total?>
- “There are at least three approaches to implement the supervised step on top of SFA to learn a mapping from slow features to the labels.”
- First option is linear or nonlinear regression
- Second option is to bin the labels and then classify <so you end up with discrete approx of regression?>
- Third option is a weighted version of #2, so you get continuous estimates
- <#s 2 and 3 immediately above look terribly hacky, if I am grokking them correctly>
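
To see how options 2 and 3 differ, here is a small sketch on a single slow feature. Everything in it is an assumption made for illustration: the helper names, the equal-size bins with equal priors, and the use of one 1-D Gaussian per bin as the classifier. Option 2 returns the mean label of the most probable bin; option 3 averages the bin labels weighted by the bin probabilities.

```python
import math

def fit_bins(features, labels, n_bins):
    """Sort by label, split into n_bins groups, and store for each group
    (mean feature, feature variance, mean label)."""
    pairs = sorted(zip(labels, features))
    size = math.ceil(len(pairs) / n_bins)
    bins = []
    for s in range(0, len(pairs), size):
        chunk = pairs[s:s + size]
        ys = [f for _, f in chunk]
        mu = sum(ys) / len(ys)
        var = sum((y - mu) ** 2 for y in ys) / len(ys) + 1e-6  # avoid zero
        bins.append((mu, var, sum(l for l, _ in chunk) / len(chunk)))
    return bins

def bin_probabilities(bins, y):
    """Posterior over bins, assuming equal priors and Gaussian likelihoods."""
    log_lik = [-0.5 * math.log(var) - (y - mu) ** 2 / (2 * var)
               for mu, var, _ in bins]
    top = max(log_lik)
    w = [math.exp(l - top) for l in log_lik]
    total = sum(w)
    return [x / total for x in w]

def hard_estimate(bins, y):
    """Option 2: label of the most probable bin (discrete output)."""
    probs = bin_probabilities(bins, y)
    return bins[probs.index(max(probs))][2]

def soft_estimate(bins, y):
    """Option 3: probability-weighted average of bin labels (continuous)."""
    probs = bin_probabilities(bins, y)
    return sum(p * b[2] for p, b in zip(probs, bins))

labels = [float(k) for k in range(8)]
features = [0.1 * l for l in labels]     # hypothetical slow feature
bins = fit_bins(features, labels, n_bins=4)
```

With this toy data, `hard_estimate` can only return one of the four bin labels, while `soft_estimate` interpolates between them for feature values that fall between bins.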

- Experimental results
- For classification they only check to see that indeed SFA does the same thing as Fisher linear discriminant (because that has already been studied exhaustively), which it does
- Interestingly in the benchmark task used, convnets are best, and outperform humans

- In the regression problems they take photos of people and estimate the horizontal position of the face, vertical position, and size. This is all done separately <why? Ah, because the sorting depends on the output variable, so you can only sort according to one… although it seems like a pretty simple extension could handle higher-dimensional outputs>
- They take face pictures from a number of data sets (a total of 64,471), which were “… automatically pre-processed through a pose-normalization and pose-reintroduction step.” Basically they are all centered and then shifted and zoomed according to a distribution. This way, they know the x, y, z values they are estimating
- Because of the size of the corpus and the images themselves, it’s difficult to apply algs like SVMs directly, so they use hierarchical SFA and GSFA (which they also call HSFA <- great>)
- They also do a hierarchical version of PCA, which sort of does the opposite thing of SFA. The 120 HPCA features used explain 88% of the variance

- Used a few different post-dimension reduction classifiers, including SVM
- The slow features of the data get organized in a more orderly fashion as you go up in the hierarchy
- “… GSFA was 4% to 13% more accurate than the basic reordering of samples employing standard SFA. In turn, reordering was at least 10% better than HPCA for this dataset.”
- Only 5 HSFA features are used, whereas 54 for HPCA. “This can be explained because PCA is sensitive to many factors that are irrelevant to solve the regression problem, such as the vertical position of the face, its scale, the background lighting, etc. Thus, the information that encodes the horizontal position of a face is mixed with other information and distributed over many principal components, whereas it is more concentrated in the slowest components of SFA.”
- Mixed and serial (straight SFA) outperformed the sliding window graphs <they were surprised but I’m not, at least with regard to mixed, as the regular sliding window just seems like a strange construction>. The serial was actually better than the mixed, although the difference wasn’t significant
- They call these approaches “implicitly supervised” because the construction of the graph depends on the supervised labels, but the algorithm never sees those labels explicitly
- “The experimental results demonstrate that the larger number of connections considered by GSFA indeed provides a more robust learning than standard SFA.”
- They knock unsupervised dimension reduction for doing dimension reduction that doesn’t necessarily help in the task you are actually interested in <But this is only “implicitly” supervised; by the same logic fully supervised dimension reduction would be better yet.>
- Being able to simply specify a graph means there is no need to exhaustively harvest data from a graph you may already have, as is the case in standard SFA
- GSFA has a tendency to overfit because it is not regularized, and is sensitive (in a bad way) to multiple types of data being used