Incremental Slow Feature Analysis: Adaptive Low-Complexity Slow Feature Updating from High-Dimensional Input Streams. Kompella, Luciw, Schmidhuber. Neural Computation 2012.

<Although I expect it has fair overlap with their lab’s other paper on incremental SFA, reading this as I’m trying to use the related software package.  Will go lighter on the details where there is repetition.>

  1. Incremental slow feature analysis (IncSFA) leverages:
    1. Candid covariance-free incremental principal components analysis (CCIPCA)
    2. Covariance-free incremental minor components analysis (CIMCA)
  2. IncSFA has linear cost with dimension, whereas standard SFA is cubic, doesn’t store or compute covariance matrices (for input dimension D, space savings of D2)
    1. Because of the lower cost related to dimension IncSFA can be run directly on high-dimensional input such as video/visual streams while standard SFA generally requires hierarchical structures in order to process high dimensional inputs
  3. On the other hand, IncSFA is less data-efficient per sample
  4. Experimental results show IncSFA develops similar features to batch SFA, and can tackle some problems standard SFA is unable to
  5. Slow features are lowest order eigenvectors of derivatives of input
  6.  Another advantage is that batch SFA is sensitive to outliers, but IncSFA deals more gracefully with outliers as well as non-stationary data
  7. “The state variables from SFA are approximations of low-order eigenvectors of the graph Laplacian (…) i.e., proto-value functions (…).  This is why they are typically more useful as features in reinforcement learning instead of other types of features, such as principal components.”
  8. In batch, goal is to find a weight vector w, that when combined with basis features recovers the slowly varying components
  9. Batch SFA uses principal component analysis twice:
    1. “Whitening [mapping of x to z s.t. it has 0-mean and identity covariance] requires the PCs of the input signal (PCA #1).”
    2. “The orthonormal basis that minimizes the rate of output change are the minor components – principal components with smallest eigenvalues – in the derivative space.  So, another PCA (#2) on ż yields the slow features (eigenvectors) and their order (via eigenvalues).”
  10. IncSFA uses online methods for doing these 2 forms of PCA
  11. To replace PCA in 9.1, it does incremental whitening, by using candid covariance-free incremental PCA (CCIPCA).  
    1. This algorithm “… incrementally updates both the eigenvectors and eigenvalues necessary for whitening, and does not keep an estimate of the covariance matrix.  It has been proven to converge to the true PCs (…).  The method can also be used for dimensionality reduction here if desired
  12. CCIPCA, on the other hand, generally cannot be used for the PCA in 9.2 (it can be used when |ż| is small), as it would take too long to converge to the slow features which are the least significant components
  13. Therefore, for PCA in 9.2, an algorithm called minor components analysis (MCA) is used. It also extracts principal components, but does it by finding the components that are “easiest to extract”, which have the smallest eigenvalues
  14. These incremental updates produce new estimates of W and (and naturally depend on the immediately previous estimates of them), where:
    1. W “…is the matrix of existing slow feature vector estimates for J slow features, where each feature is a column vector in the matrix”
    2. V “… is the matrix of K principal component vector estimates used to construct the whitening matrix and for dimensionality reduction…”
  15. After whitening, data is “… normalized in scale and decorrelated, as seen by the fact that the covariance matrix will be the identity matrix…”
  16. <They go through these two PCA algorithms, but I’m not taking notes on them>
  17. Algorithmic complexity is O(I + K + J2) where, I is the dimension of the input features, K is # eigenvectors after CCIPCA, and J is # of slow feature vectors.  Generally JKI
  18. First experiment is proof of concept from original SFA paper – recovers very similar features as batch SFA and the analytically correct slow features
    1. This is the example included in the code package
  19. Second example is a nonstationary environement (at a certain point they swap the input features so x_1 becomes x_2 and vice versa)- as an algorithm based on an update rule it deals with this better than batch methods.  It recovers after many fewer epochs than the initial training requires
  20. Next experiment deals with reduced sensitivity of IncSFA to outliers
    1. In particular, because what batch SFA trains for, it actually will fit specifically to the outliers, because they are something that occur rarely (and therefore averaged over time appear slow)
  21. The 4th example deals with video.  CCIPCA is used here not only for whitening, but also for dimension reduction
  22. In it, images are 41 x 41 x 3 of an agent rotating in a room (so the view is spinning around)
  23. Raw input is 5043D, but only 40 dimensions are used after PCA.  Of that, only 5 features are computed from CIMCA – got 50 fps on a <not powerful – although images are very low-res> Intel i3
  24. The result is a pretty much circular embedding of the data
  25. In experiment 5 looks like they use a real robot, and have it do a form of randomized motor activity until two cups placed in front of it fall over – 50 episodes of this
  26. Here input is greyscale 80×60 camera images – only 20 most significant features from CCIPCA are used, from there 5 slow features are captured.
  27. Experience replay is used – (50) episodes randomly until 400 presentations have occurred
  28. Slowness of learned features drops pretty fast and is pretty close to convergence after 50 presentations, looks basically flat by 100
  29. They embed 20 trajectories into just 2 dimensions and the data falls very well on a circular manifold, the embedding found just by PCA to 2 dimensions is much less clean
    1. IncSFA seems to be sensitive to the state of the cups (standing or fallen) while PCA does not
  30. “Such clear object-specific low-dimensional encoding, invariant to the robot’s arm position, is useful, greatly facilitating training of a subsequent regression”
  31. Move on to hierarchical IncSFA, although they put it as “Here, for completeness, we test IncSFA within a deep receptive field-field based network – at all layers – although the utility of this approach is unclear [my emphasis].”  <therefore I’m not going to take notes very carefully>
  32. “The slow features will be linear combinations of the input space components. Since there’s no guarantee the useful information is linear in the original sensory space, an expanded space is often used. For example, … a cubic expansion adds all triples. But the degree of expansion required to construct a space where the ‘interesting information’ will be some linear combination, may increase the dimension intractably…A succession of low-order expansions over multiple layers lead to an overall expansion which is high-order.
  33. [Immediately following] “Hierarchical networks introduce new parameters (receptive field size, number of layers, etc.) that can be difficult to tune.
  34. They also do a hierarchical experiment with data from a robot vision simulation, where there is a board moving forward and backward in front of the robot
  35. Here too, batch and incremental have similar results, although the results from incremental seem noisier
  36. On to supplementary topics
  37. CCIPA converges to true principal components under normal assumptions on learning rate
  38. Information about how learning rates can/should be scheduled
  39. “To prevent divergence while CCIPA is still learning, we used a slowly rising learning rate for MCA…”
  40. <Too cold in my office, rushing through to finish cause too little to leave for later, notes will be sparse here on out>
  41. Links to biological systems
  42. “The velocity estimates (the derivative signal) in the original SFA technique are approximated via a backward difference method ˙ z(t) = z(t) − z(t − 1). This method behaves badly in the presence of input noise compared to other methods (that are computationally expensive) such as higher order difference estimation, cauchy’s differentiation formula, or lanczos derivative computation etc. (Groetsch, 1998). However, noise is usually not a severe problem, since it changes at a faster time-scale compared to the slowest components and therefore does not show up in the higher-order slow features. Therefore, we opted the same backward difference method for the IncSFA to keep it computationally simple.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: