On the Relation of Slow Feature Analysis and Laplacian Eigenmaps. Sprekeler. Neural Computation 2011.


  1. “Laplacian eigenmaps (LEMs) [have been used] for nonlinear dimensionality reduction… spectral clustering, in supervised learning, and for providing efficient state representation for reinforcement learning.  Here, we show that LEMs are closely related to slow feature analysis (SFA)…  SFA can be interpreted as a function approximation of LEMs, where the topological neighborhoods required for LEMs are implicitly defined by the temporal structure of the data.”
  2. Based on this relationship, SFA is generalized to arbitrary neighborhood relations
  3. LEMs (closely related to diffusion maps) are designed to preserve neighborhood relationships in the data while performing dimensionality reduction
  4. “LEMs find an embedding of the data in a low-dimensional space by constructing a graph on the data and finding the eigenvectors of the associated graph Laplacian.”
    1. Ng, Jordan, and Weiss have a paper on the closely related spectral clustering approach <I think I read it a while back>
  5. A problem with LEMs is that the size of the graph Laplacian scales with the number of data points, so the eigendecomposition becomes very expensive for large data sets
    1. There are some tricks for overcoming this issue
  6. “Here, we show that for temporally structured data, the optimization problem of the LEM method is formally equivalent to slow feature analysis (SFA), a nonlinear signal processing algorithm that aims at minimizing temporal variations in the output signals.  The neighborhood function in the associated LEM problem is given by transition probabilities and is therefore implicitly defined by the temporal structure of the data.  We then show that the SFA algorithm is a function approximation for the full LEM problem.”
  7. They then show how SFA can be used on non-temporal data, ending up with an algorithm very similar to LEM but computationally cheaper
  8. To do LEM, the data points are treated as states of a Markov chain, with the transition probabilities set according to some distance metric/kernel
  9. Then an eigendecomposition based on the associated graph Laplacian is used to reduce the dimension of the problem <see the first sketch after these notes>
  10. “Note that LEMs assign a representation y^t to each data point x^t individually; that is they do not provide a smooth functional mapping between the input data and the representation.”
    1. <here things are indexed by t for time, for comparison to SFA, but this temporal relationship is not a necessary perspective of the algorithm>
  11. <Skipping the section on SFA as I’ve covered it a fair amount recently.  Onto the next section, SFA and Laplacian Eigenmaps>
  12. “… in the limit of large data sets with temporal structure, the objective functions of SFA and LEMs are equivalent when the neighborhood function for the LEM is determined by the transition probabilities of the data.”
  13. <not taking notes on the proof>
  14. “The two objective functions [in LEM and SFA] have the same mathematical structure and become identical if the neighborhood function W(x,x’) that measures the topological relations for the LEMs is chosen according to the joint probability distribution for subsequent data points…”
  15. “SFA can therefore be thought of as an LEM problem in which neighborhood relationships are implicitly defined by the temporal structure of the data.” <see the second sketch after these notes for a toy numerical illustration of this>
  16. “Although the objective functions become identical for a specific choice of the adjacency function, the optimization problems are different.  In LEMs, the representation y^t can be <is?> chosen individually for each data point x^t.  In SFA, in contrast, the functions g_j that map the input data x^t to the embedding y^t are chosen from a given function space F.  Consequently, the optimization problems for LEMs and SFA are equivalent only if the function space F is sufficiently rich to allow arbitrary mappings.”
  17. [next paragraph] “For poorer function spaces, SFA becomes a function approximation to the full LEM problem.  The quality of the approximation depends on the character of the function space F.”
  18. Both approaches have benefits (LEM can represent richer relationships, while SFA is computationally cheaper), so given their relationship it's possible to construct an algorithm that has both qualities <see the third sketch after these notes>
  19. “This algorithm is a generalization of SFA in the sense that it replaces temporal with arbitrary neighborhoods while maintaining the algorithmic elements of SFA.  Being a hybrid algorithm, however, it could with equal right be referred to as a function approximation of LEMs.”
  20. “In SFA, richer function spaces are often generated by a hierarchical iteration of simpler functions… this approach reduces the computational complexity for high-dimensional input signals and tends to avoid overfitting problems…” That is, layers of SFA feeding into further layers of SFA above
    1. They present a way of doing hierarchical SFA with their modifications <see the fourth sketch after these notes>
  21. There is “anecdotal evidence” that doing hierarchical SFA gives the algorithm flexibility approaching that of LEM
  22.  “As Wiskott (2003) showed, the optimal output signals for SFA are harmonic oscillations in time.  This result was derived for the case where the dependence of the output on the input data is neglected, but the same harmonic oscillations are generated if the function space that SFA can access is sufficiently rich to allow independent values for the output signal for each data point in time.”
    1. Temporally defined LEMs will have the same characteristics if increasing time leads uniformly to increasing distance between the data values <see the last sketch after these notes>
  23. Proto-value functions (PVFs) and SFA for RL: with PVFs, an exploration phase is used to learn an LEM embedding.  The relationship between SFA and LEMs means that some previous work on RL with SFA is equivalent to learning proto-value functions
  24. Remember, though, that SFA is problematic because the basis functions must be selected a priori.  The results showing equivalence to LEMs <and most other theoretical results on SFA> require a sufficiently rich set of basis functions.  “For SFA, polynomial expansions are popular, although they often require a failsafe step that clips exceedingly large output signals that arise from outliers in unknown data.”
    1. Problems with linear extrapolation of basis functions also lead to problems <divergence> in value function approximation in RL
    2. They mention work that claims that hierarchical methods tend to allow for good coverage without overfitting
  25. LEMs capture the manifold structure of the data
  26. “In the limit of a large data set from a smooth manifold M, both SFA and LEMs were shown to yield eigenfunctions of a Laplace-type differential operator on the input manifold.  For LEMs, it was shown that in the large-sample limit, the graph Laplacian (with the normalization we consider here) converges to the Laplace-Beltrami operator on the manifold <whatever that means> (…). For SFA, it was also shown that the optimal functions are the eigenfunctions of a generalized Laplace operator (…) on the manifold.”
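<The following sketches are my own illustrations in Python (NumPy/SciPy), not code from the paper.  First, a minimal LEM sketch, assuming a Gaussian kernel for the neighborhood weights and the generalized-eigenproblem normalization; the function and parameter names are mine:>

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def laplacian_eigenmap(X, sigma=1.0, n_components=2):
    # Neighborhood weights W(x, x') from a Gaussian kernel on pairwise distances.
    W = np.exp(-cdist(X, X, 'sqeuclidean') / (2.0 * sigma**2))
    D = np.diag(W.sum(axis=1))   # degree matrix
    L = D - W                    # graph Laplacian
    # Generalized eigenproblem L v = lambda D v, eigenvalues in ascending order.
    vals, vecs = eigh(L, D)
    # Drop the constant eigenvector (eigenvalue 0); the next eigenvectors give
    # each data point x^t its own low-dimensional representation y^t.
    return vecs[:, 1:n_components + 1]

# Each row of X is a data point; Y assigns a representation to each point
# individually -- there is no functional mapping from x to y (item 10).
X = np.random.randn(200, 10)
Y = laplacian_eigenmap(X, sigma=2.0, n_components=2)
```

<Note that L and D are N x N for N data points, which is exactly the scaling problem from item 5.>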
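<Second, a toy illustration of items 12-15: choosing the LEM neighborhood matrix so that temporally adjacent samples are neighbors (in the paper this weighting comes from the transition probabilities of the data) turns the LEM objective sum_{t,t'} W(t,t') ||y^t - y^t'||^2 into the SFA slowness objective sum_t ||y^{t+1} - y^t||^2.  My own toy example, not the paper's:>

```python
import numpy as np
from scipy.linalg import eigh

T = 300
# Neighborhoods defined purely by temporal adjacency: W[t, t+1] = W[t+1, t] = 1.
W = np.zeros((T, T))
W[np.arange(T - 1), np.arange(1, T)] = 1.0
W = W + W.T
D = np.diag(W.sum(axis=1))
L = D - W

# Same LEM machinery as in the first sketch; the slowest non-constant
# eigenvector plays the role of the slowest SFA output signal.
vals, vecs = eigh(L, D)
y = vecs[:, 1]   # varies smoothly (slowly) as a function of the time index
```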
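<Third, a sketch of the hybrid algorithm of items 18-19 as I understand it: keep the algorithmic elements of SFA (a fixed function space and a small generalized eigenproblem over expanded features) but replace temporal differences with differences over an arbitrary neighborhood matrix W.  The quadratic expansion, the kNN graph, and all names below are illustrative assumptions, not the paper's:>

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def generalized_sfa(X, W, n_components=2, expand=None):
    if expand is None:
        # An illustrative quadratic expansion as the function space F.
        expand = lambda X: np.hstack([X, X**2])
    Z = expand(X)
    Z = Z - Z.mean(axis=0)                 # zero-mean constraint
    # Neighborhood-weighted "slowness" matrix: for symmetric W,
    # sum_{t,t'} W(t,t') (z^t - z^t')(z^t - z^t')^T = 2 Z^T (D - W) Z.
    D = np.diag(W.sum(axis=1))
    A = Z.T @ (D - W) @ Z
    # Covariance matrix for the unit-variance / decorrelation constraints
    # (a tiny ridge is added for numerical safety).
    B = Z.T @ Z / len(Z) + 1e-8 * np.eye(Z.shape[1])
    vals, vecs = eigh(A, B)                # smallest eigenvalues = smoothest outputs
    return Z @ vecs[:, :n_components]

# Usage with an arbitrary (non-temporal) neighborhood graph, e.g. symmetrized
# k nearest neighbors in input space.
X = np.random.randn(500, 3)
k = 10
idx = np.argsort(cdist(X, X), axis=1)[:, 1:k + 1]
W = np.zeros((len(X), len(X)))
W[np.arange(len(X))[:, None], idx] = 1.0
W = np.maximum(W, W.T)
Y = generalized_sfa(X, W, n_components=2)
```

<The eigendecomposition here is only over the expanded feature dimension rather than an N x N Laplacian, which is where the computational saving relative to a full LEM comes from.>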
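<Fourth, a rough sketch of the hierarchical iteration mentioned in item 20: lower SFA layers see subsets ("patches") of the input, and their slow outputs are concatenated and fed into a higher SFA layer.  For brevity the layers here are plain linear SFA; in practice each layer would include a nonlinear expansion:>

```python
import numpy as np
from scipy.linalg import eigh

def sfa_layer(X, n_out):
    # One linear SFA layer: projections of the centered input whose temporal
    # derivative has minimal variance, under unit-variance/decorrelation constraints.
    Z = X - X.mean(axis=0)
    Zdot = np.diff(Z, axis=0)              # finite-difference "time derivative"
    A = Zdot.T @ Zdot / len(Zdot)          # slowness matrix
    B = Z.T @ Z / len(Z)                   # covariance matrix
    vals, vecs = eigh(A, B)                # generalized eigenproblem
    return Z @ vecs[:, :n_out]

T, d = 1000, 8
X = np.cumsum(np.random.randn(T, d), axis=0)        # toy slowly drifting input signal
low1 = sfa_layer(X[:, :4], n_out=2)                  # layer 1 on the first "patch"
low2 = sfa_layer(X[:, 4:], n_out=2)                  # layer 1 on the second "patch"
top = sfa_layer(np.hstack([low1, low2]), n_out=2)    # layer 2 on the concatenated outputs
```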
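<Finally, a quick numerical check related to item 22, under my reading of it: when every time point can take an independent output value, the temporal-chain problem amounts to finding eigenvectors of a path-graph Laplacian, and those are discrete cosines, i.e. harmonic oscillations in time (using a plain unit-norm normalization here rather than the paper's):>

```python
import numpy as np
from scipy.linalg import eigh

T = 200
W = np.zeros((T, T))
W[np.arange(T - 1), np.arange(1, T)] = 1.0
W = W + W.T                      # temporal chain (path graph)
L = np.diag(W.sum(axis=1)) - W   # its graph Laplacian

vals, vecs = eigh(L)             # standard eigenproblem, eigenvalues ascending
t = np.arange(T)
for j in (1, 2, 3):
    cosine = np.cos(np.pi * j * (t + 0.5) / T)
    cosine /= np.linalg.norm(cosine)
    # |correlation| close to 1: the j-th slowest output is a harmonic oscillation.
    print(j, abs(vecs[:, j] @ cosine))
```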