Category Archives: Slow Feature Analysis

State Representation Learning in Robotics: Using Prior Knowledge about Physical Interaction. Jonschowski, Brock. RSS 20Re10Q

related to https://aresearch.wordpress.com/2015/04/18/learning-task-specific-state-representations-by-maximizing-slowness-and-predictability-jonschkowski-brock-international-workshop-on-evolutionary-and-reinforcement-learning-for-autonomous-robot-s/

  1. Uses the fact that robots interact with the physical world to set constraints on how state representations are learnt
  2. Test on simulated slot car and simulated navigation task, with distractors
  3. How to extract a low dimensional representation relevant to the task being undertaken from high dimensional sensor data?
  4. The visual input in the experiments is 300-D
  5. From the perspective of RL
  6. “According to Bengio et al. [1], the key to successful representation learning is the incorporation of “many general priors about the world around us.” They proposed a list of generic priors for artificial intelligence and argue that refining this list and incorporating it into a method for representation learning will bring us closer to artificial intelligence.”
  7. “State representation learning is an instance of representation learning for interactive problems with the goal to find a mapping from observations to states that allows choosing the right actions. Note that this problem is more difficult than the standard dimensionality reduction problem, addressed by multi-dimensional scaling [14] and other methods [23, 29, 6] because they require knowledge of distances or neighborhood relationships between data samples in state space. The robot, on the other hand, does not know about semantic similarity of sensory input beforehand. In order to know which observations correspond to similar situations with respect to the task, it has to solve the reinforcement learning problem (see Section III), which it cannot solve without a suitable state representation.”
  8. State representation learning can be done by:
    1. Deep autoencoders
    2. SFA (and its similarity to proto-value functions)
    3. Predictability / Predictive actions.  Points to a Michael Bowling paper <I haven’t read – will check out>
  9. Bunch of nice references
  10. The Robotic priors they care about (these are all defined mathematically later):
    1. Simplicity: For any task, only a small number of properties matter
    2. Temporal coherence: important properties change gradually over time
    3. Proportionality: Amount of change in important properties is proportional to action magnitude
    4. Causality: important properties and actions determine the reward
    5. Repeatability: Same as causality but in terms of transition not reward
  11. These properties hold for robotics and physical systems but aren’t necessarily appropriate to all domains
    1. Even in robotics these sometimes don’t hold (for example, a robot running into a wall will have an abrupt change in velocity; once pushed against the wall proportionality doesn’t hold because no amount of pushing will allow the robot to move
  12. They set learning up as an optimization problem that includes loss terms for each of the priors above (just a linear combination)
  13. <these priors aren’t directly applicable to the mocap work I am doing now because they involve actions which we dont have access to>
  14. Formally they would need to compare all samples, leading to n^2 loss, but they restrict this to a window
  15. Linear mapping from observations to states
  16. Epsilon greedy exploration, although there is a bias to repeat the previous action
  17. Have distractors in their simulations
  18. In the navigation task they demonstrate an invariance to perspective by using either an overhead or first-person view from the robot
    1. Representations learned are highly similar
  19. Learned in the course of 5,000 observations
  20. The observations in the experiment are 300 pixels or 100 (not 300×300)
  21. For the simulated slot car task, the state sample matrix has rank 4.  Two large eigenvalues correspond to the position of the controlled car, and two smaller eigenvalues correspond to the position of the distractor car
    1. Ideally the distractor shouldn’t show up at all, but because of things like stochasticity and limited samples, weight can be placed on it to explain events that it is not related to
  22. They then to an RL comparison based on a number of different methods for learning the representation (5 features extracted by)
    1. Their approach
    2. SFA
    3. PCA
    4. Raw 300D representation
    5. (They also compare to ground truth representation)
  23. Use neural fitted Q
  24. screenshot
  25. SFA features are really terrible, their approach is pretty similar to operating from the ground truth
  26. With further investigation on the same results they demonstrate that their method has very good generalization properties
  27. Conjecture that the primary difference between their approach and the other dimension reduction methods are that they didn’t learn to disregard the distractors
Advertisements

Autonomous Learning of State Representations for Control: An Emerging Field Aims to Autonomously Learn State Representations for Reinforcement Learning Agents from Their Real-World Sensor Observations. Bohmer, Springenberg, Beodecker, Riedmiller, Obermayer. Künstliche Intelligenz 2015.

  1. Trying to do robotics control directly from high-dimensional sensor readings.  The trick is that with robots experience is costly so you need to use data very efficiently <the deepmind guys were able to run each game for the equivalent of over a month because its just in simulation>
  2. Review 2 methods for extracting features: deep auto-encoders and slow feature analysis
  3. Mention deep learning’s deep Q network
  4. Until DQN, using FAs has been largely unsuccessful, and has generally worked only on toy domains <although Tesauro’s TD Gammon is a significant exception>
  5. “In the last years several researchers have also considered end-to-end learning of behavioral policies p, represented by general function approximators. First attempts towards this include an actor-critic formulation termed NFQ-CA [14], in which both the policy p and the Q-function are represented using neural networks and learning proceeds by back-propagating the error from the Q-network directly into the policy network—which was, however, only applied to low dimensional control problems.”
  6. Also cites deterministic policy gradient https://aresearch.wordpress.com/2014/06/09/deterministic-policy-gradient-algorithms-silver-lever-heess-degris-wierstra-riedmiller-icml-2014/, which worked on a 30-DOF arm
  7.  Cites deep fitted Q (same lab, papers from a couple of years prior that also was used on slot cars and an inverted pendulum from raw camera images
  8. The first part of deep fitted Q is unsupervised, setup as an autoencoder
  9. In order for the RL to work, it is important that the problem has low “intrinsic dimensionality” (the manifold it lies on)
    1. For example, the image a camera gets from a robotic inverted pendulum can be arbitrarily large, but the underlying dimension of the problem is just 2
  10. They do not assume a pomdp; assume that the high-dimensional input completely describes the problem <although this is almost certainly not actually the case with at least the slot car, but that is not a ding on their work it just makes it more impressive>
  11. The representation chosen for the state depends on an optimization problem
  12. Train an autoencoder to minimize MSE
  13. DFQ is a clustering-based approach<?>
    1. There are some problems with DFQ that they will discuss later, but:
    2. Representation learned by DFQ isn’t reward-based, so representation doesn’t care about that
    3. “learning auto-encoders for inputs with high variability (i.e. many objects of relevance) can be hard” Although says regularization can help this, and cites a number of other papers that deal with this
  14. “Despite these drawbacks DFQ also comes with advantages over end-to-end learning: since the auto-encoder merely learns to reconstruct sampled observations and it can be fed with samples generated by any sampling policy for any task, is thus less susceptible to non-stationarity of the training data.”
  15. Mentions things like proto-value functions, laplacian eigenmaps etc for discrete RL, relationship between SFA and LEM and PVF
  16. “From a theoretical point of view, SFA and PVF both approximate subspace-invariant features. This classification has its origin in the the analysis of approximation errors in linear RL [39]. Here subspace-invariant features induce no errors when the future reward is propagated back in time. It can be shown that under these conditions the least-squares temporal difference algorithm (LSTD, [9]) is equivalent to supervised least-squares regression of the true value function [5]. However, this is only possible for the class of RL-tasks with a self-adjoint transition model. As this class is very rare, both SFA and PVF substitute a self-adjoint approximation of the transition model to compute almost subspace-invariant representations.4 An analysis of the optimal solution5 shows that SFA approximates eigenfunctions of the symmetrized transition operator [50]. Moreover, with a Gaussian prior for the reward, one can show that SFA representations minimize a bound on the expected LSTD error of all tasks in the same environment [5]. However, as the solution depends on the sampling distribution, straight forward application for transfer learning is less obvious than in the case of PVF. Future works may rectify this with some sensible importance sampling, though.”
  17. When the data completely covers the space, PVF and SFA are pretty equivalent, but when data is generated from a random walk, PVF fails and SFA still manages to do almost as well as when there is perfect coverage
  18. In less artificial settings/domains there is less of a performance gap between PFV and SFA but it still exists, SFA seems to do a better job generating features for LSTD and LSPI
  19. In these problems, the representation learned must:
    1. Maintain Markov property (no partial observability)
    2. Be able to represent the value function well enough to bootstrap from it and improve the policy
    3. Generalize well (especially to cases that may not be well covered by the training distribution)
    4. Low dimensional
  20. “In stochastic environments one can only compare probability distributions over future states based on a random policy, called diffusion distances. It can be shown that SFA approximates eigenfunctions of the symmetrized transition operator, which encode diffusion distances [5]. SFA features are therefore a good representation for nonlinear RL algorithms as well.”
    1. <Depends how you define “stochastic environments” I assume they mean environments where there is no choice in action selection at all because otherwise what is written is incorrect>
  21. “In summary, SFA representations X seem in principle the better choice for both linear and non-linear RL: nonlinear SFA extracts eigenfunctions of the transition model Pp, which are the same for every isomorphic observation space Z, encode a diffusion metric that generalizes to states with similar futures and approximates a Fourier basis of the (unknown) underlying state space.”
  22. Empirical work shows SFA works best when exposed to data that gets uniform coverage
  23. “a Fourier basis as approximated by SFA grows exponential in the underlying state dimensionality. Linear algorithms, which depend on this basis to approximate the value function, are therefore restricted to low dimensional problems with few or no variables unrelated to the task. Non-linear RL algorithms, on the other hand, could work in principle well with only the first few SFA features of each state-dimension/variable. e. The order in which these variables are encoded as SFA features, however, depends on the slowness of that variable. This can in practice lead to absurd effects. Take our example of a wheeled robot, living in a naturally lit room. The underlying state space that the robot can control is three-dimensional, but the image will also depend on illumination, that is, the position of the sun.”
    1. Describes this as being vulnerable to slow distractors
  24. “Also, auto-encoders minimize the squared error over all input dimensions of Z equally. This can produce incomplete representations if a robot, for example, combines observations from a camera with measurements from multiple joints. Due to the large number of pixels, small improvements in the reconstruction of the image can outweigh large improvements in the reconstruction of the joint positions.”
  25. Compares results from a deep auto encoder and one with a NN that has objectives of slowness and predictability.  The latter produces a much better output <but this is really a garbage comparison because the prior is based on actual physical experiments and the latter is based on an extremely oversimplified clean simulation.>
  26. “Take the example of uncontrollable distractors like blinking lights or activity outside a window. Each distractor is an independent variable of the isomorphic state S, and to learn an isomorphic representation X requires thus samples from all possible combinations of these variables.  The required training samples grow exponentially in the number of distractors.”
    1. <The XSENS suit has tons of blinking lights on it.  I was thinking about this as a potential distractor for the methods we are working on…>
  27. Moves on to a discussion of “factored representations and symbolic RL” at the end of the paper
  28. Basically discuss object oriented RL (discuss it in terms of relational RL)

Locality Preserving Projections. He, Niyogi. NIPS 2003

  1. LPPs are “linear projective maps that arise by solving a variational problem that optimally preserves the neighborhood structure of the data set.”
    1. An alternative to PCA
  2. “When the high dimensional data lies on a low dimensional manifold embedded in the ambient space, the LPP are obtained by finding the optimal linear approximations to the eigenfunctions of the Laplace Beltrami operator <whatever that is, oh its the Laplacian on general spaces, not Euclidian> on the manifold.”
  3. LPP has many data representation properties nonlinear methods like Laplacian Eigenmaps or LLE, but LPP is linear and therefore is defined everywhere and not solely on the data points used for training
  4. Because LPP preserves local structure, nearest-neighbor search in the low dimensional space found by the LPP may be similar to that of the raw high-D space, so NN can be done more cheaply
  5. Fact that LPP is linear makes it “fast and suitable for practical application.”  Other similar methods are nonlinear and don’t have this property
  6. “LPP may be conducted in the original space or in the reproducing kernel Hilbert space (RKHS) into which data points are mapped.  This gives rise to kernel LPP.”
  7. LPP is a linear approximation of nonlinear Laplacian Eigenmap
  8. <LPP, SFA, a number of other methods are unified in: http://link.springer.com/chapter/10.1007%2F978-3-662-44851-9_30>
  9. Construct adjacency graphs either by epsilon-ball or k-nn, either choose weights as unit, or some other method (like Gaussian bump)
  10. Then solve eigenvector problem
  11. “While the Laplace Beltrami for a manifold is generated by the Riemannian metric, for a graph it comes from the adjacency relation.”
  12. <skipping the math>
  13. Kernel LPP generalizes to the nonlinear case

Learning Slow Features for Behaviour Analysis. Zafeiriou, Nicolaou, Zafeiriou, Nikitidis, Pantic. ICCV 2013.

  1. Discusses deterministic as well as probabilistic SFA <I don’t know anything about the latter>
  2. Derive a version of deterministic SFA “… that is able to identify linear projections that extract the common slowest varying features of two or more sequences.”
  3. Also, an EM method for inference in probabilistic SFA and extend it to handle multiple data sequences
  4. This EM-SFA algorithm “… can be combined with dynamic time warping techniques for robust sequence time-alignment.”
  5. Used on face videos
  6. <A number of good references in here – a few I still haven’t read>
  7. Seems like the probabalistic version of SFA is based on similarity to an ML estimate of a linear generative model with “… a Gaussian linear dynamical system with diagonal covariances over the latent space.” <Need to look into that more>
  8. Their version of probabilistic SFA is novel because it is based on EM, as opposed to other ML approaches.  This allows for modeling latent distributions that aren’t restricted by diagonal covariance matrices.
    1. <There is a hack in the math in the standard ML probabilistic SFA that maps variance to 0, which “… essentially reduces the model to a deterministic one…”
  9. Call facial expressions, action units (AUs), this is the first unsupervised method that does so based on temporal data; other unsupervised approaches are based on “…global structures…” <I assume that means global properties of still frames>
  10. <Not taking notes on particulars of math for probabilistic SFA, both ML and EM versions>
  11. When looking at more than one data sequence, SFA for this setting is designed to find the common slowly varying features
  12. <Naturally> assumption is that there is a common latent space between multiple-sequence SFA
    1. Math for both deterministic and probabilistic multi-sequence SFA
  13. <The time warping part / aligning sequences of different length is probably not relevant>
  14. The video data they work from is a public database of 400 annotated videos, each with a constrained structure of neutral, onset, apex, and offset.
  15. Not so clearly described, but seems like they dont operate off raw video data; they work off of 68 facial points tracked in 2D
  16. Compare performance of EM-SFA, SFA, and traditional linear dynamic systems.
    1. Seems like each is run on a subset of facial data, either mouth, eyes, or brows
  17. Use slowest feature of SFA as a classifier for when the expression is being performed (when its value goes from negative to positive, and back to negative)
  18. EM-SFA outperforms SFA ad LDS
  19. <Did a quick read, but I don’t see where they have experimental results of multi-sequence.  If it is indeed missing that is strange because they do have the math and the data already to test it.>
  20. The temporal alignment algorithm is used to match up videos from the same category so that the neutral, apex, etc… frames are synchronized
Tagged

Slow Feature Analysis for Human Action Recognition. Zhang, Tao. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE. 2012

  1. Considers 4 forms of SFA for action recognition (apparently the other 3 are new for this work):
    1. (original) unsupervised SFA
    2. supervised SFA
    3. discriminitve SFA
    4. spatial discriminitive SFA
  2. Train SVM on slow features of squared input <not exactly, see ASD in #6>
  3. Test on a number of databases with good results
  4. ” First, a large number of local cuboids are collected by randomly sampling in motion boundaries. Then, a number
    of slow feature functions are learned from these local cuboids.” <I don’t know what a cuboid is>
  5. “To the best of our knowledge, this is the first work that uses the slowness principle or SFA to analyze human motions.”
  6. “Instead of using the responses of the learned slow feature functions directly, we propose the Accumulated Squared Derivative (ASD) feature to represent a given action sequence which is a statistical representation of the slow features in an action sequence.”
  7. 3 common classes of action recognition algorithms:
    1. Holistic features: global properties like body shape, joint angles, motion.  Require good segmentation and tracking; sensitive to noise and tracking errors
    2. Local descriptors: Find points of interest, bag of words, some other stuff
    3. Biologically inspired: Motion sensitivity inspired by visual cortex.  SFA is also somewhat biologically motivated but has a very different approach
  8. “There are four main steps in the SFA-based human action recognition, including Collection of training cuboids, Slow feature function learning, Action feature representation,and Classification.InSlow feature function learning, we extend the original SFA by using weakly supervised information and spatial information of the training cuboids to obtain discriminative slow feature functions for action classification.”
  9. Finding cuboids basically means finding what part of the image is of interest (contains a person in the foreground).  The have some presupplied information that comes with the data set (bounding box, cuboid is more the actual shape of the foreground figure as composed of many overlapping squares/cubes) that makes this easier.
  10. They use quadratic basis, but since that is too big, then use PCA to project down to 50 dimensions, some other preprocessing
  11. In supervised SFA, instead of running all data through the same algorithm, SFA is run independently on the data for each action of interest. “Finally, the statistical feature is computed with all slow feature functions. However, different actions may share many similar local motion patterns, so different labels to these “common” cuboids are misleading.” <not sure what this means exactly>
  12. “D [descriptive]-SFA is inspired by discriminative sparse coding [45], where a number of sets of discriminative dictionaries are learned, and each set of dictionaries is used to reconstruct a specific image class. Accordingly, D-SFA learns a number of sets of functions and each set of functions is used to slowdown a specific action class.”
    1.  “Therefore, each learned function makes the intraclass signals x_c(t) vary slowly, but makes the interclass signals x_{c’}(t) that are different from class c vary quickly.”
    2. Results in a similar set of constraints as standard SFA, but x values are now c [class]-indexed
  13. spatial discriminitve SFA adds spatial information.  The motivation is that some activities might involve more motion in a certain region (for example, upper could correspond to arm motion, while lower to leg motion)
    1. Here they do upper, mid, and lower
  14. For standard SFA “Each cuboid is considered as one minisequence. The covariance matrix and the time-derivative covariance matrix are calculated by combining all minisequences.”
  15. Do L1 normalization on the cuboids because the number of them may vary but the vectors representing the data must be the same length
  16. Bottom of p 441 has more in-depth description of algorithm run end-to-end
  17. The ASD is the concatenation of multiple slow feature functions.
  18. Classification is done via SVM on ASD
  19. More technical details in experimental results section
  20. <low on energy not taking much further notes>

Untangling Invariant Object Recognition. Dicarlo, Cox. TRENDS in Cognitive Sciences 2007.

Read this because this paper is pointed to as evidence of slow features doing object recognition, but really there is very little mention of SFA.

  1. Considers object recognition, and “…show that the primate ventral visual processing stream achieves a particularly effective solution in which single-neuron invariance is not the goal.”
  2. “Because brains compute with neurons, the subject must have neurons somewhere in its nervous system — ‘read-out’ — neurons – that can successfully report if objectg A was present […].”
  3. “The central issues are: what is the format of the representation used to support the decision (the substrate on which the decision functions operate); and what kinds of decision functions (i.e. read-out tools) are applied to that representation?”
  4. “… we treat object recognition fundamentally as a problem of data representation and re-representation, and we use simple decision functions (linear classifiers) to examine those representations.”
  5. Object recognition is fundamentally difficult, because “… each image projected into the eye is one point in an ~1 million dimensional retinal ganglion cell representation (…).”
  6. Manifolds: “… all the possible retinal images that face could ever produce… form a continuous, low-dimensional, curved surface inside the retinal image space called an object ‘manifold’…”
  7. Consider two potential image manifolds that “… do not cross or superimpose — they are like two sheets of paper crumpled together… We argue that this describes the computational crux of ‘everyday’ recognition: the problem is typically not a lack of information or noisy information, but that the information is badly formatted in the retinal representation — it is tangled (…).”
  8. “One way of viewing the overarching goal of the brain’s object recognition machinery, then, is as a transformation from visual representations that are easy to build (e.g. center-surround filters in the retina), but are not easily decoded … into representations that we do not yet know how to build (e.g. representations in IT), but are easily decoded…”
  9. “… single neurons at the highest level of the monkey ventral visual stream — the IT cortex — display spiking responses that are probably useful for object recognition.  Specifically, many individual IT neurons respond selectively to particular classes of objects, such as facesor other complex shapes, yet show some tolerance to changes in object position, size, pose, and illumination, and low-level shape cues.”
  10. In <what I suppose is actual> neural responses from 200 neurons in IT (the highest part of the visual system). Simple linear classifiers on activity were robust to variation in object position and size.  More sophisticated classifiers didn’t improve performance much, and were ineffective when applied directly to v1 activity.
    1. V1 doesn’t allow for simple separation and identification of objects but IT does.
  11. The classical approach to understanding visual processing is to consider the action of individual neurons — here they consider more the activity of entire populations in the representation of visual images
  12. According to this perspective, the best way to build object representations is to find a way to separate manifolds
  13. Because populations are considered, it is not the case that a single neuron must represent things (although if a single neuron can split the manifolds that is great).  Because “… real IT neurons are not, for example, position and size invariant, in that they have limited spatial receptive fields <as opposed to wide spatial receptive fields, as would be required if single neurons were actually responsible for representing invariances>[…].  It is now easy to see that this ‘limitation’ <of relying on populations of neurons as opposed to single neurons> is an advantage.”
  14. Manifold flattening may be hard wired, but it may also be the case that the means to do this is learned from the statistics of natural images. <The latter must play at least part of the role>
  15. Progressive flattening by increasing depth in visual system
    1. “… we believe that the most fruitful computational algorithms will be those that a visual system (natural or artificial) could apply locally and iteratively at each cortical processing stage (…) in a largely unsupervised manner(…) and that achieve some local object manifold flattening.”
  16. In the actual visual system, each layer projects information into a spaces of increasing dimension “… e.g. ~100 times more v1 neurons than retinal ganglion neurons …” as data gets projected to higher dimensions (if the projection is good) it makes simpler methods of analysis and classification more effective
    1. Additionally, at each stage the distribution of activity is “… allocated in a way that matches the distribution of visual information encountered in the real world …” which means the projections into higher dimensions is actually using the additional space reasonably
  17. The addition that temporal information helps untangle manifolds: “In the language of object tangling, this is equivalent to saying that temporal image evolution spells out the degrees of freedom of object manifolds.  The ventral stream might use this temporal evolution to achieve progressive flattening of object manifolds across neuronal processing stages.  Indeed, recent studies in our laboratory … have begun to connect this computational idea with biological vision, showing that invariant object recognition can be predictably manipulated by the temporal statistics of the environment.”
    1. This is where SFA is cited

How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis. Escalante-B, Wiskott. CogPrints 2013.

  1. Their extension for supervised SFA is called graph-based SFA
  2. “The algorithm extracts a label-predictive low-dimensional set of features that can be post processed by typical supervised algorithms to generate the final label or class estimation.”
  3. Trained with a graph where edge weights represent similarities
  4. The modification to SFA made here is that it accepts weights
  5. There are different ways of building these graphs, a very simple method generates results equivalent to the Fisher linear discriminant
  6. Claim is that supervised learning on high-dimensional data is tough, so often a dimension reduction step is taken (perhaps unsupervised).
    1. Here, a supervised dimension reduction step is proposed
  7. “GSFA and LE [Laplacian Eigenmaps] have the same objective function, but in general GSFA uses different edge-weight (adjacency) matrices, has different normalization constraints, supports nonde-weights, and uses function spaces.”
  8. GSFA can be used for both regression or classification, many approaches only work for one of the two
  9. “The central idea behind GSFA is to encode the label information implicitly in the structure of the input data, as some type of similarity matrix called edge-weight matrix, to indirectly solve the supervised learning problem, rather than performing an explicit fit to the labels.”
  10. In the graph, there are edge weights along with node weights, which specify a-priori sample properties
  11. “… hierarchical processing can also be seen as a regularization method because the number of parameters to be learned is typically smaller than if a single SFA node with the number of parameters than if a single SFA node with a huge input is used, leading to better generalization.”
    1. Another advantage is that if non-linear bases are used, the nonlinearity can allow for increasingly more complex functions per layer
  12. In graph edges are undirected, weighed, although it seems that the approach trivially generalizes to the directed case
  13. Basically they rewrite the original constraints of SFA with added weights
  14. Non-existing edges are given 0-weight
  15. Seems like they just end up using the graph to exactly calculate what the dynamics would be based on initialization probabilities (vertex weights) and transition probabilities (edge weights)
  16. How to construct the graph for either classification or regression is then discussed
  17. For graphs, they simply generate a separate graph for each class, with each item in each graph fully connected, and each sub-graph completely unconnected to items in a separate class, so basically there are independent fully connected components for each class
    1. There are some tricks that can be used due to the symmetry in each class cluster to make processing cheaper
  18. What is found by the classifier in this construction is equivalent to that of the Fisher linear discriminant
  19. “Consistent with FDA, the theory of SFA using unrestricted function space (optimal free responses) predicts that, for this type of problem, the first S – 1 slow features extracted are orthogonal step functions, and are piece-wise constant for samples from the same identity (…).  This closely approximates what has been observed empirically, which can be informally described as features that are approximately constant for sample of the same identity, with moderate noise.”
  20. <Immediately next paragraph> “When the features extracted are close to the theoretical predictions (e.g., their Δ-values are small), their structure is simple enough that one can use even a modest supervised step after SFA, such as a nearest centroid or a Gaussian classifier (in which a Gaussian distribution is fitted to each class) on S-1 slow features or less.”
    1. Using SVMs over Gaussians doesn’t make performance that much better, while being computationally more expensive
  21. Now on to regression
  22. For regression “The fundamental idea is to treat labels as the value of a hidden slow parameter that we want to learn.  In general, SFA will not extract the label values exactly.  However, optimization for slowness implies that samples with similar label values are typically mapped to similar output values.  After SFA reduces the dimensionality of the data, a complimentary explicit regression step on a few features solves the original regression problem.”
  23. They discuss 4 ways of doing the regression for SFA, the first one actually doesn’t even leverage
  24. In the version that doesn’t leverage graphs, simply sort data and then pass into SFA.  “Due to limitations of the feature space considered, insufficient data, noise, etc., one typically obtains noisy and distorted versions of the predicted signals.”
    1. On the other hand, its the easiest to implement (partially because vanilla SFA can be used) so “… we recommend its use for first experiments.”  If that doesn’t work well, use the GSFA approaches
  25. In the “sliding window training graph” items are sorted as above, but each vertex is connected to the d closest left and right items
  26. They recommend not using just 0 and 1 weights as it leads to “pathological solutions” – this may be what we’re picking up in ToH, and talk about why that happens. <This is worth investigating further.>
  27. In the “serial training graph,” data points are binned together – then points are all connected to points in adjacent bins, but they don’t connect all the points in a same bin  <why?>
    1. As is the case in other particular structures, can set up GSFA to be more efficient for this particular case
    2. Naturally, there is tuning required to see that the binning was done correctly
  28. The “mixed training graph” adds connections within a bin
  29. Then there is a supervised step on top of this stuff <am I missing something – I thought there were 4 in total?>
  30. “There are at least three approaches to implement the supervised step on top of SFA to learn a mapping from slow features to the labels. ” <
    1. First option is linear or nonlinear regression
    2. To bin and then classify <so you end up with discrete approx of regression?>
    3. Do a weighted version of #2 so you get continuous estimations
    4. <#s 2 and 3 immediately above look terribly hacky, if I am groking them correctly>
  31. Experimental results
  32. For classification they only check to see that indeed SFA does the same thing as Fisher linear discriminant (because that has already been studied exhaustively), which it does
    1. Interestingly in the benchmark task used, convnets are best, and outperform humans
  33. In the regression problems they take photos of people and estimate the horizontal position of the face, vertical position, and size.  This is all done separately <why?  Ah, because the sorting depends on the output variable, so you can only sort according to one… although it seems like a pretty simple extension could handle higher-dimensional outputs>
  34. Take face pictures from a number of data sets (a total of 64,471) and were “… automatically pre-processed through a pose-normalization and pose-reintroduction step. Basically they are all centered and then from there shifted and zoomed according to a distribution.  This way, they know what the x,y,z values they are estimating are
  35. Because of the size of the corpus and images themselves, its difficult to apply algs like SVMs directly, so they use hierarchical SFA and GSFA (which they also call HSFA <- great>)
    1. They also do a hierarchical version of PCA, which sort of does the opposite thing of SFA.  The 120 HPCA features used explain 88% of the variance
  36. Used a few different post-dimension reduction classifiers, including SVM
  37. The slow features of the data gets organized in a more orderly fashion as go up in hierarchy
  38. “… GSFA was 4% to 13% more accurate than the basic reordering of samples employing standard SFA.  In turn, reordering was at least 10% better than HPCA for this dataset.”
  39. Only 5 HSFA features are used, whereas 54 for HPCA. “This can be explained because PCA is sensitive to many factors that are irrelevant to solve the regression problem, such as the vertical position of the face, its scale, the background lighting, etc. Thus, the information that encodes the horizontal position of a face is mixed with other information and distributed over many principal components, whereas it is more concentrated in the slowest components of SFA.”
  40. Mixed and serial (straight SFA) outperformed the sliding window graphs <they were surprised but I’m not, at least with regards to mixed as it regular sliding window just seems like a strange construction).  The serial was actually better than the mixed, although the difference wasn’t significant
  41. They call these approaches “implicitly supervised” because the construction of the graph depends on the supervised labels, but the algorithm never sees those labels explicitly
  42. “The experimental results demonstrate that the larger number of connections considered by GSFA indeed provides a more robust learning than standard SFA.”
  43. Knock unsupervised dimension reduction by doing dimension reduction that doesn’t necessarily help in the task you are actually interested in <But this is only “implictly” supervised, by the same logic fully supervised dimension reduction would be better yet.>
  44. Being able to simply specify a graph means there is no need to exhaustively harvest data from a graph you may already have, as is the case in standard SFA
  45. GSFA has a tendency to overfit because it is not regularized, and is sensitive (in a bad way) to multiple types of data being used

Slowness: An Objective for Spike-Timing-Dependent Plasticity? Sprekeler, Michaelis, Wiskott. PLoS Computational Biology 2007

I Picked up this paper because I saw a reference to it saying this paper shows SFA equivalent to PCA of a low-pass fiter

  1. Explores how SFA can be implemented “…within the limits of biologically realistic spike-based learning rules.”
  2. Show a few ways SFA could be implemented with different neural models
  3. Fastness is measured by the average variance of the time derivative of the output features
  4. The part of the algorithm that is least plausible from the perspective of a biological system is the eigen decomposition.  The main aim of this paper is to show how this could be implemented “… in a spiking model neuron”
  5. “In the following, we will first consider a continuous model neuron and demonstrate that a modified Hebbian learning rule enables the neuron to learn the slowest (in the sense of SFA) linear combination of its inputs.  Apart from providing the basis for the analysis of the spiking model, this section reveals a mathematical link between SFA and the trace learning rule, another implementation of the slowness principle.  We then examine if these findings also hold for a spiking model neuron, and find that for a linear Poisson neuron, spike-timing-dependednt plasticity (STDP) can be interpreted as an implementation of the slowness principle
  6. “Even though there are neurons with transient responses to changes in the input, we believe it would be more plausible if we could derive an SFA-learning rule that does not depend on the time derivative, because it might be difficult to extract, especially for spiking neurons.”
    1. The time derivative can be replaced by a low-pass filter (a fair amount of math to show that)
    2. <But earlier in the paper they wrote> “It is important to note that the function g1(x) is required to be an instantaneous function of the input signal.  Otherwise, slow output signals could be generated by low-pass filtering the input signal.  As the goal of the slowness principle is to detect slowly varying features of the input signals, a mere low-pass filter would certainly generate slow output signals, but it would not serve the purpose.” <So then what is the difference between this and the low pass filter they just mentioned?>
    3. <After all the math> “Thus, SFA can be achieved either by minimizing the variance of the time derivative of the output signal or by maximizing the variance of the appropriately filtered output signal.” <Oh, I see.  You can’t just filter the output, you have to set up the system so it maximizes the variance of the filtered output?>
  7. Basically from a whitened input, you can either: use the time derivative and then choose the direction of minimal variance, or use a low-pass filter and the choose the direction of maximal variance
  8. “… standard Hebbian learning under the constraint of a unit weight vector applied to a linear unit maximizes the variance of the output signal. We have seen in the previous section that SFA can be reformulated as a maximization problem for the variance of the low-pass filtered output signal.  To achieve this, we simply apply Hebbian learning to the filtered input and output signals, instead of the original signals.” <The goal with analysis for the Hebbian rule is to find something more biologically plausible>
  9. “Thus, the filtered Hebbian plasticity rule … optimizes slowness … under the constraint of unit variance… ”  The requirement that the data already has unit variance  “… underlines the necessity for a clear distinction between processing <cleaning up the data?> and learning.”  <They talk more about processing vs learning but its not clear to me what they mean, which is unfortunate because they say the distinction is even more important when moving to the Poisson model neuron>
  10. SFA is a quadratic approximation of the trace rule (which comes from different power spectrums for the low pass filters the two use) <I don’t know what anything on this line means.>.  In the case that the power spectrum is parabolic (most input power is concentrated at lower frequencies) <Is that what that would mean?>, then the results of both will be similar.
  11. In SFA the outputs of each step are real-valued, but in a neuron there is simply spiking information.  Here a neuron is modeled by “inhomogenous Poisson processes” – here they simply consider information contained in the spike rate, and ignore information held in the exact spike timing (they describe this model mathematically)
  12. “…in an ensemble-averaged sense it is possible to generate the same weight distribution as in the continuous mode by means of an STDP rule with a specific learning window.”
  13. <I guess the model is optimal for learning slowness (mostly skipped that section) because it then follows up saying that they have yet to discuss why it is optimal>
  14. <Skipping neural modelling almost entirely>
  15. “In the first part of the paper, we show that for linear continuous model neurons, the slowest direction in the input signal can be learned by means of Hebbian learning on low-pass filtered versions of the input and output signal.  The power spectrum of the low-pass filter required for implementing SFA can be derived from the learning objective and has the shape of an upside-down parabola.”
  16. <Immediately following> “The idea of using low-pass filtered signals for invariance learning is a feature that our model has in common with several others […]. By means of the continuous model neuron, we have discussed the relation of our model to these ‘trace rules’ and have shown that they bear strong similarities.”
  17. In the implementation of SFA on the a Poisson neuron, “The learning window that realizes SFA can be calculated analytically.”
    1. “Interestingly, physiologically plausible parameters lead to a learning window whose shape and width is in agreement with experimental findings.  Based on this result, we propose a new functional interpretation of the STDP learning window as an implementation of the slowness principle that compensates for neuronal low-pass filters such as EPSP.”
  18. “Of course, the model presented here is not a complete implementation of SFA, the extraction of the most slowly varying direction from a set of whitened input signals.  To implement the full algorithm, additional steps are necessary: a nonlinear expansion of the input space, the whitening of the expanded input signals, and a means of normalizing the weights… On the network level, however, whitening could be achieved by adaptive recurrent inhibition between the neurons […].  This mechanism may also be suitable for extracting several slow uncorrelated signals as required in the original formulation of SFA […] instead of just one.”
  19. For weight normalization “A possible biological mechanism is synaptic scaling […], which is believed to multiplicatively rescale all synaptic weights according to postsynaptic activity… Thus it appears that most of the mechanisms necessary for an implementation of the full SFA algorithm are available, but that it is not yet clear how to combine them in a biologically plausible way.”
  20. <Immediately following>”Another critical point in the analytical derivation for the spiking model is the replacement of the temporal by the ensemble average, as this allows recovery of the rates that underlie the Poisson process.”  Data should be ergodic
  21. Not yet clear if these results can be reproduced with more realistic model neurons.
  22. “In summary, the analytical considerations presented here show that (i) slowness can be equivalently achieved by minimizing the variance of the time derivative signal or by maximizing the variance of the low-pass filtered signal, the latter of which can be achieved by standard Hebbian learning on the low-pass filtered input and the output signals; (ii) the difference between SFA and the trace learning rule lies in the exact shape of the effective low-pass filter–for most practical purposes the results are probably equivalent; (iii) for a spiking Poisson model neuron with an STDP learning rule, it is not the learning window that governs the weight dynamics but the convolution of the learning window with EPSP; (iv) the STDP learning window that implements the slowness objective is in good agreement with learning windows found experimentally.  With these results, we have reduced the gab between slowness as an abstract learning principle and biologically plausible STDP learning rules, and we offer a completely new interpretation of the standard STDP learning window.”

A Theoretical Basis for Emergent Pattern Discrimination in Neural Systems Through Slow Feature Extraction. Klampfl, Maas. Neural Computation 2010.

  1. Shows equivalence of slow feature analysis to Fisher linear discriminant
  2. “We demonstrate the power of this principle by showing that it enables readout neurons from simulated cortical microcircuits to learn without any supervision to discriminate between spoken digits and to detect repeated firing patterns that are embedded into a stream of noise spike trains with the same firing statistics.  Both these computer simulations and our theoretical analysis show that slow feature extraction enables neurons to extract and collect information that is spread out over a trajectory of of firing states that last several hundred ms.”
  3. Also can train neurons to keep track of time themselves
  4. Much of the learning the brain does in unsupervised.  SFA works in that scenario
    1. Intuition is that things that occur nearny in time are probably driven by the same underlying cause
  5. Perceptual studies show that slowness can form “…position- and view-invariant representations of visual objects in higher cortical areas.”
    1. Images were altered during blindness caused by a saccade, and the perception was that the two merged
    2. The authors of a related work <Li and DiCarlo – DiCarlo seems to be the common element of that work> say “unsupervised temporal slowness learning may reflect the mechanism by which the visual stream builds and maintains object representations.”
    3. Studies of IT in monkeys shows similar evidence
  6. This work builds on the theoretical basis of how slowness can be responsible for such effects
  7. Work by Sprekeler, Michaelis, Wiskott (07) also show SFA to be equivalent to PCA of a low-pass filter
    1. “In addition, they have shown that an experimentally supported synaptic plasticity rule, spike-timing-dependent plasticity (STDP), could in principle enable spiking neurons to learn this processing step without supervision, provided that the presynaptic inputs are processed in a suitable manner.”
  8. “… we show that SFA approximates the discrimination capability of the FLD in the sense that both methods yield the same projection direction, which can be interpreted as a separating hyperplane in the input space.”
  9. Intuitions drawn from SFA may help explain some phenomena that is sometimes not so well covered by “… the classical view of coding and computation in the brain, which was based on the assumption that external stimuli and internal memory items are encoded by firing states of neurons, which asssign a certain firing rate to a number of neurons maintained for some time interval.”
    1. In particular, trajectories in the brain are likewise encoded by trajectories of firing over a several hundred ms.
    2. The firing in terms of neural trajectories has been found in terms of static stimuli (such as odors and tastes) as well as things that are naturally time-varying (such as auditory or visual stimuli)
    3. Firing of sequences of neurons also true in cases when considering traces of episodic memory in hippocampus and cortex
  10. Being that there is neural activity that works in a temporally dispersed manner, perhaps SFA has something to say about how the brain actually uses that neural activity
  11. <There are also simulated neural results, but details are too complex to outline here so will address those points when I get to that section fully>
  12. A nice result of the neural experiments, and general properties of SFA is that when used over temporal data it behaves in an anytime sense (always producing whatever slow signals are most appropriate), so items that depend on its outputs may begin to formulate responses quickly
    1. This is especially important when set up in a hierarchical manner
  13. For the FLD part, consider linear basis functions only
  14. For classification, SFA has an additional parameter p, which dictates the probability of showing items of different classes back-to-back (needs probability of seeing in-class temporal pairs to be better than chance, but seems to be surprisingly robust to values anywhere in that range)
  15. In case where eigenvalues are nonzero <which should be the case?> the SFA objective and FLD are equivalent
  16. SFA/FLD can also be extended to the multiclass setting
  17. “The last line of equation 2.19 is just the formulation of FLD as a generalized eigenvalue problem… More precisely, the eigenvectors of the SFA problem are also eigenvectors of the FLD problem; the C-1 [where C is the number of classes] slowest features extracted by SFA applied to the time series x_t span the subspace that optimizes seperability in terms of he FLD… The slowest feature… is the weight vector that achieves maximal separation…”
  18. When used as a classifier, the optimal responses should look like a step function
  19. They then move onto the case where the trajectory provided to SFA is not made of isolated points, but rather sub-trajectories assembled into a larger trajectory
  20. The idea is to model what happens when sequences of neural firing occurs, such as is the case when episodic experience is reviewed mentally
  21. <Still a little unsure of the exact formulation of the input in this section>
  22. In an example where class means are similar, FLD produces the “right” linear separator, while SFA chooses one that is basically degenerate (each half-space consists about 50-50 of points from both classes, even though it is linearly separable)
  23. In this case, SFA and FLD produce different results, both analytically as well as empirically
    1. This is because “… the temporal correlations induced by the use of trajectories have an effect on the covariance of temporal differences in equation 2.24 compared to equation 2.12”
  24. <Again for trajectory of trajectories case> “… even for a small value of p, the objective of SFA cannot be solely reduced to the FLD objective, but rather there is a trade-off between the tendency to separate trajectories of different classes… and the tendency to produce smooth responses during individual trajectories…”
  25. In the vanilla SFA classification construction, SFA should really see transition between all pairs of points in the same class equally often; when trajectories are used this starts to fall apart because most of the time-pairs are within a single instance as opposed to between two items in the same class; as the sub-trajectories become long they dominate the features constructed so the ability to do classification is lost
  26. Linear separators aren’t strong, but as dimension increases the ability to do linear separation becomes increasingly similar <although I would also argue that as dimension increases risk of overfitting comes along with it>
    1. Not only does separability increase, margin does as well
    2. “In other words, a linear readout neuron with d presynaptic units can separate almost any pair of trajectories, each defined by connecting fewer than d randomly drawn points.”
  27. Now moves onto “… SFA as a possible mechanism for training readouts of a biological microcircuit.”
  28. Based on previous discussion, when data points themselves are trajectories, each slow feature itself wont classify data (as it does in the vanilla case): “… we predict that the class information will be distributed over multiple slow features.”
  29. high-order slow feature is one that is fast
  30. Now getting on to their simulated neural results.  Not sure how SFA ties in exactly yet
  31. They run SFA on the output of the neural circuit
  32. “We then trained linear SFA readouts on the 560-dimensional circuit trajectories, defined as the low-pass filtered spike trains of the spike response of all 560 neurons of the circuit…”
    1. The signal passed in consists of a sequence of static input, with poisson noise in between
    2. At first glance, tehre slow features don’t seem to respond much to the distinction between signal/noise
    3. But on average, the first 2 slow features don’t respond at all to the noise, while responding with a noticeable pattern to the signal.
  33. “One interesting property of this setup is that if we apply SFA directly on the stimulus trajectories, we basically achieve the same result [as they do when running on the neural respones].  In fact, the application to the circuit is the harder task because of the variability of the response to repeated presentations  of the same pattern and because of temporal integration: the circuit integrates input over time, making the response during a pattern dependent on the noise input immediately before the start of the pattern.”  Noise in the circuit takes a little while to die down after the static signal is introduced, which makes it harder to pick up the static signal
    1. <I don’t really get this, because earlier in the paper they said that SFA is bad for classifying items that manifest as a time-series.  I though the point of using a neural circuit was to try and circumvent that issue, but here they are saying it works better without the neural circuit?>
  34. Now moving on to recognition of spoken digits
  35. “We preprocessed the raw audio files with a model of the cochlea (Lyon, 1982) and converted the resulting analog cochleagrams into spike trains that serve as input to our microcircuit model (see section B.3.2 for details).”
  36. Classification between digit utterances of words “one” and “two”
  37. <Not groking their training methodology for this section – not so clearly written and I probably need to eat lunch.  They mention training two different times but I’m not clear on what the purpose of what this distinction is>
  38. “We found that the two slowest features, y_1 and y_2, responded with shapes similar to half sine waves during the presence of a trajectory… which is in fact the slowest possible response under the unit variance constraint.  Higher-order features partly consisted of full sine wave responses, which are the slowest possible responses under the additional constraint to be decorrelated to previous slow features.”
  39. <Immediately following in next P> “In this example, the slowest feature y_1 already extracts the class of the input patterns almost perfectly: it responds with positive values for trajectories in response to utterances of digit 2 and with negative values for trajectories of digit 1 and generalizes this behavior to unseen test examples.”
  40. The first slow feature is closest to the FLD
  41. The first slow feature encodes what (digit), while the second slow feature encodes where (corresponding to a position in the trajectory identified by SF1).  This has been found with other application of SFA (Wiskott Sejnowski 02).  Other faster features encode a mixture of the two
  42. A linear classifier on the SFAs (itself with a linear kernel) is very effective (98%)
  43. Training FLDs and SVMs on the same data as SFA results in poorer performance <Hm.  SVMs are pretty powerful – wonder why this result is what they have>
  44. They then go onto the same classification problem but with more data: “Due to the increased number of different samples for each class (for each speaker, there are now 10 different digits), this task is more difficult than the speaker-independent digit recognition.”
    1. “No single slow feature extracts What-information alone; the closest feature to the FLD is feature y_3.  To some extent also, y_4 extracts discriminative information about the stimulus.”
    2. “In such a situation where the distance between the class means is very small, the tendency to extract the trajectory class itself as a slow feature becomes negligible. In that case, the theory predicts that SFA tries to distinguish each individual trajectory due to the decorrelation …”
  45. In this situtation, SFA can be used as a preprocessor, but is not so useful as a classifier itself
  46. In some sense SFA is poorly suited to direct application on neural bursting information because such activity is inherently non-slow <Here maybe they low-pass filter it first?>
  47. When trying to do classification of time-series (sequences), “… the optimization problem of SFA can be viewed as a composition of two effects: the tendency to extract the trajectory class as a slow feature and the tendency to produce a smooth response during individual trajectories.”
  48. “In the context of biologically realistic neural circuits, this ability of an unsupervised learning mechanism is of particular interest because it could enable readout neurons, which typically receive inputs from, a large number of presynaptic neurons of the circuit, to extract from the trajectory of network states information about the stimulus that has caused this particular sequence of states–without any ‘teacher’ or reward.”
  49. Earlier work by Berkes showed similar results between SFA and FLD for handwritten digit recognition, but the two were not formally linked in that work
  50. <An excellent paper with tons of good references – worth rereading.>
Tagged

Generating Feature Spaces for Linear Algorithms with Regularized Sparse Kernel Slow Feature Analysis. Bohmer, Grunewalder, Nickisch, Obermayer. Machine Learning 2012.

  1. Deals with a way of automatically constructing nonlinear basis functions via SFA
  2. “Real-world time series can be complex, and current SFA algorithms are

    either not powerful enough or tend to over-fit. We make use of thekernel trickin combination with sparsificationto develop a kernelized SFA algorithm which provides a powerful

    function class for large data sets.”

  3. Also uses regularization to prevent overfitting on small data sets
  4. Hypothesize that “…our algorithm generates a feature space that resembles a Fourier basis in the unknown space of latent variables underlying a given real-world time series.
  5. Assume that solutions are defined on a low dimensional space of latent variables Θ which is embedded in the high dimensional space X that is actually observed
  6. Look for a feature space Φ s.t.
    1. For all i = {1,…p} φi, is non-linear in X in order to encode Θ as opposed to X
    2. For all = {1,…p} φi, is “… a well behaving functional basis in Θ, e.g. a Fourier basis
    3. Size of p is as low as possible to represent Θ
  7. Although there have been numerous studies highlighting its resemblance to biological sensor processing (…), the method has not yet found its way in the engineering community that focuses on the same problems. One of the reasons is undoubtedly the lack of an easily operated non-linear extension that is powerful enough to generate a Fourier basis.”
  8. Based on Kernel SFA

<Wordpress ate the rest of this post… Grr.>