HC-Search: Learning Heuristics and Cost Functions for Structured Prediction. Rao Doppa, Fern, Tadepalli. AAAI 2013

Won best paper award

  1. “Structured prediction is the problem of learning a function from structured inputs to structured outputs.”
  2. This paper deals with structured prediction applied to search (although it is not the first such paper)
  3. “Given a structured input, the framework uses a search procedure guided by a learned heuristic H to uncover high quality candidate outputs and then uses a separate learned cost function C to select a final prediction among those outputs.”
  4. The regret can be found as inability of H to find good results, and then C not selecting the best output <not sure what these two things mean exactly yet>
  5. Based on these two factors of regret, regret is minimized first by “training H to quickly uncover high quality outputs via imitation learning, and then training C to correctly rank the outputs generated via H according to their true losses.”
  6. <Claim> Excellent empirical results <this is probably true otherwise it would not win best paper award>
  7. Structured prediction is just supervised learning.  <Although it doesn’t seem like it is always part of structured learning> They leverage a cost function <which I suppose must be equivalent to an error metric>
  8. In general minimizing the cost function is intractable, and in practice “… heuristic inference methods such as loopy belief propagation and variational inference have shown some success in practice.  However, the learning algorithms generally assume exact inference and their behavior in the context of heuristic inference is not well understood.”
  9. <It seems like the problem is discrete, because they say that if the optimization problem is a tree then it is tractable, and also…>
  10. “Another approach to dealing with the Argmin problem is to eliminate it altogether and instead attempt to learn a classifier that can greedily produce a structured output by making a series of discrete decisions (…)  Unfortunately, in many problems, some decisions are difficult to make by a greedy classifier, but are crucial for good performance.”
  11. Mention some method which seems to be state of the art at the moment, but mention that it has a number of serious limitations that will be addressed here
  12. <I honestly am not sure what type of paper I’m reading here so far.  Is it search, optimization, or supervised learning?>
  13. The decomposition of the problem into H,C “… the performance of the approaches with a single function can be arbitrarily bad when compared to that of HC-Search in the worst case.”
  14. In practice, HC-Search also works better<, so it seems like a case where assumptions in the theory hold up>
  15. “Finally, we note that our methodology can be viewed as similar in spirit to Re-Ranking (Collins 2000), which uses a generative model to propose a k-best list of outputs, which are then ranked by a separate ranking function. In contrast, we search in efficient search spaces guided by a learned heuristic that has minimal restrictions on its representation.”
  16. <Seems like> the setting is like standard supervised learning, except you are also provided with a cost function that must be optimized <I suppose most supervised learning algorithms assert what cost they are optimizing, such as MSE, which might not be the case in the problem defined here>
  17. There is a sort of forward model given <seems like it can do multistep look ahead, starting with x and proposing different possible solutions y>
  18. <So it seems like there is a sort of combinatoric search problem>
  19. Both learned heuristic H and learned cost function C are mappings from × Y to reals.
    1. In addition there is some search strategy <seems like here they just assume greedy>
  20. Given some x and some time bound, is run using H and the best item found in terms of C during the whole search is returned
  21. They “… consider the feasibility of an exact solution to this learning problem in the simplest setting of greedy search using linear heuristic and cost functions represented by their weight vectors wH and wC respectively.
  22. They “… consider the HC-Search Consistency Problem, where the input is a training set of structured examples, and we must decide whether or not there exists wH and wC such that HC-Search using greedy search will achieve zero loss on the training set.”
    1. This problem is NP-Hard.  In fact, both problems of finding a zero-loss heuristic and cost are np-hard
  23. Even though the problem is NP-hard, the decomposition of and C still helps
  24. Because the problem is intractable, they resort to greedy stagewise search for and C
  25. Most generally, learning a heuristic can be viewed as a Reinforcement Learning (RL) problem where the heuristic is viewed as a policy for guiding “search actions” and rewards are received for uncovering high quality outputs. In fact, this approach has been explored for structured prediction in the case of greedy search (Wick et al. 2009) and was shown to be effective given a carefully designed reward function and action space. While this is a viable approach, general purpose RL can be quite sensitive to the algorithm parameters and specific definition of the reward function and actions, which can make designing an effective learner quite challenging. Indeed, recent work (Jiang et al. 2012), has shown that generic RL algorithms can struggle for some structured prediction problems, even with significant effort put forth by the designer.”
    1. They go with a non-RL approach, imitation learning
  26. Their formulation of imitation learning is distinct from supervised learning in that they only care that it helps the heuristic make correct choices.  Basically the only concern is that it gets the ordering of the values of the options correct, as opposed to getting the values themselves right
    1. This type of strategy is called rank-based
    2. “Most common search procedures such as greedy search, beam search, and best-first search fall under this category.”
  27. The rank-based learner they used is called “the online passive-aggressive algorithm”
  28. The cost-function learner <seems> to be trained to accurately predict the data from the distribution that A and H yield
  29. Their experimental results cover large standard test sets, such as handwriting recognition, scene labeling and some others
  30. “The results in the scene labeling domain are the most significant improving the error rate from 27.05 [the previous state of the art] to 19.71 .”
  31. “One of the advantages of our approach compared to many frameworks for structured prediction is the ability to use more expressive feature spaces without paying a huge computational price.”
  32. Outperforms other methods basically across the board <are there other state of the art methods here that aren’t listed?>
  33. Plain C-Search (as opposed to HC-Search) doesn’t work as well, and even with 3rd order features works poorer than HC-Search with 2nd order features.  The difference in performance can be attributed to the fact that HC-Search optimizies for the distribution found by H, which plain C-Search does not
  34. When giving the true loss function instead of the learned loss function H, improvements in performance weren’t significant

Actor-Critic Reinforcement Learning with Energy-Based Policies. Heess, Silver, Teh. JMLR W&C EWRL 2012

  1. For high-D RL
  2. Policy gradient with restricted Boltzmann machines (RBMs)
  3. Builds on Sallans and Hinton, but the trick here is to not use Boltzmann machines for VFA as that diverges
  4. Actor-critic
  5. Converges to local optimum
  6. “We consider a class of policies based on energy-based models [LeCun et al., 2006], where the (negative) log probability of selecting an action is proportional to an energy function… They can learn deep, distributed representations of high-dimensional data (such as images) and model high-order dependencies, and here we will use them to directly parameterize representations over states and actions.”
  7. RBMs aren’t just for RL – they do supervised learning as well
  8. Energy-based polices for RL are difficult to use because:
    1. The optimization landscape is nonconvex
    2. “… it is often intractable to compute the partition function that normalizes the energy function into a probability distribution”
    3. Evaluating a policy has high variance
  9. Uses “… the natural gradient, which reduces the dependence of the performance of the policy gradient on the parameterization [...]; and by using an actor-critic architecture, which uses an approximate value function to reduce the variance in the gradient estimates [...].”
  10. 2 actor-critic algorithms are introduced
  11. “The direction of the vanilla policy gradient is sensitive to reparameterizations of the policy that don’t affect the action probabilities.  Natural policy gradient algorithms [...] remove this dependence, by finding the direction that improves J(Θ) the most for a fixed small amount of change in distribution (e.g. measured by K.L. divergence).”
  12. The natural TD actor-critic algorithm (earlier work by other authors) has guarantees that “Provided that the average reward, TD error and critic are updated on slower time scale than for the actor, and step sizes are reduced at appropriate rates, NTDAC converges to a policy achieving near-local maximum average reward [...].”
  13. <Props for doing their experiments right> “In a cross-validation style setup, we first choose the parameters based on a preliminary parameter sweep, then fix the parameters and perform a final run for which we report the results.”
  14. The two algorithms they introduce here significantly outperform competitors
  15. <Props on doing OCTOPUS ARM!  Although they just have actions as binary values on 4 components, so its still pretty drastically simplified, seems like they use this representation for the shape too, so its not really enormous>
    1. Also shaping is included
    2. They actually loose out slightly to neural networks here, although catch up by the end
    3. “Some ENATDAC runs were trapped in suboptimal maxima where the arm moves only in one direction, thus taking a longer time to reach the target on some episodes…”

reinforcement learning with factored states and actions. Sallans Hinton. jmlr 2004

  1. Covers VFA for high dimensional state andexpe space
  2. Does the approximation as “… Negative free energies in an undirected graphical model called a product of experts
  3. Action selection is done by mcmc
  4. In one experimental result action spaces are 2^40
  5. VFA and action selection are two separate issues discussed here
  6. ” Our approach is to borrow techniques from the graphical modeling literature and apply then to the problems of value estimation and action selection.”
  7. Looks like considers pomdps.
    1. Oh looks like they use inference methods for pomdps to infer state value
  8. In their system computing “free energy” which I think is equivalent to value is tractable while doing action selection isn’t so mcmc is used
  9. Mdps considered here are very large but finite
  10. Looks like they do VFA according to TD updates
  11. For value approximation use a “particular kind of product of experts, called a restricted Boltzmann machine (…).”
  12. “Boltzmann machines are undirected models. That means that the model satisfies joint pro probabilities, rather than conditional probabilities.”
    1. DBNs on the other band are directed
  13. The reason for using undirected models is that value estimation in that case is tractable which isn’t the case in directed models (the inference is hard)
    1. A directed graph also restricts use to an actor-critic model
  14. In Boltzmann machines there are visible and hidden nodes with symmetric weights pairwise between all vertices
    1. Hidden nodes don’t have a fixed value so the approach is to consider all possible settings of hidden variables
    2. The probability of settings of hidden nodes is according to Boltzmann distribution
    3. Not taking extensive notes on this though
    4. Finding the equilibrium of the Boltzmann machine is done by mcmc.
  15. Boltzmann machines are called product of experts as each hidden node is called an expert and the values are simply products between nodes
  16. Exploitation with Boltzmann rule I think it works out simply from the Boltzmann machine itself
  17. The large action task they mentioned in the beginning is very smooth and not sure it has a sequential component
  18. There is another multiagent task they test in

Hashing with graphs. Liu Wang Kumar Chang. Icml 2011

  1. thsiders efficient hashing for high dimensional datasubdivides the rankings on a low d manifold
  2. graph based. Uses anchor graphs
  3. allows for constant time hashing by extrapolating graph laplacian eigenvectors to eigenfunctions
  4. Also shows hierarchical threshold learning where each eigenfunction produces multiple bits increasing accuracy
  5. For knn search
  6. Kd trees are usually ineffective in high d data
  7. Instead of kd trees use hashing
  8. Early basis of work is on locality sensitive hashing but that approach has some problems that are addressed here
    1. One problem is that approach is not data dependent
    2. One type is called spectral hashing but that isn’t considered here
  9. A problem with existing unsupervised hashing methods is they need a global distance measure. When dealing with manifolds you really want a different metric which this algorithm finds
    1. To do this exactly it becomes a quadratic cost which is prohibitive
  10. @Anchor graphs approximate the manifold by using a subset of points called anchors
    1. Allows for linear time algorithm
  11. The eigenvectors of the anchor graph laplacian can be generalized to functions in constant time
  12. Hashing on anchor graphs capture semantics so points close in hamming space produced by agh will also have similar semantic labels
  13. This works so well that similarity found by this approach outperforms a standard exhaustive search based on l2 norm
  14. In an anchor graph you select some small number Kand then apply k means to the points to find the k anchors.
    1. All points then have distance computed to all anchors as opposed to pairwise from all points to all points
  15. This creates an approximate adjacency matrix that has properties of
    1. Nonnegative
    2. Sparse
    3. Low rank
    4. Is doubly stochastic meaning the graph laplacian L = I-\hat{A}, where the last part is simply the anchor approximation of adjacency matrix
  16. These properties mean that eigen function extensions of the graph laplacian are possible
  17. Because it’s low rank doing eigen decomposition is cheap
  18. This process only computes values for points in corpus so generalization is needed in order to hash new data points
  19. Eigenvectors are transformed to eigenfunctions by nystrom method which I’ve heard of but don’t know about
  20. Setting up the eigenfunctions seems pretty simple
  21. this allows for hashing that isn’t dependent on size of corpus
  22. To generate r-bit codes r graph laplacian eigenvectors are used but you don’t simply just go up in r because if data is on a manifold you want to stay on lower eigenvectors.
    1. A simple hierarchical scheme is introduced here that resamples low valued eigenvectors
  23. the hierarchical scheme is that each subsequent bit subdivides the remaining points in that set
  24. Empirical performance is excellent compared to competitors

Untangling Invariant Object Recognition. Dicarlo, Cox. TRENDS in Cognitive Sciences 2007.

Read this because this paper is pointed to as evidence of slow features doing object recognition, but really there is very little mention of SFA.

  1. Considers object recognition, and “…show that the primate ventral visual processing stream achieves a particularly effective solution in which single-neuron invariance is not the goal.”
  2. “Because brains compute with neurons, the subject must have neurons somewhere in its nervous system — ‘read-out’ — neurons – that can successfully report if objectg A was present [...].”
  3. “The central issues are: what is the format of the representation used to support the decision (the substrate on which the decision functions operate); and what kinds of decision functions (i.e. read-out tools) are applied to that representation?”
  4. “… we treat object recognition fundamentally as a problem of data representation and re-representation, and we use simple decision functions (linear classifiers) to examine those representations.”
  5. Object recognition is fundamentally difficult, because “… each image projected into the eye is one point in an ~1 million dimensional retinal ganglion cell representation (…).”
  6. Manifolds: “… all the possible retinal images that face could ever produce… form a continuous, low-dimensional, curved surface inside the retinal image space called an object ‘manifold’…”
  7. Consider two potential image manifolds that “… do not cross or superimpose — they are like two sheets of paper crumpled together… We argue that this describes the computational crux of ‘everyday’ recognition: the problem is typically not a lack of information or noisy information, but that the information is badly formatted in the retinal representation — it is tangled (…).”
  8. “One way of viewing the overarching goal of the brain’s object recognition machinery, then, is as a transformation from visual representations that are easy to build (e.g. center-surround filters in the retina), but are not easily decoded … into representations that we do not yet know how to build (e.g. representations in IT), but are easily decoded…”
  9. “… single neurons at the highest level of the monkey ventral visual stream — the IT cortex — display spiking responses that are probably useful for object recognition.  Specifically, many individual IT neurons respond selectively to particular classes of objects, such as facesor other complex shapes, yet show some tolerance to changes in object position, size, pose, and illumination, and low-level shape cues.”
  10. In <what I suppose is actual> neural responses from 200 neurons in IT (the highest part of the visual system). Simple linear classifiers on activity were robust to variation in object position and size.  More sophisticated classifiers didn’t improve performance much, and were ineffective when applied directly to v1 activity.
    1. V1 doesn’t allow for simple separation and identification of objects but IT does.
  11. The classical approach to understanding visual processing is to consider the action of individual neurons — here they consider more the activity of entire populations in the representation of visual images
  12. According to this perspective, the best way to build object representations is to find a way to separate manifolds
  13. Because populations are considered, it is not the case that a single neuron must represent things (although if a single neuron can split the manifolds that is great).  Because “… real IT neurons are not, for example, position and size invariant, in that they have limited spatial receptive fields <as opposed to wide spatial receptive fields, as would be required if single neurons were actually responsible for representing invariances>[...].  It is now easy to see that this ‘limitation’ <of relying on populations of neurons as opposed to single neurons> is an advantage.”
  14. Manifold flattening may be hard wired, but it may also be the case that the means to do this is learned from the statistics of natural images. <The latter must play at least part of the role>
  15. Progressive flattening by increasing depth in visual system
    1. “… we believe that the most fruitful computational algorithms will be those that a visual system (natural or artificial) could apply locally and iteratively at each cortical processing stage (…) in a largely unsupervised manner(…) and that achieve some local object manifold flattening.”
  16. In the actual visual system, each layer projects information into a spaces of increasing dimension “… e.g. ~100 times more v1 neurons than retinal ganglion neurons …” as data gets projected to higher dimensions (if the projection is good) it makes simpler methods of analysis and classification more effective
    1. Additionally, at each stage the distribution of activity is “… allocated in a way that matches the distribution of visual information encountered in the real world …” which means the projections into higher dimensions is actually using the additional space reasonably
  17. The addition that temporal information helps untangle manifolds: “In the language of object tangling, this is equivalent to saying that temporal image evolution spells out the degrees of freedom of object manifolds.  The ventral stream might use this temporal evolution to achieve progressive flattening of object manifolds across neuronal processing stages.  Indeed, recent studies in our laboratory … have begun to connect this computational idea with biological vision, showing that invariant object recognition can be predictably manipulated by the temporal statistics of the environment.”
    1. This is where SFA is cited

How much of reinforcement learning is working memory, not reinforcement learning? A behavioral, computational, and neurogenetic analysis. Collins, Frank. European Journal of Neuroscience 2012.

  1. Uses an experiment specifically designed to tease apart contributions of working memory (WM) and the part of the brain more traditionally associated with RL (“…corticostriatial circuitry and the dopaminergic system.”)
    1. “By systematically varying the size of the learning problem and delay between stimulus repetitions, we separately extracted WM-specific effects of load and delay on learning. “
  2. Propose a new model for interactoin of RL and WM
  3. “Incorporating capacity-limited WM into the model allowed us to capture behavioral variance that could not be captured in a pure RL framework even if we (implausibly) allowed separate RL systems for each set size.  The WM component also allowed for a more reasonable estimation of a single RL process.”
  4. Also some genetics work <although I will probably go light on it in this reading>
  5. “Activity and plasticity in striatal neurons, a major target of dopaminergic efferents, are dynamically sensitive to these dopaminergic prediction error signals, which enable the striatum to represent RL values (O’Doherty et al., 2004; Frank, 2005; Daw & Doya, 2006)”
  6. “Thus, although this remains an active research area and aficionados continue to debate some of the details, there is widespread general agreement that the basal ganglia (BG) and dopamine are critically involved in the implementation of RL.”
  7. Humans, at least probably rely on more than simply prediction errors, we have the ability to do forward search, for example
  8. Genes controlling dopaminergic function may cause changes in behavior either after initial learning has occurred or during learning.  “Similarly, functional imaging studies have shown that dopaminergic drugs modulate striatal reward prediction error signals during learning, but that these striatal signals do not influence learning rates during acquisition itself; nevertheless, they are strongly predictive of subsequent choice indices measuring the extent to which learning was sensitive to probabilistic reward contingencies (Jocham et al., 2011).”
  9. Their experiments involve binary payoffs, document interaction of “…higher order WM and lower level RL components…” and show that accounting for WM can explain “…crucial aspects of behavioral variance…”
  10. “We further show that a genetic marker of prefrontal cortex function is associated with the WM capacity estimate of the model, whereas a genetic marker specific to BG function relates to the RL learning rate.”
  11. In the experiment there is binary with a presented signal and 3 possible responses (a 3-arm contextual bandit)
  12. They did genetic evaluation of the subjects
  13. Subjects did well in the task, with the last couple of trials/block being at >94% accuracy, and learning generally stabilized in 10 or less samples
  14. Different learning episodes varied in the size of the set that needed to be learned.  They considered impact of working memory in terms of load and delay, in terms of load, there may be a limit to the number of stimulus-response mappings that can be remembered, but in delay they consider the case where information may be cycling through, so consider the temporal connections between stimuli
    1. Considered how behavior diverged from optimal <regret!> in terms of delay.  This means they consider cases where the correct response was already given to a given stimulus
  15. Turns out that people were more likely to respond in error if the same stimulus was presented twice in a row than if there was a short delay between them <seems like the opposite of a switch cost?>
    1. “This indicated that, when the same stimulus was presented twice in a row, subjects were more likely to make an error in the second trial after having just responded correctly to that stimulus for lower set sizes. This finding may reflect a lower degree of task engagement for easier blocks, leading to a slightly higher likelihood of attentional lapses.”
  16. Logisitic regression done on the following variables: set size, delay since correct response to particular stimulus, total number of correct responses to stimulus
    1. Main effect of set size and correct repetitions
    2. Effect of delay was actually not a main effect but it did interact with # correct repititions
    3. “These results support the notion that, with higher set sizes as WM capacity was exceeded, subjects relied on more incremental RL, and less on delay-sensitive memory.”
  17. The logistic regression captured performance pretty accurately (mostly seems to have smoothed out the actual results in a good way) so gives a reasonable means to determine contributions of the various components to the final result
  18. Penalized increasingly complex models with Aikake’s information criterion
  19. The RL+WM model had the best fit “Thus, accounting for both capacity-limited WM and RL provides a better fit
    of the data than either process on its own (i.e. pure WM or pure RL models). Importantly, there was no trade-off in estimated parameters between the two main parameters of interest: capacity and RL learning rate, as would be revealed by a negative correlation between them”
  20. The capacity of WM estimated by the models was 3.7 +/- 0.14 which <kind of> agrees with standard estimtes
  21. <Basically skipping genetics part>
  22. ” The vast majority of neuroscientific studies of RL have focused on mechanisms underlying the class of ‘model-free’ RL algorithms; they capture the incremental learning of values associated with states and actions, without considering the extent to which the subject … can explicitly plan which actions to make based on their knowledge about the structure of the environment.”
  23. “It is clear that one such computational cost [of doing forward-search] is the capacity limitation of WM, which would be required to maintain a set of if⁄then relationships in mind in order to plan effectively … This secondary process is often attributed to the prefrontal cortex and its involvement in WM.”
  24. Here they show that WM has implications in even the simplest sort of behavior tasks, which models almost always don’t factor for
  25. When standard model-free RL algorithms are used to model behavior, their parameters naturally need to be searched over to fit the actual data.  Results here, however, show that they are really (partially) trying to adjust for capacity in WM as opposed to what they actually mean in the algorithm.   They are “… no longer estimates of the intended RL processes and are therefore misleading. Even when separate RL parameters were estimated for each set size in our experiment (an implausible, non-parsimonious model with 10 parameters), it did not provide as good a fit to the data as did our simpler hybrid model estimating WM contributions together with a simple process.”
  26. “The experimental protocol allowed us to determine that variance in behavior was explained separately by two characteristics of WM: its capacity and its stability. Indeed, behaviorally, learning was slower in problems with greater load, but there were minimal differences in asymptotic performance. Furthermore, although performance was initially highly subject to degradation due to the delay since the presented stimulus was last observed, this delay effect disappeared over learning.”
    1. Eventually the RL system supercedes the WM system
  27. Results here show that some components of commonly conducted RL experiments may be producing unanticipated influences on the results, also implications on whether policy is studied during learning or only after learning has occurred

How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis. Escalante-B, Wiskott. CogPrints 2013.

  1. Their extension for supervised SFA is called graph-based SFA
  2. “The algorithm extracts a label-predictive low-dimensional set of features that can be post processed by typical supervised algorithms to generate the final label or class estimation.”
  3. Trained with a graph where edge weights represent similarities
  4. The modification to SFA made here is that it accepts weights
  5. There are different ways of building these graphs, a very simple method generates results equivalent to the Fisher linear discriminant
  6. Claim is that supervised learning on high-dimensional data is tough, so often a dimension reduction step is taken (perhaps unsupervised).
    1. Here, a supervised dimension reduction step is proposed
  7. “GSFA and LE [Laplacian Eigenmaps] have the same objective function, but in general GSFA uses different edge-weight (adjacency) matrices, has different normalization constraints, supports nonde-weights, and uses function spaces.”
  8. GSFA can be used for both regression or classification, many approaches only work for one of the two
  9. “The central idea behind GSFA is to encode the label information implicitly in the structure of the input data, as some type of similarity matrix called edge-weight matrix, to indirectly solve the supervised learning problem, rather than performing an explicit fit to the labels.”
  10. In the graph, there are edge weights along with node weights, which specify a-priori sample properties
  11. “… hierarchical processing can also be seen as a regularization method because the number of parameters to be learned is typically smaller than if a single SFA node with the number of parameters than if a single SFA node with a huge input is used, leading to better generalization.”
    1. Another advantage is that if non-linear bases are used, the nonlinearity can allow for increasingly more complex functions per layer
  12. In graph edges are undirected, weighed, although it seems that the approach trivially generalizes to the directed case
  13. Basically they rewrite the original constraints of SFA with added weights
  14. Non-existing edges are given 0-weight
  15. Seems like they just end up using the graph to exactly calculate what the dynamics would be based on initialization probabilities (vertex weights) and transition probabilities (edge weights)
  16. How to construct the graph for either classification or regression is then discussed
  17. For graphs, they simply generate a separate graph for each class, with each item in each graph fully connected, and each sub-graph completely unconnected to items in a separate class, so basically there are independent fully connected components for each class
    1. There are some tricks that can be used due to the symmetry in each class cluster to make processing cheaper
  18. What is found by the classifier in this construction is equivalent to that of the Fisher linear discriminant
  19. “Consistent with FDA, the theory of SFA using unrestricted function space (optimal free responses) predicts that, for this type of problem, the first S – 1 slow features extracted are orthogonal step functions, and are piece-wise constant for samples from the same identity (…).  This closely approximates what has been observed empirically, which can be informally described as features that are approximately constant for sample of the same identity, with moderate noise.”
  20. <Immediately next paragraph> “When the features extracted are close to the theoretical predictions (e.g., their Δ-values are small), their structure is simple enough that one can use even a modest supervised step after SFA, such as a nearest centroid or a Gaussian classifier (in which a Gaussian distribution is fitted to each class) on S-1 slow features or less.”
    1. Using SVMs over Gaussians doesn’t make performance that much better, while being computationally more expensive
  21. Now on to regression
  22. For regression “The fundamental idea is to treat labels as the value of a hidden slow parameter that we want to learn.  In general, SFA will not extract the label values exactly.  However, optimization for slowness implies that samples with similar label values are typically mapped to similar output values.  After SFA reduces the dimensionality of the data, a complimentary explicit regression step on a few features solves the original regression problem.”
  23. They discuss 4 ways of doing the regression for SFA, the first one actually doesn’t even leverage
  24. In the version that doesn’t leverage graphs, simply sort data and then pass into SFA.  “Due to limitations of the feature space considered, insufficient data, noise, etc., one typically obtains noisy and distorted versions of the predicted signals.”
    1. On the other hand, its the easiest to implement (partially because vanilla SFA can be used) so “… we recommend its use for first experiments.”  If that doesn’t work well, use the GSFA approaches
  25. In the “sliding window training graph” items are sorted as above, but each vertex is connected to the d closest left and right items
  26. They recommend not using just 0 and 1 weights as it leads to “pathological solutions” – this may be what we’re picking up in ToH, and talk about why that happens. <This is worth investigating further.>
  27. In the “serial training graph,” data points are binned together - then points are all connected to points in adjacent bins, but they don’t connect all the points in a same bin  <why?>
    1. As is the case in other particular structures, can set up GSFA to be more efficient for this particular case
    2. Naturally, there is tuning required to see that the binning was done correctly
  28. The “mixed training graph” adds connections within a bin
  29. Then there is a supervised step on top of this stuff <am I missing something – I thought there were 4 in total?>
  30. “There are at least three approaches to implement the supervised step on top of SFA to learn a mapping from slow features to the labels. ” <
    1. First option is linear or nonlinear regression
    2. To bin and then classify <so you end up with discrete approx of regression?>
    3. Do a weighted version of #2 so you get continuous estimations
    4. <#s 2 and 3 immediately above look terribly hacky, if I am groking them correctly>
  31. Experimental results
  32. For classification they only check to see that indeed SFA does the same thing as Fisher linear discriminant (because that has already been studied exhaustively), which it does
    1. Interestingly in the benchmark task used, convnets are best, and outperform humans
  33. In the regression problems they take photos of people and estimate the horizontal position of the face, vertical position, and size.  This is all done separately <why?  Ah, because the sorting depends on the output variable, so you can only sort according to one… although it seems like a pretty simple extension could handle higher-dimensional outputs>
  34. Take face pictures from a number of data sets (a total of 64,471) and were “… automatically pre-processed through a pose-normalization and pose-reintroduction step. Basically they are all centered and then from there shifted and zoomed according to a distribution.  This way, they know what the x,y,z values they are estimating are
  35. Because of the size of the corpus and images themselves, its difficult to apply algs like SVMs directly, so they use hierarchical SFA and GSFA (which they also call HSFA <- great>)
    1. They also do a hierarchical version of PCA, which sort of does the opposite thing of SFA.  The 120 HPCA features used explain 88% of the variance
  36. Used a few different post-dimension reduction classifiers, including SVM
  37. The slow features of the data gets organized in a more orderly fashion as go up in hierarchy
  38. “… GSFA was 4% to 13% more accurate than the basic reordering of samples employing standard SFA.  In turn, reordering was at least 10% better than HPCA for this dataset.”
  39. Only 5 HSFA features are used, whereas 54 for HPCA. “This can be explained because PCA is sensitive to many factors that are irrelevant to solve the regression problem, such as the vertical position of the face, its scale, the background lighting, etc. Thus, the information that encodes the horizontal position of a face is mixed with other information and distributed over many principal components, whereas it is more concentrated in the slowest components of SFA.”
  40. Mixed and serial (straight SFA) outperformed the sliding window graphs <they were surprised but I’m not, at least with regards to mixed as it regular sliding window just seems like a strange construction).  The serial was actually better than the mixed, although the difference wasn’t significant
  41. They call these approaches “implicitly supervised” because the construction of the graph depends on the supervised labels, but the algorithm never sees those labels explicitly
  42. “The experimental results demonstrate that the larger number of connections considered by GSFA indeed provides a more robust learning than standard SFA.”
  43. Knock unsupervised dimension reduction by doing dimension reduction that doesn’t necessarily help in the task you are actually interested in <But this is only “implictly” supervised, by the same logic fully supervised dimension reduction would be better yet.>
  44. Being able to simply specify a graph means there is no need to exhaustively harvest data from a graph you may already have, as is the case in standard SFA
  45. GSFA has a tendency to overfit because it is not regularized, and is sensitive (in a bad way) to multiple types of data being used

Slowness: An Objective for Spike-Timing-Dependent Plasticity? Sprekeler, Michaelis, Wiskott. PLoS Computational Biology 2007

I Picked up this paper because I saw a reference to it saying this paper shows SFA equivalent to PCA of a low-pass fiter

  1. Explores how SFA can be implemented “…within the limits of biologically realistic spike-based learning rules.”
  2. Show a few ways SFA could be implemented with different neural models
  3. Fastness is measured by the average variance of the time derivative of the output features
  4. The part of the algorithm that is least plausible from the perspective of a biological system is the eigen decomposition.  The main aim of this paper is to show how this could be implemented “… in a spiking model neuron”
  5. “In the following, we will first consider a continuous model neuron and demonstrate that a modified Hebbian learning rule enables the neuron to learn the slowest (in the sense of SFA) linear combination of its inputs.  Apart from providing the basis for the analysis of the spiking model, this section reveals a mathematical link between SFA and the trace learning rule, another implementation of the slowness principle.  We then examine if these findings also hold for a spiking model neuron, and find that for a linear Poisson neuron, spike-timing-dependednt plasticity (STDP) can be interpreted as an implementation of the slowness principle
  6. “Even though there are neurons with transient responses to changes in the input, we believe it would be more plausible if we could derive an SFA-learning rule that does not depend on the time derivative, because it might be difficult to extract, especially for spiking neurons.”
    1. The time derivative can be replaced by a low-pass filter (a fair amount of math to show that)
    2. <But earlier in the paper they wrote> “It is important to note that the function g1(x) is required to be an instantaneous function of the input signal.  Otherwise, slow output signals could be generated by low-pass filtering the input signal.  As the goal of the slowness principle is to detect slowly varying features of the input signals, a mere low-pass filter would certainly generate slow output signals, but it would not serve the purpose.” <So then what is the difference between this and the low pass filter they just mentioned?>
    3. <After all the math> “Thus, SFA can be achieved either by minimizing the variance of the time derivative of the output signal or by maximizing the variance of the appropriately filtered output signal.” <Oh, I see.  You can’t just filter the output, you have to set up the system so it maximizes the variance of the filtered output?>
  7. Basically from a whitened input, you can either: use the time derivative and then choose the direction of minimal variance, or use a low-pass filter and the choose the direction of maximal variance
  8. “… standard Hebbian learning under the constraint of a unit weight vector applied to a linear unit maximizes the variance of the output signal. We have seen in the previous section that SFA can be reformulated as a maximization problem for the variance of the low-pass filtered output signal.  To achieve this, we simply apply Hebbian learning to the filtered input and output signals, instead of the original signals.” <The goal with analysis for the Hebbian rule is to find something more biologically plausible>
  9. “Thus, the filtered Hebbian plasticity rule … optimizes slowness … under the constraint of unit variance… ”  The requirement that the data already has unit variance  “… underlines the necessity for a clear distinction between processing <cleaning up the data?> and learning.”  <They talk more about processing vs learning but its not clear to me what they mean, which is unfortunate because they say the distinction is even more important when moving to the Poisson model neuron>
  10. SFA is a quadratic approximation of the trace rule (which comes from different power spectrums for the low pass filters the two use) <I don’t know what anything on this line means.>.  In the case that the power spectrum is parabolic (most input power is concentrated at lower frequencies) <Is that what that would mean?>, then the results of both will be similar.
  11. In SFA the outputs of each step are real-valued, but in a neuron there is simply spiking information.  Here a neuron is modeled by “inhomogenous Poisson processes” – here they simply consider information contained in the spike rate, and ignore information held in the exact spike timing (they describe this model mathematically)
  12. “…in an ensemble-averaged sense it is possible to generate the same weight distribution as in the continuous mode by means of an STDP rule with a specific learning window.”
  13. <I guess the model is optimal for learning slowness (mostly skipped that section) because it then follows up saying that they have yet to discuss why it is optimal>
  14. <Skipping neural modelling almost entirely>
  15. “In the first part of the paper, we show that for linear continuous model neurons, the slowest direction in the input signal can be learned by means of Hebbian learning on low-pass filtered versions of the input and output signal.  The power spectrum of the low-pass filter required for implementing SFA can be derived from the learning objective and has the shape of an upside-down parabola.”
  16. <Immediately following> “The idea of using low-pass filtered signals for invariance learning is a feature that our model has in common with several others [...]. By means of the continuous model neuron, we have discussed the relation of our model to these ‘trace rules’ and have shown that they bear strong similarities.”
  17. In the implementation of SFA on the a Poisson neuron, “The learning window that realizes SFA can be calculated analytically.”
    1. “Interestingly, physiologically plausible parameters lead to a learning window whose shape and width is in agreement with experimental findings.  Based on this result, we propose a new functional interpretation of the STDP learning window as an implementation of the slowness principle that compensates for neuronal low-pass filters such as EPSP.”
  18. “Of course, the model presented here is not a complete implementation of SFA, the extraction of the most slowly varying direction from a set of whitened input signals.  To implement the full algorithm, additional steps are necessary: a nonlinear expansion of the input space, the whitening of the expanded input signals, and a means of normalizing the weights… On the network level, however, whitening could be achieved by adaptive recurrent inhibition between the neurons [...].  This mechanism may also be suitable for extracting several slow uncorrelated signals as required in the original formulation of SFA [...] instead of just one.”
  19. For weight normalization “A possible biological mechanism is synaptic scaling [...], which is believed to multiplicatively rescale all synaptic weights according to postsynaptic activity… Thus it appears that most of the mechanisms necessary for an implementation of the full SFA algorithm are available, but that it is not yet clear how to combine them in a biologically plausible way.”
  20. <Immediately following>”Another critical point in the analytical derivation for the spiking model is the replacement of the temporal by the ensemble average, as this allows recovery of the rates that underlie the Poisson process.”  Data should be ergodic
  21. Not yet clear if these results can be reproduced with more realistic model neurons.
  22. “In summary, the analytical considerations presented here show that (i) slowness can be equivalently achieved by minimizing the variance of the time derivative signal or by maximizing the variance of the low-pass filtered signal, the latter of which can be achieved by standard Hebbian learning on the low-pass filtered input and the output signals; (ii) the difference between SFA and the trace learning rule lies in the exact shape of the effective low-pass filter–for most practical purposes the results are probably equivalent; (iii) for a spiking Poisson model neuron with an STDP learning rule, it is not the learning window that governs the weight dynamics but the convolution of the learning window with EPSP; (iv) the STDP learning window that implements the slowness objective is in good agreement with learning windows found experimentally.  With these results, we have reduced the gab between slowness as an abstract learning principle and biologically plausible STDP learning rules, and we offer a completely new interpretation of the standard STDP learning window.”

Should really look for another description of the algorithm if it exists.  For some reason I’m finding the paper very unclear on my first read through.  Its one of the few papers I’ve read that would be more understandable with more math and less English.

  1. HEXQ attempts to decompose and solve factored MDPs in a model-free manner
  2. Doesn’t deal with solving the decomposed MDP, just doing the decomposition itself
  3. Assumes:
    1. Some elements of the feature vector change less frequently than others
    2. The variables that change more often keep transition properties independently of more slowly changing variables
    3. “Interface between regions can be controlled.  For example, if a robot navigates around four equally sized rooms with interconnecting doorways (…) the state space can be represented by the two variables, room-identifier and position-in-room.  Most representations naturally label repeated sub-structures in this way.” <Seems to mean partitioning the feature vector along any particular dimension solely causes reasonable partitions of the state space>
  4. If these assumptions don’t hold, HEXQ will worst-case simply solve the flat problem
  5. Creates a hierarchy, with the maximum number of levels being the state dimension.  The bottom (level 1) is the variables that change most frequently
    1. “The rationale is that sub-tasks that are used most often appear at the lower levels and need to be learnt first.”
  6. Only the first level (fastest) interacts with the environment via primitive actions
  7. Start by observing the state variable that changes most frequently. “We now partition the states represented by the values of this variable into Markov regions.”  <I take this simply to mean regions of the flat state space corresponding to each possible assignment of that feature to a valid value, although not sure>
  8. “The boundaries between regions are identified by ‘unpredictable’ (see subsection 4.2) transitions which we call region exits.  We then define sub-MDPs over these regions and learn separate policies to leave each region via its various exits.”
  9. Regions are then combined with the next most frequently changing feature to make more abstract states at the next level in the hierarchy
    1. The exit policies then become abstract actions at that next level
    2. This results in a semi-MDP with one less feature in the state dimension and only abstract actions (exits, not primitive)
  10. The top-level has a sub-MDP that is solved by recursively calling lower-level MDP policies (exits) as its actions
  11. The feature ordering heuristic simply involves running a random trajectory and finding the frequency at which each feature changes
  12. Tries to model transitions as a directed graph <DBN?>, and random trajectories are taken.  “Following a set period of exploration in this manner, transitions that are unpredictable (called exits) are eliminated from the graph.”
  13. Transitions are unpredictable if:
    1. T or R is not stationary
    2. Some other variable changes value
    3. The terminal goal state is reached
  14. Entry states are reached after taking an exit (which is a state action pair <s^e, a>, where e is the level of the state)
    1. In Taxi, all states are entries because the agent is reset randomly after the goal state is reached <But doesn’t this criteria not jive with their assumptions that the domains are basically shortest path tasks?>
  15. An abstract state can only be left by an exit
  16. Approach:
    1. Decompose the transition graph into strongly connected components (SCCs)
    2. The SCCs form a DAG
    3. SCCs can then be merged together, potentially in a hierarchy – goal is to do as much of this as possible in order to minimize the # of abstract states
    4. “Following the coalition of SCCs into regions we may still have regions connected by edges from the underlying DAG.  We break these by forming additional exits and entries associated with their respective regions and repeat the entire procedure until no additional regions are formed.”  <Not exactly sure what this means yet>
  17. “A region, therefore, is a combination of SCCs such that any exit state in a region can be reached from any entry with probability 1.  Regions are generally aliased in the environment.”
  18. Regions have sub-goals of reaching exits
  19. “We proceed to construct multiple sub-MDPs one for each unique hierarchical exit state (s1, s2, …, se) in each region. Sub-MDP policies in HEXQ are learnt on-line, but a form of hierarchical dynamic programming could be used directly as the sub-task models have already been uncovered.”
  20. A region may have multiple exits, and each exit may lead to a separate sub-MDP <But those sub-MDPs are on the same level according to the wording which I don’t understand:> “In the taxi example, the above procedure finds one hierarchical level 1 region as reflected in figure 2.  This region has 8 exits… As there are 4 hierarchical exit states we create 4 sub-MDPs at level 1 and solve them.”
  21. Taking an abstract action at level e is equivalent to executing the policy for the appropriate region at level e-1
  22. In taxi, level 1 has just one state.  Level 2 has 5 states <corresponding to the 4 passenger locations plus the 1 state at level 1?>.  The abstract actions involve being at one of the locations where a passenger is and either doing a pickup or a putdown  <It seems like relevant aspects of state are ignored though, so it just has to do with pickup or putdown in a particular location on the map, regardless of whether a passenger is in the car or not, or what the destination is?  Oh, it seems like information at the next level up encodes whether there is a passenger in the car or not>.
  23. <The trick to go from deterministic shortest path problems to stochastic shortest path problems feels a bit hacky.  Basically, they keep long and short term statistics recorded, and see if they match or not>
  24. In taxi, HexQ ends up with a representation not much larger than MaxQ, while learns relevant hierarchical information itself
  25. Exits must work with certainty, how to generalize the algorithm to stochastic case is set for future work
  26. “The two heuristics employed by HEXQ are (1) variable ordering and (2) finding non-stationary transitions.”
  27. A problem is that “An independent slowly changing random variable would be sorted to the top level in the hierarchy by this heuristic and the MDP would fail to decompose as it is necessary to explore the variable’s potential impact.”
  28. HEXQ, like MAXQ is recursively optimal

A Theoretical Basis for Emergent Pattern Discrimination in Neural Systems Through Slow Feature Extraction. Klampfl, Maas. Neural Computation 2010.

  1. Shows equivalence of slow feature analysis to Fisher linear discriminant
  2. “We demonstrate the power of this principle by showing that it enables readout neurons from simulated cortical microcircuits to learn without any supervision to discriminate between spoken digits and to detect repeated firing patterns that are embedded into a stream of noise spike trains with the same firing statistics.  Both these computer simulations and our theoretical analysis show that slow feature extraction enables neurons to extract and collect information that is spread out over a trajectory of of firing states that last several hundred ms.”
  3. Also can train neurons to keep track of time themselves
  4. Much of the learning the brain does in unsupervised.  SFA works in that scenario
    1. Intuition is that things that occur nearny in time are probably driven by the same underlying cause
  5. Perceptual studies show that slowness can form “…position- and view-invariant representations of visual objects in higher cortical areas.”
    1. Images were altered during blindness caused by a saccade, and the perception was that the two merged
    2. The authors of a related work <Li and DiCarlo – DiCarlo seems to be the common element of that work> say “unsupervised temporal slowness learning may reflect the mechanism by which the visual stream builds and maintains object representations.”
    3. Studies of IT in monkeys shows similar evidence
  6. This work builds on the theoretical basis of how slowness can be responsible for such effects
  7. Work by Sprekeler, Michaelis, Wiskott (07) also show SFA to be equivalent to PCA of a low-pass filter
    1. “In addition, they have shown that an experimentally supported synaptic plasticity rule, spike-timing-dependent plasticity (STDP), could in principle enable spiking neurons to learn this processing step without supervision, provided that the presynaptic inputs are processed in a suitable manner.”
  8. “… we show that SFA approximates the discrimination capability of the FLD in the sense that both methods yield the same projection direction, which can be interpreted as a separating hyperplane in the input space.”
  9. Intuitions drawn from SFA may help explain some phenomena that is sometimes not so well covered by “… the classical view of coding and computation in the brain, which was based on the assumption that external stimuli and internal memory items are encoded by firing states of neurons, which asssign a certain firing rate to a number of neurons maintained for some time interval.”
    1. In particular, trajectories in the brain are likewise encoded by trajectories of firing over a several hundred ms.
    2. The firing in terms of neural trajectories has been found in terms of static stimuli (such as odors and tastes) as well as things that are naturally time-varying (such as auditory or visual stimuli)
    3. Firing of sequences of neurons also true in cases when considering traces of episodic memory in hippocampus and cortex
  10. Being that there is neural activity that works in a temporally dispersed manner, perhaps SFA has something to say about how the brain actually uses that neural activity
  11. <There are also simulated neural results, but details are too complex to outline here so will address those points when I get to that section fully>
  12. A nice result of the neural experiments, and general properties of SFA is that when used over temporal data it behaves in an anytime sense (always producing whatever slow signals are most appropriate), so items that depend on its outputs may begin to formulate responses quickly
    1. This is especially important when set up in a hierarchical manner
  13. For the FLD part, consider linear basis functions only
  14. For classification, SFA has an additional parameter p, which dictates the probability of showing items of different classes back-to-back (needs probability of seeing in-class temporal pairs to be better than chance, but seems to be surprisingly robust to values anywhere in that range)
  15. In case where eigenvalues are nonzero <which should be the case?> the SFA objective and FLD are equivalent
  16. SFA/FLD can also be extended to the multiclass setting
  17. “The last line of equation 2.19 is just the formulation of FLD as a generalized eigenvalue problem… More precisely, the eigenvectors of the SFA problem are also eigenvectors of the FLD problem; the C-1 [where C is the number of classes] slowest features extracted by SFA applied to the time series x_t span the subspace that optimizes seperability in terms of he FLD… The slowest feature… is the weight vector that achieves maximal separation…”
  18. When used as a classifier, the optimal responses should look like a step function
  19. They then move onto the case where the trajectory provided to SFA is not made of isolated points, but rather sub-trajectories assembled into a larger trajectory
  20. The idea is to model what happens when sequences of neural firing occurs, such as is the case when episodic experience is reviewed mentally
  21. <Still a little unsure of the exact formulation of the input in this section>
  22. In an example where class means are similar, FLD produces the “right” linear separator, while SFA chooses one that is basically degenerate (each half-space consists about 50-50 of points from both classes, even though it is linearly separable)
  23. In this case, SFA and FLD produce different results, both analytically as well as empirically
    1. This is because “… the temporal correlations induced by the use of trajectories have an effect on the covariance of temporal differences in equation 2.24 compared to equation 2.12″
  24. <Again for trajectory of trajectories case> “… even for a small value of p, the objective of SFA cannot be solely reduced to the FLD objective, but rather there is a trade-off between the tendency to separate trajectories of different classes… and the tendency to produce smooth responses during individual trajectories…”
  25. In the vanilla SFA classification construction, SFA should really see transition between all pairs of points in the same class equally often; when trajectories are used this starts to fall apart because most of the time-pairs are within a single instance as opposed to between two items in the same class; as the sub-trajectories become long they dominate the features constructed so the ability to do classification is lost
  26. Linear separators aren’t strong, but as dimension increases the ability to do linear separation becomes increasingly similar <although I would also argue that as dimension increases risk of overfitting comes along with it>
    1. Not only does separability increase, margin does as well
    2. “In other words, a linear readout neuron with d presynaptic units can separate almost any pair of trajectories, each defined by connecting fewer than d randomly drawn points.”
  27. Now moves onto “… SFA as a possible mechanism for training readouts of a biological microcircuit.”
  28. Based on previous discussion, when data points themselves are trajectories, each slow feature itself wont classify data (as it does in the vanilla case): “… we predict that the class information will be distributed over multiple slow features.”
  29. high-order slow feature is one that is fast
  30. Now getting on to their simulated neural results.  Not sure how SFA ties in exactly yet
  31. They run SFA on the output of the neural circuit
  32. “We then trained linear SFA readouts on the 560-dimensional circuit trajectories, defined as the low-pass filtered spike trains of the spike response of all 560 neurons of the circuit…”
    1. The signal passed in consists of a sequence of static input, with poisson noise in between
    2. At first glance, tehre slow features don’t seem to respond much to the distinction between signal/noise
    3. But on average, the first 2 slow features don’t respond at all to the noise, while responding with a noticeable pattern to the signal.
  33. “One interesting property of this setup is that if we apply SFA directly on the stimulus trajectories, we basically achieve the same result [as they do when running on the neural respones].  In fact, the application to the circuit is the harder task because of the variability of the response to repeated presentations  of the same pattern and because of temporal integration: the circuit integrates input over time, making the response during a pattern dependent on the noise input immediately before the start of the pattern.”  Noise in the circuit takes a little while to die down after the static signal is introduced, which makes it harder to pick up the static signal
    1. <I don’t really get this, because earlier in the paper they said that SFA is bad for classifying items that manifest as a time-series.  I though the point of using a neural circuit was to try and circumvent that issue, but here they are saying it works better without the neural circuit?>
  34. Now moving on to recognition of spoken digits
  35. “We preprocessed the raw audio files with a model of the cochlea (Lyon, 1982) and converted the resulting analog cochleagrams into spike trains that serve as input to our microcircuit model (see section B.3.2 for details).”
  36. Classification between digit utterances of words “one” and “two”
  37. <Not groking their training methodology for this section – not so clearly written and I probably need to eat lunch.  They mention training two different times but I’m not clear on what the purpose of what this distinction is>
  38. “We found that the two slowest features, y_1 and y_2, responded with shapes similar to half sine waves during the presence of a trajectory… which is in fact the slowest possible response under the unit variance constraint.  Higher-order features partly consisted of full sine wave responses, which are the slowest possible responses under the additional constraint to be decorrelated to previous slow features.”
  39. <Immediately following in next P> “In this example, the slowest feature y_1 already extracts the class of the input patterns almost perfectly: it responds with positive values for trajectories in response to utterances of digit 2 and with negative values for trajectories of digit 1 and generalizes this behavior to unseen test examples.”
  40. The first slow feature is closest to the FLD
  41. The first slow feature encodes what (digit), while the second slow feature encodes where (corresponding to a position in the trajectory identified by SF1).  This has been found with other application of SFA (Wiskott Sejnowski 02).  Other faster features encode a mixture of the two
  42. A linear classifier on the SFAs (itself with a linear kernel) is very effective (98%)
  43. Training FLDs and SVMs on the same data as SFA results in poorer performance <Hm.  SVMs are pretty powerful – wonder why this result is what they have>
  44. They then go onto the same classification problem but with more data: “Due to the increased number of different samples for each class (for each speaker, there are now 10 different digits), this task is more difficult than the speaker-independent digit recognition.”
    1. “No single slow feature extracts What-information alone; the closest feature to the FLD is feature y_3.  To some extent also, y_4 extracts discriminative information about the stimulus.”
    2. “In such a situation where the distance between the class means is very small, the tendency to extract the trajectory class itself as a slow feature becomes negligible. In that case, the theory predicts that SFA tries to distinguish each individual trajectory due to the decorrelation …”
  45. In this situtation, SFA can be used as a preprocessor, but is not so useful as a classifier itself
  46. In some sense SFA is poorly suited to direct application on neural bursting information because such activity is inherently non-slow <Here maybe they low-pass filter it first?>
  47. When trying to do classification of time-series (sequences), “… the optimization problem of SFA can be viewed as a composition of two effects: the tendency to extract the trajectory class as a slow feature and the tendency to produce a smooth response during individual trajectories.”
  48. “In the context of biologically realistic neural circuits, this ability of an unsupervised learning mechanism is of particular interest because it could enable readout neurons, which typically receive inputs from, a large number of presynaptic neurons of the circuit, to extract from the trajectory of network states information about the stimulus that has caused this particular sequence of states–without any ‘teacher’ or reward.”
  49. Earlier work by Berkes showed similar results between SFA and FLD for handwritten digit recognition, but the two were not formally linked in that work
  50. <An excellent paper with tons of good references – worth rereading.>

Get every new post delivered to your Inbox.