Category Archives: Dimension Reduction

Compressed Sensing, Sparsity, and Dimensionality in Neuronal Information Processing and Data Analysis. Ganguli, Sompolinsky. Annual Review of Neuroscience 2012.

  1. Considers ways to work with high dimensional data (and dimension reduction) with a particular focus on visual processing
  2. “… natural images are often sparse in the sense that if you view them in the wavelet domain (roughly as a superposition of the edges), only a very small number of K wavelet coefficients will have significant power, where K can be on the order of 20,000 for a 1-million-pixel image.”
    1. “… similarly, neuronal activity patterns that actually occur are often a highly restricted subset of all possible patterns (…) in the sense that they often lie along a low K-dimensional firing-rate space…”
  3. Here they consider random projections to do dimensionality reduction 
  4. “…compressed sensing… shows that the shadow can contain enough information to reconstruct the original image … as long as the original image is sparse enough.”
  5. L1 minimization
  6. Random projections can be achieved with neurons
  7. “… CS [compressed sensing] and RPs [random projections] can provide a theoretical framework for understanding one of the most salient aspects of neuronal information processing: radical changes in the dimensionality, and sometimes sparsity, of neuronal representations, often within a single stage of synaptic transformation.”
  8. CS also deals with modeling high-dimensional data, a problem the brain has to handle as well
  9. “CS provides mathematical guarantees that one can achieve perfect recovery with a number of measurements M that is only slightly larger than K [the underlying basis dimension basically], as long as the M measurement vectors are sufficiently incoherent with respect to the sparsity domain (…). [next paragraph] An important observation is that any set of measurement vectors, which are themselves random, will be incoherent with respect to any fixed sparsity domain.”
  10. Only M > O(K log(N/K)) measurements are needed to guarantee perfect reconstruction with high probability
  11. “… no measurement matrices and no reconstruction algorithm can yield sparse signal recovery with substantially fewer measurements [than is required with random projection]”
  12. L1 minimization solves a slightly different problem than the one we actually care about, but the original (exact-sparsity / L0) problem is intractable whereas the L1 version is convex and tractable, and it gets very close to optimal results (a toy recovery sketch appears after this list)
  13. “… any projection that preserves the geometry of all K-sparse vectors allows one to reconstruct these vectors from the low-dimensional projection efficiently and robustly using L1 minimization”
  14. “The celebrated Johnson-Lindenstrauss (JL) lemma (…) provides a striking answer [to how small we can make the dimension before the distances in the projection become significantly distorted relative to those in the original high-dimensional representation]… It states that RPs with M > O(log P) will yield, with high probability, only a small distortion in distance between all pairs of points in the cloud.  Thus the number of projected dimensions M needs only to be logarithmic in the number of points P, independent of the embedding dimension of the source data, N.”  (A quick numerical check appears after this list.)
  15. For “…data distributed along a nonlinear K-dimensional manifold embedded in N-dimensional space… M > O(K log NC) RPs preserve the geometry of the manifold with small distortion.”  Where C represents curvature of the manifold
  16. Results mean that few RPs can be used to represent data accurately
  17. When noise exists, use LASSO
  18. “The main outcome is roughly that for an appropriate choice of λ, which depends on the signal-to-noise ratio (SNR), the same conditions that guaranteed exact recovery of K-sparse signals by L1 minimization in the absence of noise also ensure good performance of the LASSO for approximately sparse signals in the presence of noise.”
  19. Can be used for dictionary learning
  20. <discussion of a number of particular topics, such as MRI, gene expression analysis (and others), skipping>
  21. In the brain “information stored in a large number of neurons is often compressed into a small number of axons, or neurons in a downstream system.  For example, 1 million optic nerve fibers carry information about the activity of 100 times as many photoreceptors.”
  22. The JL lemma shows that you only need a number of projections logarithmic in the number of points you care about to maintain accuracy.  In the brain, “… 20,000 images can be represented by the corresponding population activity in the IT cortex. Then the similarity structure between all pairs of images can be preserved to 10% precision in a downstream area using only ~1000 neurons.  Furthermore, this result can be achieved by a very simple dimensionality-reduction scheme, namely by a random synaptic connectivity matrix.”
  23. In the case where the points are continuous rather than discrete, significant compression is still possible when the points lie on a manifold.  In this case only a logarithmic number of neurons (in the number of neurons in the source area) is required.
  24. “The ubiquity of this low-dimensional structure in neuronal systems may be intimately related to the requirement of communication and computation through widespread anatomical bottlenecks.”
  25. “Another bottleneck is posed by the task of working memory, where streams of sensory inputs must presumably be stored within the dynamic reverberations of neuronal circuits.  This is a bottleneck from time into space.”  Temporally extended input streams must be represented in a finite number of neurons.  Recurrent networks allow for this sort of representation.
  26. Other work shows “…a connection between CS and short-term memory by showing that recurrent neuronal networks can essentially perform online, dynamical compressed sensing of an incoming sparse sequence, yielding sequence memory traces that are longer than the number of neurons, again in units of the intrinsic time constant.”
  27. In many cases, the goal is not to compress the signal, but instead to project it to a higher dimensional space. “For example, information in 1 million optic nerve fibers is expanded into more than 100 million primary visual cortical neurons.  Also in the cerebellum, a small number of mossy fibers target a large number of granule cells, creating a 100-fold expansion.”
  28. Can design ANNs that do either random dimensionality reduction or expansion; both require just two layers
  29. In general, you need more data than neurons in order to train a system.  If the data lies on a lower dimensional manifold you can get away with less
  30. For classification “A remarkable theoretical result is that if synapses are constrained to be either excitatory or inhibitory, then near capacity, the optimal solution is sparse, with most of the synapses silent (…) even if the input patterns themselves show no obvious sparse structure.  This result has been proposed as a functional explanation for the abundance of silent synapses in the cerebellum and other brain areas.”
    1. If weights are unconstrained, optimal solutions are still sparse, but not in the basis of neurons; instead they can be expressed as a linear combination of support vectors
  31. Because RPs basically maintain Euclidean distances, they also maintain margins in the original space
  32.  <May not have finished this>
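
To make the compressed-sensing story in items 9–18 concrete, here is a minimal numerical sketch (my own, not from the review): a K-sparse signal in R^N is measured with roughly O(K log(N/K)) random projections and recovered by L1-regularized regression, using scikit-learn's Lasso as the solver. The dimensions, noise level, and regularization strength are arbitrary illustration choices.

```python
# Minimal compressed-sensing sketch: recover a K-sparse signal from
# M ~ O(K log(N/K)) random projections via L1-regularized regression.
# Toy illustration only; all constants are arbitrary.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

N, K = 1000, 10                      # ambient dimension, sparsity
M = int(4 * K * np.log(N / K))       # number of random measurements

# K-sparse signal in the canonical basis
x = np.zeros(N)
support = rng.choice(N, size=K, replace=False)
x[support] = rng.normal(size=K)

# Random (hence incoherent) measurement matrix and noisy measurements
A = rng.normal(size=(M, N)) / np.sqrt(M)
y = A @ x + 0.01 * rng.normal(size=M)

# LASSO objective: (1 / 2M) * ||y - A w||^2 + alpha * ||w||_1
lasso = Lasso(alpha=1e-3, max_iter=50_000)
lasso.fit(A, y)
x_hat = lasso.coef_

print(f"M = {M} measurements for N = {N}")
print("relative recovery error:",
      np.linalg.norm(x_hat - x) / np.linalg.norm(x))
```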
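And a quick numerical check of the JL-style claim in item 14, again my own toy code: project a cloud of P points from R^N down to M ~ O(log P / eps^2) dimensions with a Gaussian random matrix and look at how much the pairwise distances move. The constant 8 is just a convenient choice, not a tight bound from the paper.

```python
# Johnson-Lindenstrauss check: for P points in R^N, M ~ O(log P / eps^2)
# random projections keep all pairwise distances within roughly (1 +/- eps).
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)

N, P, eps = 10_000, 500, 0.25
M = int(8 * np.log(P) / eps**2)           # projected dimension, ~ log P

X = rng.normal(size=(P, N))               # point cloud in R^N
R = rng.normal(size=(N, M)) / np.sqrt(M)  # random projection matrix
Y = X @ R

ratios = pdist(Y) / pdist(X)              # projected / original distances
print(f"M = {M} projected dims for P = {P} points in N = {N} dims")
print(f"distance ratios: min {ratios.min():.3f}, max {ratios.max():.3f}")
```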

Automated Derivation of Behavior Vocabularies for Autonomous Humanoid Motion. Jenkins, Mataric. AAMAS 2003

  1. Automatic derivation of motion primitives and means of combining them from human motion data using isomap
  2. For robot control, useful to have a restricted set of primitives to search over to do motion instead of planning in full space which is huge.  Sometimes this set can be defined by hand, but this has obvious limitations (time-intensive, requires domain expertise, even then may be incorrect)
  3. Paper extends isomap to extract spatio-temporal structure
    1. Finding motion primitives relies on additional clustering and interpolation
  4. Here they use the method for motion synthesis
  5. temp
  6. “In Motion Textures, <another approach> motion segments are learned so they can be accurately reproduced by a linear dynamical system within some error threshold.”
    1. Here the goal is “… to produce primitive behaviors that have an observable theme or meaning.”
    2. Method here requires some domain knowledge
  7. A number of references, but this work and the motion texture seem the most relevant
  8. “Each  extracted feature represents a group of motions with the same underlying theme and can then be used to construct a primitive motion module realizing the theme.”
  9. Straight PCA has been used, but the features that come out of it aren’t really useful
    1. Can also do PCA and then clustering on top of that, but unless the # of clusters is known a priori, the results also come out pretty meaningless.  Also, the clustering only gives meaningful results on data that falls within the corpus used; it doesn’t extrapolate well
  10. Vanilla Isomap, locally linear embedding, and kernel PCA all have the same qualitative drawbacks as vanilla PCA
  11. Here they do Isomap, but add temporal component
  12. <figure omitted: screenshot comparing the Isomap and spatio-temporal Isomap embeddings of the S-curve example>
  13. “The Isomap embedding unravels the spatial structure of the S-curve removing the ‘s’ nonlinearity, producing the flattened data indicative of the 2-manifold structure of the S-curve.  However, the model that has generated the data is a 1-manifold with an S-curve nonlinearity and multiple translated instances.  Spatio-temporal Isomap produces an embedding indicative of this 1-manifold structure.  This embedding both unravels the S-curve nonlinearity and collapses corresponding points from multiple instances of the S-curve to a single point.”
  14. The idea behind isomap (as well as kernel PCA) is to do eigen decomposition on a similarity matrix in feature space as opposed to a covariance matrix (which is used in PCA)
    1. “The feature space is a higher dimensional space in which a linear operation can be performed that corresponds to a nonlinear operation in the input space.  The caveat is that we cannot transform the data directly to feature space.  For performing PCA in a feature space, however, we only require the dot-product (or similarity) between every pair of data points in feature space.  By replacing the covariance matrix C with the similarity matrix D, we fit an ellipsoid to our data in feature space that produce nonlinear PCs in the input space.”
  15. Spatio-temporal isomap works the same way, but an extra step is added to deal with temporal dependencies.  As compared to vanilla isomap “If the data have temporal dependency (i.e., a sequential ordering), spatial similarity alone will not accurately reflect the actual structure of the data.”
  16. They tried two methods of dealing with temporal dependencies
  17. Spatial neighbors are replaced by adjacent temporal neighbors, which introduces a first-order Markov dependency into the embedding (a toy version of this idea is sketched after this list)
  18. Creates connected temporal neighbors (CTNs), as well as CTN connected components
    1. “Points not in this CTN connected component will be relatively distal.  Thus, CTN connected components will be separable in the embedding through simple clustering.”
  19. Processing kinematic motion:
    1. 1st iteration makes clusterable groups of motion, primitive feature groups
    2. Interpolation is done on primitive feature groups to make parameterized primitive motion modules
    3. Spatio-temporal isomap is again applied to first embedding to make more clusterable groups of motion called behavior feature groups
    4. From a behavior feature group, meta-level behavior encapsulates component primitives and links them with transition probabilities
  20. Feels like it’s doing things pretty similar to slow feature analysis; “Incremental Slow Feature Analysis: Adaptive Low-Complexity Slow Feature Updating from High-Dimensional Input Streams” does mention this work as similar but a little different.
  21. First step requires motion segmenting.  They use both ground truth as well as an algorithm called kinematic centroid segmentation 
    1. SFA doesn’t require this, the segmenting falls out by itself
    2. <I think this is a bit of a deal breaker for us>
  22. Seems like they use a pretty simple clustering method <they call it “sweep and prune”, never heard of it>
  23. Set of motion segments in each feature group are treated as exemplars.  Variations on that can be constructed by interpolating (many different ways of doing the interpolation) between exemplars
  24. Processing sequence does:
    1. primitive features -> behavior features -> meta-level behaviors
  25. <skimming the rest because the segmentation requirement for initial processing probably makes this unusable for us>
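
Since the temporal-neighbors trick in item 17 is the core idea, here is a rough toy reconstruction of spatio-temporal Isomap as I understand it (not the authors' code): compute spatial distances between frames, shrink the distances between temporally adjacent frames, then run the usual Isomap pipeline (k-NN graph, geodesic distances, classical MDS). The shrink factor, neighborhood size, and the synthetic trajectory are all arbitrary choices.

```python
# Toy sketch of the spatio-temporal Isomap idea: spatial distances are
# modified so temporally adjacent frames become close before the usual
# Isomap steps (k-NN graph -> geodesic distances -> classical MDS).
import numpy as np
from scipy.sparse.csgraph import shortest_path

def st_isomap(X, k=8, c_temporal=0.1, n_components=2):
    """X: (T, D) sequence of frames ordered in time."""
    T = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # Temporal step: make adjacent frames artificially close.
    for t in range(T - 1):
        D[t, t + 1] *= c_temporal
        D[t + 1, t] *= c_temporal

    # k-nearest-neighbor graph on the modified distances (inf = no edge).
    G = np.full((T, T), np.inf)
    for i in range(T):
        nn = np.argsort(D[i])[1:k + 1]
        G[i, nn] = D[i, nn]
        G[nn, i] = D[i, nn]

    # Always keep the temporal chain, so the graph stays connected.
    for t in range(T - 1):
        G[t, t + 1] = min(G[t, t + 1], D[t, t + 1])
        G[t + 1, t] = G[t, t + 1]

    geo = shortest_path(G, method="D", directed=False)

    # Classical MDS on the geodesic distances.
    H = np.eye(T) - np.ones((T, T)) / T
    B = -0.5 * H @ (geo ** 2) @ H
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:n_components]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

# Toy use: a noisy 1-D trajectory embedded in 10 dimensions.
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 300)
X = np.outer(np.sin(t), rng.normal(size=10)) + 0.05 * rng.normal(size=(300, 10))
Y = st_isomap(X)
print(Y.shape)  # (300, 2)
```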

Learning Representation and Control in Markov Decision Processes: New Frontiers. Mahadevan

  1. Approach is to solve MDPs by iteratively considering low-D reps and approx optimal policies
  2. “Exact solutions of discounted and average-reward MDPs are expressed in terms of a generalized spectral inverse of the Laplacian called the Drazin inverse.”
  3. Algorithm is called representation policy iteration (RPI)
    1. Model-free and model-based versions are given
  4. “Two approaches for dimensionality reduction of MDPs are described based on geometric and reward-sensitive regularization, whereby low-dimensional representations are formed by diagonalization or dilation of Laplacian operators.”
  5. “… the aim is to solve MDPs by automatically finding ‘low-dimensional’ descriptions of ‘high-dimensional’ functions on a state (action) space.”
    1. The relevant functions are the policy, transitions, and value
  6. Method tries to find representations based on the underlying manifold dynamics
    1. Means method is usable in continuous (state) problems
  7. In the Laplacian off-diagonal elements are nonpositive, and row sums are 0
    1. Means they are singular and therefore in general don’t have a direct inverse
    2. Moore-Penrose pseudoinverse is the most common method when direct inverse isn’t possible
    3. But something called the Drazin inverse (a generalized spectral inverse) applies in these cases and is very important in the study of Markov chains
  8. In particular, for an MDP M with transition matrix T, care about
    1. A=I-T
    2. As well as the Drazin inverse of A
    3. But then they go on to call A “L” and T “P”, so L = I – P
  9. Getting a good low-dimensional representation depends on finding the right basis functions that span the underlying manifold
  10. Deciding what basis to use is fundamentally about making tradeoffs – for example, in a discrete domain, using tabular methods will yield optimal results but will be expensive and noncompact; using a sparser representation is cheaper in some ways <although with my critical CS hat on, I think the big-O costs are really the same, as there are order-3 space and computational costs in the linear algebra that don’t go away even when using fewer bases>
  11. “… the basis construction problem may require as much or more computational effort than required to solve the original MDP.”
    1. On the other hand, in the case where transfer can be utilized, this extra cost can be amortized across instances
  12. Two main classes of methods are considered for basis construction in MDPs:
    1. Diagonalization (of the graph Laplacian); a toy sketch of this route appears after this list
    2. Dilation: “For example, a dilation operator on the space of functions on real numbers is Tf(x) = f(2x).  Several dilation-based approaches will be compared, including methods based on Krylov spaces, a standard approach of solving systems of linear equations […].  Applied to MDPs, this approach results in the reward function being ‘dilated’ by powers of some operator, such as the transition matrix […].  A novel basis construction method called Drazin bases is described in this paper, which uses the Drazin inverse of the Laplacian LD… the discounted value function of a MDP can be written in terms of a series of powers of the Drazin inverse.”
    3. Another option is wavelets; more specifically, for this case, diffusion wavelets, which are also based on dilation.
  13. Considers average and discounted reward cases (for average reward cases, value is called gain)
  14. Mentions Laplacians for directed graphs <but I’m pretty sure this isn’t an option for arbitrary directed graphs, only special cases.>
  15. In general, need to consider “generalized inverses” of the Laplacian – it may be noninvertible if it is not of full rank
    1. There are 6 properties the actual inverse needs to satisfy; the Moore-Penrose pseudo inverse satisfies 4 and will be used for inverses related to value functions
    2. The Drazin inverse is important in the study of Markov chains
  16. <Paper is quite long – skimming most of it>
  17. The graph Laplacian induces a reproducing kernel Hilbert space (RKHS)
  18. Approximate methods for solving MDPs are discussed: Least-squares, linear programming, Hilbert space
    1. “All these approaches depend crucially on a choice of basis functions for constructing a low-dimensional representation of the MDP, and assume these are provided by the human designer. <I think topic of basis construction is addressed in a later chapter>”
  19. “An elegant and general way to formalize value function approximation builds on the framework of approximation in a type of vector space called a Hilbert space […].”
  20. A Hilbert space is a vector space with an inner product defined between any two vectors, which induces a notion of length (for example, Euclidean space is a Hilbert space, with the inner product simply being f^T g)
  21. For MDPs, distance has to be weighted according to frequency of visitation of states
  22. If bases are orthonormal, the projection to find the most similar elements can be defined by an abstract Fourier series
  23. Least-squares approaches consist of those based on Bellman residual, and that of fixed points
    1. <This stuff doesn’t work, moving along>
  24. LSTD
  25. LSPI
  26. VI via RKHS is related to support vector regression
  27. “… it is possible to show that a least-squares approximation of the value function on a basis matrix Φ is equivalent to solving an exact ‘low-dimensional’ MDP…”
  28. Given bases and policy, an approximate low-D reward and transition model are defined
  29. By Parr: Given bases, the exact solution to approx policy evaluation defined by low-D rewards and transitions is the same as the fixed point solution of exact policy evaluation on the full MDP projected onto the bases.
    1. The computational complexity is reduced from cubic in n <the number of states I assume> to cubic in k <the number of bases I assume>
  30. “For continuous <state> MDPs, the basis functions are represented on a set of ‘sampled’ states.”
  31. Considers cases of reward-aware and reward-agnostic basis functions
  32. Construction of bases according to policies is done during RPI
    1. “It is possible to make the policy-specific bases additionally customized to a particular reward, or indeed, to have them invariant to the reward function.  At every new round of policy iteration, a new set of bases will have to be generated.  Consequently, in this formulation, the bases constructed have a fairly short life span, and multiple sets of bases have to be constructed for solving just a single MDP. On the other hand, by tuning the bases to a specific policy and reward function, it is possible to develop some theoretical guarantees on how well they can approximate Vπ.”
  33. There are also incremental (one basis at a time) and batch methods (all at once) of basis construction
    1. Bases may also be constructed incrementally from samples taken from the MDP
  34. One method of basis construction is state aggregation
    1. Discusses clustering according to Bellman residual – error bounds are given
  35. Onto “Finding Invariant Subspaces of a MDP”
    1. “Unfortunately, transition matrices Pa are often not reversible, in which case, diagonalization leads to complex eigenvectors. One solution is to replace Pa with a related stochastic matrix, i.e., reversible and whose eigenspaces provide a suitable space to compress a MDP.”
    2. On the other hand “For the sub-class of diagonalizable transition matrices represented by reversible Markov chains, the transition matrix is not only diagonalizable, but there is also an orthonormal basis.”
  36. <Timing out on this one – think I basically get the idea from his other works>
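
As a concrete (and much simplified) illustration of the diagonalization route in item 12, here is a toy sketch in the spirit of Mahadevan's proto-value functions, written by me rather than taken from the paper: build the combinatorial Laplacian L = D - A of a gridworld's state graph, take the eigenvectors with the smallest eigenvalues as smooth basis functions, and check how well a least-squares projection onto those bases reproduces the exact value function of a random-walk policy. Real use would pair such bases with LSTD/RPI on samples; this only measures basis quality.

```python
# Toy "diagonalization" basis construction: Laplacian eigenvectors of a
# gridworld's state graph used as smooth bases for value approximation.
import numpy as np

side = 10
n = side * side
gamma = 0.95

# Adjacency matrix of a 4-connected grid of states.
A = np.zeros((n, n))
for r in range(side):
    for c in range(side):
        s = r * side + c
        if r + 1 < side:
            A[s, s + side] = A[s + side, s] = 1
        if c + 1 < side:
            A[s, s + 1] = A[s + 1, s] = 1

# Combinatorial graph Laplacian: L = D - A (rows sum to zero, so L is singular).
L = np.diag(A.sum(axis=1)) - A

# Bases = eigenvectors with the k smallest eigenvalues (the smoothest functions).
w, V = np.linalg.eigh(L)
k = 20
Phi = V[:, :k]

# Exact value function of a random-walk policy with reward 1 in one corner.
P = A / A.sum(axis=1, keepdims=True)
R = np.zeros(n)
R[-1] = 1.0
V_true = np.linalg.solve(np.eye(n) - gamma * P, R)

# How well does a least-squares projection onto the Laplacian bases do?
coef, *_ = np.linalg.lstsq(Phi, V_true, rcond=None)
V_hat = Phi @ coef
print("relative error with", k, "Laplacian eigenvector bases:",
      np.linalg.norm(V_hat - V_true) / np.linalg.norm(V_true))
```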

Submodular Meets Spectral: Greedy Algorithms for Subset Selection, Sparse Approximation, and Dictionary Selection. Das, Kempe. ICML 2011

  1. How do you select a subset of k variables from a large set in order to get the best linear prediction of another variable?
    1. To be clear, it’s about choosing dimensions of the data, as opposed to selecting a subset of whole data points
  2. Related to feature selection and sparse approximation
  3. Analyzes performance of commonly used, empirically effective greedy heuristics
    1. Analysis based on submodularity and spectral analysis
  4. Introduces the “… submodularity ratio as a key quantity to help understand why greedy algorithms perform well even when the variables are highly correlated.”
  5. Get best approximation guarantees in terms of both submodularity ratio as well as “… smallest k-sparse eigenvalue of the covariance matrix.”
  6. Also get better bounds on dictionary selection problem <not sure what that is>
  7. Test on real-world as well as synthetic data
    1. Results show submodularity ratio is a better predictor of performance of greedy algorithms than other spectral parameters
  8. Commonly, after subset selection is performed, the goal is to minimize MSE or maximize the squared multiple correlation R2 <what’s that? sounds neat>.  Here they focus on the latter
  9. Selection criteria is based on covariance matrices
  10. Maximizing R2 (equivalently, minimizing the residual) is “… equivalent to the problem of sparse approximation over dictionary vectors…”
  11. Formulation is similar to that of sparse recovery, except there the assumption is the data is actually k-sparse, but in many cases that assumption doesn’t hold and the data is actually dense
  12. The problem as specified is NP-Hard, so heuristics are the only way to go
  13. Common heuristics are greedy methods, or those based on convex relaxation.  Here they focus on the former, as the methods are generally simpler and require less tuning
  14. Of the greedy methods, most common are Forward Regression and Orthogonal Matching Pursuit
  15. Previous theoretical work on these greedy methods have been lacking in applicability
  16. “We prove that whenever the submodularity ratio is bounded away from 0, the R2 objective is ‘reasonably close’ to submodular, and Forward Regression gives a constant-factor approximation.”
  17. Although they mention issues with spectral methods in this context (“… greedy algorithms perform very well, even for near-singular input matrices.”), the submodularity ratio is related to spectral analysis:
    1. The submodularity ratio “… is always lower-bounded by the smallest k-sparse eigenvalue of the covariance matrix [but can be much larger].”
  18. They also get bounds for the two greedy methods when the lowest k-sparse eigenvalue is non-zero
  19. In comparison between performance as related to submodularity ratio vs spectral analysis “… while the input covariance matrices are close to singular, the submodularity ratio actually turns out to be significantly larger.  Thus our theoretical results can begin to explain why, in many instances, greedy algorithms perform well in spite of the fact that the data may have high correlations.”
  20. R2 describes the fraction of the variance of the desired output values Y that are explained by variables in the X inputs in the corpus
  21. Forward regression is the natural greedy method, where the variable added at each step is the one that most increases R2 immediately (a minimal sketch appears after this list)
  22. Give a bound for how much  the R2 values of a set and a sum of its elements can differ, used for the proof of performance of forward regression
  23. <Skipping actual proofs, onto empirical results>
  24. “Because several of the spectral parameters (as well as the optimum solution) are NP-hard to compute, we restrict our experiments to data sets with n<30 features, from which k<8 are to be selected.  We stress that the greedy algorithms themselves are very efficient.”
  25. Data sets are World Bank development indicators, house prices in Boston w/relevant features, and then a synthetic set
  26. Compared to optimal (exhaustive) results, the greedy algorithms perform extremely well
  27. The greedy methods work better than Lasso
  28. The bound based on the submodularity ratio is much better than that of the spectral parameters
    1. While there is a fair amount of looseness between the theoretical bounds and actual performance (theoretical is pessimistic), the difference between actual and theoretical is not absurdly different
    2. They also include an extension to the theory that drastically tightens the bounds <why not put this in the theoretical section in the first place, as opposed to with the empirical results?>
  29. The real-world data sets are nearly singular
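
For reference, a minimal version of the Forward Regression heuristic from item 21 (my own sketch, not the authors' code): greedily add whichever variable most increases R2 of the linear fit. The synthetic data with a shared correlated component is just meant to mimic the "highly correlated variables" setting the paper worries about.

```python
# Minimal Forward Regression: greedily add the feature that most increases
# R^2 of a linear fit to the target. Toy illustration only.
import numpy as np

def r_squared(X, y):
    """R^2 of least-squares regression of y on the columns of X (with intercept)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

def forward_regression(X, y, k):
    selected = []
    for _ in range(k):
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        scores = [r_squared(X[:, selected + [j]], y) for j in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected

# Toy data: 30 correlated features, y depends on 3 of them plus noise.
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 30))
X = Z + 0.5 * rng.normal(size=(200, 1))   # shared component -> correlations
y = 2 * X[:, 3] - X[:, 7] + 0.5 * X[:, 11] + 0.1 * rng.normal(size=200)

S = forward_regression(X, y, k=5)
print("selected features:", S, " R^2:", round(r_squared(X[:, S], y), 3))
```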

Regularized Off-Policy TD-Learning. Liu, Mahadevan, Liu. NIPS 2012

At this point, I don’t have time to give this paper justice, so consider this pretty close to an abstract

  1. Another regularized TD paper
  2. This algorithm is called RO-TD, is off policy and converges
  3. On the other hand, “Although TD converges when samples are drawn ‘on-policy’ … it can be shown to be divergent when samples are drawn ‘off-policy’.”
  4. Mentions LARS-TD and LCP-TD
  5. “In this paper, the off-policy TD learning problem is formulated from the stochastic optimization perspective.  A novel objective function is proposed based on the linear equation formulation of the TDC  algorithm [used in LS-TD].  The optimization problem underlying off-policy TD methods, such as TDC is reformulated as a convex-concave saddle-point stochastic approximation problem, which is both convex and incrementally solvable.”
  6. <Not sure if this is simpler or more complex than the other L1 regularizations of LSTD>
  7. Gives an example (well-known) domain where LSTD and LARS-TD diverge but RO-TD and TDC converge
  8. Seems like TDC actually outperforms RO-TD, but then in the next example, TDC fails and RO-TD and LARS perform properly

LSTD With Random Projections. Ghavamzadeh, Lazaric, Maillard, Munos. NIPS 2010

Another cursory read through

  1. Considers case where #features is larger than #samples
  2. Analyzes LSTD when the features are random projections (a toy sketch appears after this list)
  3. Theoretical paper that provides performance bounds
  4. LSTD finds the value function of a fixed policy (that is why algorithms like LSTD with L1 regularization are paired with some form of policy iteration, often LSPI)
  5. Show the impact of errors during subsequent steps of DP
  6. Performance bound given for the resulting LSPI algorithm
  7. Tradeoff is reduction in estimation <testing?> error but increase in approximation <training?> error
  8. Works off of data from a trajectory
  9. Cost of LSTD-RP is O(d^3 + ndD)
    1. d is dimension of projected space, D is dimension of original space, n is number of samples in trajectory
  10. Vanilla LSTD has cost of O(D^3 + nD^2)
  11. Analysis of performance shows d should be √n
    1. In this case, LSTD-RP cost is O(n^{3/2} D), which is better than O(D^3) when n < D
    2. On the other hand, the cost of prediction in LSTD-RP is O(dD), whereas in LSTD it’s O(D)
  12. They give a horribly complex error bound
  13. Assume random projections drawn from 0-centered Gaussian
  14. There is also a simple bound given for L2 regularized LSTD
  15. A crazily complex bound is given for LSTD-RP with LSPI
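
A toy sketch of the LSTD-RP idea (mine, not from the paper): generate high-dimensional state features with D much larger than the number of samples n, project them to d ≈ sqrt(n) dimensions with a Gaussian random matrix, and run ordinary LSTD for policy evaluation on a single trajectory of a small chain. The chain, the features, and all constants are arbitrary illustration choices.

```python
# Toy LSTD with random projections: project D-dim features down to
# d ~ sqrt(n) dims, then run ordinary LSTD on one trajectory.
import numpy as np

rng = np.random.default_rng(0)

# Small chain MDP, random-walk policy, reward at the right end.
S, gamma = 20, 0.95
D, n = 2000, 400                     # original feature dim >> trajectory length
d = int(np.sqrt(n))                  # projected dimension, as the analysis suggests

Phi = rng.normal(size=(S, D))        # high-dimensional state features
R_proj = rng.normal(size=(D, d)) / np.sqrt(d)
Phi_low = Phi @ R_proj               # randomly projected features

def lstd(features, states, rewards, gamma):
    """Ordinary LSTD on one trajectory: solve A theta = b."""
    k = features.shape[1]
    A = np.zeros((k, k))
    b = np.zeros(k)
    for t in range(len(states) - 1):
        phi, phi_next = features[states[t]], features[states[t + 1]]
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * rewards[t]
    return np.linalg.pinv(A) @ b

# One trajectory of the random walk with reflecting ends.
states = [S // 2]
for _ in range(n):
    s = states[-1] + rng.choice([-1, 1])
    states.append(min(S - 1, max(0, s)))
rewards = [1.0 if s == S - 1 else 0.0 for s in states[:-1]]

theta = lstd(Phi_low, states, rewards, gamma)
V_hat = Phi_low @ theta
print(f"projected dim d = {d}; estimated values at left/right ends:",
      round(float(V_hat[0]), 3), round(float(V_hat[-1]), 3))
```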

Finite-Sample Analysis of Lasso-TD. Ghavamzadeh, Lazaric, Munos, Hoffman. ICML 2011.

Brief notes from a very light read – look at the authors and you know that you can either read this paper in 5 minutes, or 5 weeks

  1. Theoretical results encompass both LARS-TD and LC-TD.  They call their method Lasso-TD.
  2. Methods form a contraction mapping, so they converge
  3. A Lasso-TD solution always exists (although it may not be unique) and has a unique value function
  4. Performance bounds are derived on the estimation error, with rate O(||α||_1 n^{-1/4}); additional assumptions improve this to O(||α||_0 n^{-1/2})
  5. Theorem connects “… prediction performance of Lasso-TD (which is an L1-regularized fixed point problem) and the L1-norm of the best approximation of V in Fn.”
  6. As the number of features d increases w.r.t. the number of samples n, performance of vanilla LSTD gets continually worse, while Lasso-TD only degrades slowly (~O(log d))
  7. “… in order for Lasso-TD to be effective, a sparse approximation of the value function must exist.”
    1. They go on to talk about some properties of an MDP that would cause this property to exist
  8. Domains with smooth rewards and dynamics will have smooth value functions.
    1. “From approximation theory, we know that piecewise-smooth functions can be accurately approximated by localized features, such as wavelets (…).  As a result, if we define F as the linear function space spanned by a set of d wavelets, any value function is likely to be well-approximated by functions in F with a relatively small number of non-zero coefficients.”
  9. Also mentions LSTD-RP (LSTD with random projections <on my reading list>).
    1. In the general unrestricted case, LSTD-RP works better than Lasso-TD.
    2. When sparsity does exist, however, Lasso-TD works better

Dimension reduction and its application to model-based exploration in continuous spaces. Nouri, Littman. Machine Learning 2010

Just a note that I read this paper while I didn’t have ability to take notes, so notes are pretty sparse here.  If I go down the road of dimensionality reduction / feature selection in RL I should cite this, but probably the Parr, Ng papers on L1 regularization in TD are more relevant at the moment.

  1. Addresses exploration in high dimensional domains
  2. Uses a continuous value of knownness <C-KWIK?> to drive exploration smoothly, is closer in spirit to MBIE.  This is better than the standard binary known/unknown labels that R-Max uses
  3. Does learning according to KNN with RBFs, learns RBF parameters (σ) for each component of the output vector independently, which allows for more compact representations
  4. Performance degrades very slowly as a function of problem size, generalization and accuracy are very good as related to methods that do not attempt dimension reduction

Linear Complementarity for Regularized Policy Evaluation and Improvement. Johns, Painter-Wakefield, Parr. NIPS 2010

Based on a very rapid read-through before I go home

  1. Also for doing L1 regularization for feature selection
  2. Here, the problem is formulated as “… a linear complementarity problem (LCP).” <I don’t know what that means off the top of my head>
  3. Similar to LARS-TD, but has a few advantages
    1. Easier to apply off-the shelf tools
    2. Can be warm started
  4. LCPs can describe any convex quadratic problem, even though the basic setup is pretty simple
  5. Turns out this approach is mathematically very similar to LARS but this method is computationally cheaper
    1. <Confusingly, they then say that> their approach is more expensive than LARS
  6. Also, LARQ (their proposal) can get “stuck” and its difficult to correct. “This occurs when the greedy policy for a particular β is not well defined.  In such cases, the algorithm attempts to switch to a new policy immediately following a policy change.  This problem is not unique to LARQ.  Looping is possible with most approximate policy iteration algorithms.   What makes it particularly troublesome for LARQ is that there are few satisfying ways of addressing this issue without sacrificing the invariants.”
  7. They present some changes that attempt to work around these issues (for example it can’t get stuck), and call this algorithm LC-MPI
  8. In experiments they use LARS-TD and LC-TD within policy iteration
    1. Experiments used “… a single value of the L1 regularization parameter…”
  9. Experiments on 20-state chain, as well as mountain car
  10. In the chain “For features, we used 1000 Gaussian random noise features along with five equally spaced radial basis functions … and a constant function”
    1. So there is a ton of garbage and a small number of actual features
    2. <On the other hand, Ali’s stuff shows it is very simple to get rid of features that are complete garbage, like white noise.  It’s much harder to throw out features that carry information, but from other tasks>
  11. <They just rely on very short episodes with random state initialization to do the data collection/exploration>
  12. “When LARS-TD and LC-TD are used as subroutines within policy iteration, the process ends at a single value of the L1 regularization parameter β.  The policy iteration loop must be rerun to consider different values of β.  In this section, we show how much computation can be saved by running LC-MPI once (to produce m greedy policies, each at a different value of β) versus running policy iteration m separate times.”
    1. Warm starting helps a bunch

Context-dependent Computation by Recurrent Dynamics in Prefrontal Cortex. Mante, Sussillo, Shenoy, Newsome. Nature 2013.

  1. Studies macaque behavior in the face of noisy stimuli
  2. Although individual neuron responses are complex and difficult to pin down, analysis at the population level is more fruitful and can be related to a simulated recurrent NN
  3. “This mechanism indicates that selection and integration are two aspects of a single dynamical process unfolding within the same prefrontal circuits, and potentially provides a novel, general framework for understanding context-dependent computations.”
  4. Behavior can change drastically and rapidly according to changing context, related to gating
    1. “These observations have led to the hypothesis that early selection may account for the larger effect of relevant as compared to irrelevant sensory information on contextually sensitive behaviour.”
  5. In the task studied here, context-dependent behavior is required based on visual stimulus
  6. Neural responses recorded “… in and around the frontal eye field … an area of PFC involved in the selection and execution of saccadic eye movements, the control of visuo-spatial attention, and the integration of information towards visuomotor decisions.”
  7. “We found no evidence that irrelevant sensory inputs are gated, or filtered out, before the integration stage in PFC, as would be expected from early selection mechanisms.  Instead, the relevant input seems to be selected late, by the same PFC circuitry that integrates sensory evidence towards a choice.  Selection within PFC without previous gating is possible because the representations of the inputs, and of the upcoming choice, are separable at the population level, even though they are deeply entwined at the single neuron level.”
  8. Monkeys were able to act appropriately based on cue and ignore irrelevant information
  9. “As is common in PFC, the recorded responses of single neurons appeared to represent several different task-related signals at once, including the monkey’s upcoming choice, the context, and the strength of motion and color evidence (…).”
    1. So instead they attempt to understand activity at population level
  10. Most units were recorded during separate sessions <Hm…>
  11. The task involved stimulus with colored moving dots.  In some cases the color was more important for decision making and in other cases the motion was
  12. Population responses are represented as trajectories. “We focused our analyses on responses in a specific low-dimensional subspace that captures across-trial variance due to the choice of the monkey …” also based on motion, color information and which of those was relevant for decision making in that context
    1. Uses PCA and then a projection onto independent axes of choice, motion, color, and context (a rough sketch of this style of analysis appears after this list)
  13. “This population analysis yields highly reliable average response trajectories (…) that capture both the temporal dynamics and the relationships among the task variables represented in PFC.”
  14. For example, trajectories go in opposite directions based on saccade direction
    1. Likewise, patterns of activation of PFC “… are very different from those corresponding to either choice…”
  15. “Indeed, the population response does not follow straight paths along the choice axis, but instead forms prominent arcs away from it (…).  The magnitude of each arc along the axes of motion or colour reflects the strength of the corresponding sensory evidence (…), whereas its direction (up or down) reflects the sign of the evidence.”
  16. “… the signals along axes of motion and colour are transient — the arcs return to points near the choice axis by the time of dots offset.”  Because they come and go, they are called momentary evidence
  17. “Third, context seems to have no substantial effect on the direction of the axes of choice, motion and color, and only weak effects on the strength of the signals represented along these axes.”  So 3 axes are sufficient to represent the data
  18. Although direction of trajectories are invariant to context, where those lie in the state space are clearly separated spatially when considering context
  19. <Tried to upload graph of  results but WP won’t cooperate right now… anyway> Depending on what the context is (selection by color or motion), projecting the data one way or the other will either allow the data to be trivially linearly separable (with a line running through the origin), or will not be linearly separable
  20. There is however, fair separability even on items that have no behavioral effect
    1. This is unsupported by the early selection model, so there must be something else going on
  21. Then they made a recurrent network and trained it to carry out the task
  22. “As in PFC, the contextual input does not affect the strength of the sensory inputs–selection occurs within the same network that integrates evidence towards a decision.”
  23. “After training, the model qualitatively reproduces the monkeys’ behaviour, confirming that the model solves the selection problem at the ‘behavioral’ level (…).”
  24. Analysis of NN population has same characteristics as those in monitored neurons
    1. “… integration of evidence corresponds to gradual movement of the population response along the choice axis.”
    2. “… momentary motion and color evidence ‘push’ the population away from the choice axis, resulting in trajectories that are parametrically ordered along the motion and color axes.”
    3. “… direction of the axes of choice, motion and color are largely invariant with context, as are the strength of the motion and color inputs, as these are not gated before entering the network.”
    4. “… trajectories during motion and colour contexts are separated along the axis of context (…).”
  25. The main difference between neural and NN responses is that “… signals along the input axes are transient in the physiology, but not in the model, yielding PFC trajectories that curve back to the choice axis before the end of the viewing interval (…).   This difference suggests that the sensory inputs to PFC are attenuated after a decision is reached.”
  26.  “Notably, the rich dynamics of PFC [and ANN] responses during selection and integration of inputs can be characterized and understood with just two features of a dynamical system–the line attractor and the selection vector, which are defined only at the level of the neural population.  <I don’t know what either of those two things are>  This parsimonious account of cortical dynamics contrasts markedly with the complexity of single neuron responses typically observed in PFC and other integrative structures, which reveal multiplexed representation of many task-relevant and choice-related signals.  In light of our results, these mixtures of signals can be interpreted as separable representations at the level of the neural population.  A fundamental function of PFC may be to generate such separable representations, and to flexibly link them through appropriate recurrent dynamics to generate the desired behavioral outputs.”
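
For what it's worth, here is my rough reading of the population analysis in items 12-13 as a sketch on synthetic data (not the authors' code or their exact procedure): regress each unit's response on the task variables, denoise the resulting regression vectors with PCA, orthogonalize them into task axes, and project population activity onto those axes. All sizes and the data-generating model are made up.

```python
# Rough sketch of regression-based task axes: per-unit regression on task
# variables, PCA denoising, orthogonalization, projection of activity.
import numpy as np

rng = np.random.default_rng(0)
n_units, n_trials = 200, 1000

# Synthetic task variables per trial (choice, motion, color, context).
Z = np.column_stack([
    rng.choice([-1, 1], n_trials),          # choice
    rng.normal(size=n_trials),              # motion evidence
    rng.normal(size=n_trials),              # color evidence
    rng.choice([-1, 1], n_trials),          # context
])

# Synthetic "population responses": mixed selectivity plus noise.
W_true = rng.normal(size=(4, n_units))
X = Z @ W_true + rng.normal(size=(n_trials, n_units))

# 1. Per-unit linear regression coefficients for each task variable.
Z1 = np.column_stack([np.ones(n_trials), Z])
B = np.linalg.lstsq(Z1, X, rcond=None)[0][1:]      # (4, n_units)

# 2. Denoise: project coefficient vectors onto the top PCs of the data.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
PCs = Vt[:12]                                      # top 12 PCs
B_denoised = B @ PCs.T @ PCs

# 3. Orthogonalize to get task axes, then project trials onto them.
Q, _ = np.linalg.qr(B_denoised.T)                  # (n_units, 4)
proj = Xc @ Q                                      # trial-by-axis projections
print("projection matrix shape:", proj.shape)      # (n_trials, 4)
```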