Category Archives: Feature Construction

Autonomous reinforcement learning on raw visual input data in a real world application. Lange, Riedmiller, Voigtländer. International Joint Conference on Neural Networks (IJCNN) 2012

  1. Learn slot car racing from raw camera input
  2. Deep-fitted Q
  3. The slot car problem is particularly interesting because the agent is rewarded for speed, so the best policies are the ones closest to failure (losing control of the car)
  4. Time resolution at about 0.25s
  5. Camera input is 52×80 = 4160d (greyscale it seems?)
  6. They do pretraining: “The size of the input layer is 52×80 = 4160 neurons, one for each pixel provided by the digital camera. The input layer is followed by two hidden layers with 7×7 convolutional kernels each. The first convolutional layer has the same size as the input layer, whereas the second reduces each dimension by a factor of two, resulting in 1 fourth of the original size. The convolutional layers are followed by seven fully connected layers, each reducing the number of its predecessor by a factor of 2. In its basic version the coding layer consists of 2 neurons.
    Then the symmetric structure expands the coding layer towards the output layer, which shall reproduce the input and accordingly consists of 4160 neurons.” <A rough sketch of this encoder architecture appears after this list.>
  7. “Training: Training of deep networks per se is a challenge: the size of the network implies high computational
    effort; the deep layered architecture causes problems with vanishing gradient information. We use a special two-stage training procedure, layer-wise pretraining [8], [29] followed by a fine-tuning phase of the complete network. As the learning rule for both phases, we use Rprop [34], which has the advantage to be very fast and robust against parameter choice at the same time. This is particularly important since one cannot afford to do a vast search for parameters, since training times of those large networks are pretty long.”
  8. Training is done with a set of 7,000 images, with the car moving at a constant speed.
  9. “For the layer-wise pretraining, in each stage 200 epochs of Rprop training were performed.”
  10. “Let us emphasis the fundamental importance of having the feature space already partially unfolded after pretraining. A partially unfolded feature space indicates at least some information getting past this bottle-neck layer, although errors in corresponding reconstructions are still large. Only because the autoencoder is able to distinguish at least a few images in its code layer it is possible to calculate meaningful derivatives in
    the finetuning phase that allow to further “pull” the activations in the right directions to further unfold the feature space.”
  11. “Altogether, training the deep encoder network takes about 12 hours on an 8-core CPU with 16 parallel threads.”
  12. Because still images don’t include all state information (position but not velocity), they need some way to bring the missing state information in.  One option is to make the state the previous two frames; the other is to use a Kohonen map <I assume this is what they do>
  13. “As the spatial resolution is non uniform in the feature space spanned by the deep encoder, a difference in the feature space is not necessarily a consistent measure for the dynamics of the system. Hence, another transformation, a Kohonen map … is introduced to linearize that space and to capture the a priori known topology, in this case a ring.”
    1. <Need to read this part over again>
  14. Use a particular form of value FA I haven’t heard of before
  15. It seems this VFA may be an averager, not based strictly on an NN
  16. 4 discrete actions
  17. “Learning was done by first collecting a number of ’baseline’ tuples, which was done by driving 3 rounds with the constant safe action. This was followed by an exploration phase using an epsilon-greedy policy with epsilon = 0.1 for another 50 episodes. Then the exploration rate was set to 0 (pure exploitation). This was done until an overall of 130 episodes was finished. After each episode, the cluster-based Fitted-Q was performed until the values did not change any more. Altogether, the overall interaction time with the real system was a bit less than 30 minutes.” <A rough sketch of this schedule appears after this list.>
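
A minimal sketch of how I read the encoder architecture quoted in item 6. This is my reconstruction, not the authors’ code: the number of feature maps per convolutional layer, the sigmoid activations, the “same” padding, and the exact fully connected layer sizes are all assumptions the quote does not pin down.

```python
# Sketch of the encoder half of the deep autoencoder described in item 6.
# Assumptions (not given in the quote): PyTorch, one feature map per conv layer,
# sigmoid activations, "same" padding for the 7x7 kernels, integer layer sizes.
import torch.nn as nn

class DeepEncoder(nn.Module):
    def __init__(self, code_dim=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=7, padding=3), nn.Sigmoid(),            # 52x80 -> 52x80
            nn.Conv2d(1, 1, kernel_size=7, stride=2, padding=3), nn.Sigmoid(),  # 52x80 -> 26x40
        )
        # Seven fully connected layers, each half the size of its predecessor,
        # followed by the 2-neuron coding layer.
        sizes = [26 * 40, 520, 260, 130, 65, 32, 16, 8]
        fc = []
        for d_in, d_out in zip(sizes[:-1], sizes[1:]):
            fc += [nn.Linear(d_in, d_out), nn.Sigmoid()]
        fc.append(nn.Linear(sizes[-1], code_dim))
        self.fc = nn.Sequential(*fc)

    def forward(self, x):                        # x: (batch, 1, 52, 80) grayscale frames
        z = self.conv(x).flatten(start_dim=1)    # (batch, 1040)
        return self.fc(z)                        # (batch, 2) feature-space coordinates

# The decoder (not shown) mirrors this structure back out to 4160 output neurons.
```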
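
And a rough sketch of the interaction schedule quoted in item 17. The helper names (`env`, `encode`, `cluster_based_fitted_q`) are hypothetical stand-ins for the real slot-car setup and the authors’ cluster-based Fitted-Q, and the episode bookkeeping is simplified.

```python
# Rough sketch of the data-collection schedule in item 17: 3 baseline rounds with
# the constant safe action, 50 episodes of epsilon-greedy exploration (epsilon=0.1),
# then pure exploitation up to 130 episodes, refitting Q after every episode.
# `env`, `encode`, and `cluster_based_fitted_q` are hypothetical placeholders.
import random

SAFE_ACTION = 0
ACTIONS = [0, 1, 2, 3]                                   # four discrete actions (item 16)

def make_policy(q, epsilon):
    def policy(state):
        if q is None:
            return SAFE_ACTION                           # no Q yet: constant safe action
        if random.random() < epsilon:
            return random.choice(ACTIONS)                # explore
        return max(ACTIONS, key=lambda a: q(state, a))   # exploit
    return policy

def collect_and_learn(env, encode, cluster_based_fitted_q):
    transitions, q = [], None
    for episode in range(130):
        if episode < 3:
            policy = make_policy(None, 0.0)              # 'baseline' tuples
        elif episode < 3 + 50:
            policy = make_policy(q, 0.1)                 # exploration phase
        else:
            policy = make_policy(q, 0.0)                 # pure exploitation
        obs, done = env.reset(), False
        while not done:
            s = encode(obs)                              # deep encoder: raw image -> features
            a = policy(s)
            obs, r, done = env.step(a)
            transitions.append((s, a, r, encode(obs)))
        q = cluster_based_fitted_q(transitions)          # rerun Fitted-Q until values converge
    return q
```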

Predictive Projections. Sprague. IJCAI 09.

  1. algorithm for linear “projections in which accurate predictions of future states can be made using simple nearest neighbor style learning.”
  2. Experiments on simulated pendulum and real roomba robot
  3. Considers video data
  4. Based on earlier work (by Tesauro among others) in “which nearest neighbor learning is recast in a probabilistic framework that allows for gradient based optimization of the distance metric.  These existing algorithms discover projections of the training data under which nearby points are likely to have the same class label or similar regression targets.”
    1. Neighborhood Components Analysis (NCA) – for nearest neighbor classification
  5. “It would be difficult to directly find an A that minimizes the k-nearest-neighbor classification error rate, because the number of errors will be a highly discontinuous function of A; very small changes in A may change the set of nearest neighbors for some points. The innovation behind the NCA algorithm is to recast nearest neighbor learning in a probabilistic framework. In this framework, expected error is a continuous, differentiable function of A and thus may be minimized using gradient based techniques.” <The NCA neighbor model is written out after this list.>
  6. NCA was originally designed for classification but was later extended to regression
  7. A minor tweak is needed to get NCA for regression to work well on predicting vectors when the problem is stochastic
  8. Here they use conjugate gradient to do the optimization
  9. The approach can scale well; a paper from ’07 was able to run on 60k training points
  10. “Following this gradient acts to reduce the objective in two different ways. It adjusts A to be more predictive by increasing the probability that a neighbor will be chosen if it successfully predicts the next state. It also adjusts A to be more predictable, by moving target states together whenever there is a high probability they will be chosen to predict each other.”
  11. Important to whiten data first
  12. To deal with actions, partition the data by action but constrain every partition to use the same projection (a toy sketch of this appears after this list) <how is this different from just throwing out the action data?>
  13. Only parameters to the algorithm are size of projection matrix A and its initial values
  14. “The predictive projections algorithm as described above may not perform well in cases where the effects of different actions are restricted to specific state dimensions. Since there is no explicit penalty for failing to predict some dimensions, the algorithm may minimize the objective function by finding an A which is not full rank, thus accurately predicting some dimensions while discarding others.”
    1. Although there are potential fixes for this
  15. For the applications this method is coupled with LSPI (along with RBFs)
  16. The algorithm works well with Lagoudakis and Parr’s LSPI
  17. For the roomba experiment they put a camera on top – all the robot needs to do is to continue to move forwards without hitting the wall
    1. <Seems like not a terribly hard task just turn when the wall takes up most of the view?>
  19. Image is 20×20: 1200-d once color is factored in
    1. Then this is reduced via PCA to 25d
    2. Then their approach takes that to 2d (A is initialized to the first 2 principal components)
  20. Their approach basically recovers translation and rotation, while PCA recovers amount of wall present and lighting (their approach can use this to make a good policy, but that can’t be done linearly from what PCA produces)
  21. “To our knowledge, the proto-value function framework has not been applied to the type of noisy, high dimensional control problems addressed in this paper. It seems likely that the neighborhood calculations required for constructing the diffusion model could be dominated by noise dimensions, particularly in very noisy tasks such as the modified pendulum domain described above. In that case, the PVF approach and
    predictive projections would be complementary: The PP algorithm could find a low dimensional state projection that contains relevant state information, and the PVF algorithm could then be used to discover a set of appropriate basis functions in that space”
  22. “Another closely related project is the basis iteration algorithm described in [Sprague, 2007]. This algorithm also uses gradient based metric learning to discover an appropriate projection, but it focuses directly on finding a metric that allows for accurate estimation of the optimal value function. It accomplishes this by iterating value function estimation with updates to the projection matrix. This algorithm has the advantage of incorporating reward information, but it depends on starting with an initial projection that enables a reasonable estimate of the optimal value function.”
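
The NCA machinery referenced in items 4 and 5, in its standard form (taken from the NCA literature, not copied from this paper): each point $i$ stochastically picks a neighbor $j$ with probability given by a softmax over distances in the space projected by $A$,

```latex
p_{ij} = \frac{\exp\!\left(-\lVert A x_i - A x_j \rVert^2\right)}
              {\sum_{k \neq i} \exp\!\left(-\lVert A x_i - A x_k \rVert^2\right)},
\qquad p_{ii} = 0 .
```

For classification, NCA maximizes the expected number of correctly classified points, $\sum_i \sum_{j \in C_i} p_{ij}$ (where $C_i$ is the set of points sharing $i$’s label), which, unlike the raw nearest-neighbor error, is smooth in $A$ and can be optimized by gradient methods; the regression/prediction variants discussed in items 6 and 7 replace the class-agreement term with a prediction-error term.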
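
And a toy numpy sketch of the shared-projection objective mentioned in item 12, under my reading of the paper (so treat the details as assumptions): transitions are grouped by action, neighbor probabilities are computed NCA-style within each group, but one projection A is shared across all groups.

```python
# Toy sketch of a predictive-projections-style objective (my reading, not the paper's
# code): soft nearest-neighbor prediction of the projected next state, with neighbor
# sets restricted to transitions that used the same action, and a single shared A.
import numpy as np

def pp_objective(A, X, X_next, actions):
    """A: (k, d) projection; X, X_next: (n, d) whitened observations; actions: (n,) ints."""
    total = 0.0
    for a in np.unique(actions):
        idx = np.where(actions == a)[0]
        if len(idx) < 2:
            continue                                          # need at least two transitions
        Z, Z_next = X[idx] @ A.T, X_next[idx] @ A.T
        d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
        np.fill_diagonal(d2, np.inf)                          # a point never picks itself
        P = np.exp(-d2)
        P /= P.sum(axis=1, keepdims=True)                     # NCA-style neighbor probabilities
        pred = P @ Z_next                                     # expected neighbor's next state
        total += ((pred - Z_next) ** 2).sum()                 # squared prediction error
    return total
```

Minimizing this with respect to A (e.g., by conjugate gradient, as in item 8) makes the projection both more predictive and more predictable, in the sense of the quote in item 10.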

State Representation Learning in Robotics: Using Prior Knowledge about Physical Interaction. Jonschkowski, Brock. RSS 2014

related to https://aresearch.wordpress.com/2015/04/18/learning-task-specific-state-representations-by-maximizing-slowness-and-predictability-jonschkowski-brock-international-workshop-on-evolutionary-and-reinforcement-learning-for-autonomous-robot-s/

  1. Uses the fact that robots interact with the physical world to set constraints on how state representations are learnt
  2. Test on simulated slot car and simulated navigation task, with distractors
  3. How to extract a low dimensional representation relevant to the task being undertaken from high dimensional sensor data?
  4. The visual input in the experiments is 300-D
  5. From the perspective of RL
  6. “According to Bengio et al. [1], the key to successful representation learning is the incorporation of “many general priors about the world around us.” They proposed a list of generic priors for artificial intelligence and argue that refining this list and incorporating it into a method for representation learning will bring us closer to artificial intelligence.”
  7. “State representation learning is an instance of representation learning for interactive problems with the goal to find a mapping from observations to states that allows choosing the right actions. Note that this problem is more difficult than the standard dimensionality reduction problem, addressed by multi-dimensional scaling [14] and other methods [23, 29, 6] because they require knowledge of distances or neighborhood relationships between data samples in state space. The robot, on the other hand, does not know about semantic similarity of sensory input beforehand. In order to know which observations correspond to similar situations with respect to the task, it has to solve the reinforcement learning problem (see Section III), which it cannot solve without a suitable state representation.”
  8. State representation learning can be done by:
    1. Deep autoencoders
    2. SFA (and its similarity to proto-value functions)
    3. Predictability / Predictive actions.  Points to a Michael Bowling paper <I haven’t read – will check out>
  9. Bunch of nice references
  10. The Robotic priors they care about (these are all defined mathematically later):
    1. Simplicity: For any task, only a small number of properties matter
    2. Temporal coherence: important properties change gradually over time
    3. Proportionality: Amount of change in important properties is proportional to action magnitude
    4. Causality: important properties and actions determine the reward
    5. Repeatability: Same as causality but in terms of transition not reward
  11. These properties hold for robotics and physical systems but aren’t necessarily appropriate to all domains
    1. Even in robotics these sometimes don’t hold (for example, a robot running into a wall will have an abrupt change in velocity; once pushed against the wall, proportionality doesn’t hold because no amount of pushing will allow the robot to move)
  12. They set learning up as an optimization problem that includes a loss term for each of the priors above (just a linear combination; these terms are sketched after this list)
  13. <these priors aren’t directly applicable to the mocap work I am doing now because they involve actions, which we don’t have access to>
  14. Formally they would need to compare all pairs of samples, leading to an n^2 number of loss terms, but they restrict this to a window
  15. Linear mapping from observations to states
  16. Epsilon greedy exploration, although there is a bias to repeat the previous action
  17. Have distractors in their simulations
  18. In the navigation task they demonstrate an invariance to perspective by using either an overhead or first-person view from the robot
    1. Representations learned are highly similar
  19. Learned in the course of 5,000 observations
  20. The observations in the experiments are 300-dimensional (or 100), not 300×300 images
  21. For the simulated slot car task, the state sample matrix has rank 4.  Two large eigenvalues correspond to the position of the controlled car, and two smaller eigenvalues correspond to the position of the distractor car
    1. Ideally the distractor shouldn’t show up at all, but because of things like stochasticity and limited samples, weight can be placed on it to explain events that it is not related to
  22. They then do an RL comparison based on a number of different methods for learning the representation (5 features extracted by):
    1. Their approach
    2. SFA
    3. PCA
    4. Raw 300D representation
    5. (They also compare to ground truth representation)
  23. Use neural fitted Q
  24. <screenshot of results omitted>
  25. SFA features are really terrible, their approach is pretty similar to operating from the ground truth
  26. With further investigation on the same results they demonstrate that their method has very good generalization properties
  27. Conjecture that the primary difference between their approach and the other dimensionality-reduction methods is that the others didn’t learn to disregard the distractors
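
My reconstruction of the loss terms behind items 10-12 (from memory of the paper, so the exact forms should be checked against it; $s_t = \phi(o_t)$ is the learned state, $\Delta s_t = s_{t+1} - s_t$, and pairs $(t_1, t_2)$ are drawn from the comparison window of item 14):

```latex
\begin{aligned}
L_{\text{temporal}} &= \mathbb{E}\big[\, \lVert \Delta s_t \rVert^2 \,\big] \\
L_{\text{prop}}     &= \mathbb{E}\big[\, (\lVert \Delta s_{t_2} \rVert - \lVert \Delta s_{t_1} \rVert)^2 \mid a_{t_1} = a_{t_2} \,\big] \\
L_{\text{caus}}     &= \mathbb{E}\big[\, e^{-\lVert s_{t_2} - s_{t_1} \rVert^2} \mid a_{t_1} = a_{t_2},\; r_{t_1+1} \neq r_{t_2+1} \,\big] \\
L_{\text{rep}}      &= \mathbb{E}\big[\, e^{-\lVert s_{t_2} - s_{t_1} \rVert^2}\, \lVert \Delta s_{t_2} - \Delta s_{t_1} \rVert^2 \mid a_{t_1} = a_{t_2} \,\big]
\end{aligned}
```

The overall objective is a weighted linear combination of these four terms; simplicity is not a separate loss term but is enforced by keeping the learned mapping linear and low dimensional.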

Autonomous Learning of State Representations for Control: An Emerging Field Aims to Autonomously Learn State Representations for Reinforcement Learning Agents from Their Real-World Sensor Observations. Böhmer, Springenberg, Boedecker, Riedmiller, Obermayer. Künstliche Intelligenz 2015.

  1. Trying to do robotics control directly from high-dimensional sensor readings.  The trick is that with robots experience is costly, so you need to use data very efficiently <the DeepMind guys were able to run each game for the equivalent of over a month because it’s just in simulation>
  2. Review 2 methods for extracting features: deep auto-encoders and slow feature analysis
  3. Mention deep learning’s deep Q network
  4. Until DQN, using FAs had been largely unsuccessful, and had generally worked only on toy domains <although Tesauro’s TD-Gammon is a significant exception>
  5. “In the last years several researchers have also considered end-to-end learning of behavioral policies π, represented by general function approximators. First attempts towards this include an actor-critic formulation termed NFQ-CA [14], in which both the policy π and the Q-function are represented using neural networks and learning proceeds by back-propagating the error from the Q-network directly into the policy network—which was, however, only applied to low dimensional control problems.”
  6. Also cites deterministic policy gradient https://aresearch.wordpress.com/2014/06/09/deterministic-policy-gradient-algorithms-silver-lever-heess-degris-wierstra-riedmiller-icml-2014/, which worked on a 30-DOF arm
  7. Cites Deep Fitted Q (same lab; papers from a couple of years prior that were also applied to slot cars and an inverted pendulum from raw camera images)
  8. The first part of deep fitted Q is unsupervised, setup as an autoencoder
  9. In order for the RL to work, it is important that the problem has low “intrinsic dimensionality” (the manifold it lies on)
    1. For example, the image a camera gets from a robotic inverted pendulum can be arbitrarily large, but the underlying dimension of the problem is just 2
  10. They do not assume a POMDP; they assume that the high-dimensional input completely describes the problem <although this is almost certainly not actually the case with at least the slot car, but that is not a ding on their work, it just makes it more impressive>
  11. The representation chosen for the state depends on an optimization problem
  12. Train an autoencoder to minimize MSE
  13. DFQ is a clustering-based approach<?>
    1. There are some problems with DFQ that they will discuss later, but:
    2. The representation learned by DFQ isn’t reward-based, so the representation doesn’t take the reward into account
    3. “learning auto-encoders for inputs with high variability (i.e. many objects of relevance) can be hard” Although says regularization can help this, and cites a number of other papers that deal with this
  14. “Despite these drawbacks DFQ also comes with advantages over end-to-end learning: since the auto-encoder merely learns to reconstruct sampled observations and it can be fed with samples generated by any sampling policy for any task, is thus less susceptible to non-stationarity of the training data.”
  15. Mentions things like proto-value functions, laplacian eigenmaps etc for discrete RL, relationship between SFA and LEM and PVF
  16. “From a theoretical point of view, SFA and PVF both approximate subspace-invariant features. This classification has its origin in the analysis of approximation errors in linear RL [39]. Here subspace-invariant features induce no errors when the future reward is propagated back in time. It can be shown that under these conditions the least-squares temporal difference algorithm (LSTD, [9]) is equivalent to supervised least-squares regression of the true value function [5]. However, this is only possible for the class of RL-tasks with a self-adjoint transition model. As this class is very rare, both SFA and PVF substitute a self-adjoint approximation of the transition model to compute almost subspace-invariant representations. An analysis of the optimal solution shows that SFA approximates eigenfunctions of the symmetrized transition operator [50]. Moreover, with a Gaussian prior for the reward, one can show that SFA representations minimize a bound on the expected LSTD error of all tasks in the same environment [5]. However, as the solution depends on the sampling distribution, straight forward application for transfer learning is less obvious than in the case of PVF. Future works may rectify this with some sensible importance sampling, though.” <The LSTD fixed point is written out after this list.>
  17. When the data completely covers the space, PVF and SFA are pretty equivalent, but when data is generated from a random walk, PVF fails and SFA still manages to do almost as well as when there is perfect coverage
  18. In less artificial settings/domains there is less of a performance gap between PVF and SFA, but it still exists; SFA seems to do a better job generating features for LSTD and LSPI
  19. In these problems, the representation learned must:
    1. Maintain Markov property (no partial observability)
    2. Be able to represent the value function well enough to bootstrap from it and improve the policy
    3. Generalize well (especially to cases that may not be well covered by the training distribution)
    4. Be low dimensional
  20. “In stochastic environments one can only compare probability distributions over future states based on a random policy, called diffusion distances. It can be shown that SFA approximates eigenfunctions of the symmetrized transition operator, which encode diffusion distances [5]. SFA features are therefore a good representation for nonlinear RL algorithms as well.”
    1. <Depends how you define “stochastic environments” I assume they mean environments where there is no choice in action selection at all because otherwise what is written is incorrect>
  21. “In summary, SFA representations X seem in principle the better choice for both linear and non-linear RL: nonlinear SFA extracts eigenfunctions of the transition model P^π, which are the same for every isomorphic observation space Z, encode a diffusion metric that generalizes to states with similar futures and approximates a Fourier basis of the (unknown) underlying state space.”
  22. Empirical work shows SFA works best when exposed to data that gets uniform coverage
  23. “a Fourier basis as approximated by SFA grows exponential in the underlying state dimensionality. Linear algorithms, which depend on this basis to approximate the value function, are therefore restricted to low dimensional problems with few or no variables unrelated to the task. Non-linear RL algorithms, on the other hand, could work in principle well with only the first few SFA features of each state-dimension/variable. The order in which these variables are encoded as SFA features, however, depends on the slowness of that variable. This can in practice lead to absurd effects. Take our example of a wheeled robot, living in a naturally lit room. The underlying state space that the robot can control is three-dimensional, but the image will also depend on illumination, that is, the position of the sun.”
    1. Describes this as being vulnerable to slow distractors
  24. “Also, auto-encoders minimize the squared error over all input dimensions of Z equally. This can produce incomplete representations if a robot, for example, combines observations from a camera with measurements from multiple joints. Due to the large number of pixels, small improvements in the reconstruction of the image can outweigh large improvements in the reconstruction of the joint positions.”
  25. Compares results from a deep autoencoder and an NN that has objectives of slowness and predictability.  The latter produces a much better output <but this is really a garbage comparison because the former is based on actual physical experiments and the latter on an extremely oversimplified clean simulation.>
  26. “Take the example of uncontrollable distractors like blinking lights or activity outside a window. Each distractor is an independent variable of the isomorphic state S, and to learn an isomorphic representation X requires thus samples from all possible combinations of these variables.  The required training samples grow exponentially in the number of distractors.”
    1. <The XSENS suit has tons of blinking lights on it.  I was thinking about this as a potential distractor for the methods we are working on…>
  27. Moves on to a discussion of “factored representations and symbolic RL” at the end of the paper
  28. Basically discuss object oriented RL (discuss it in terms of relational RL)
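
As background for the quote in item 16, the standard LSTD/least-squares fixed point for a linear value function $V^\pi \approx \Phi w$ (textbook form, not taken from this survey): with sampling distribution $D$ (diagonal), transition operator $P^\pi$, and reward $R^\pi$,

```latex
w_{\text{LSTD}} = \big( \Phi^\top D (\Phi - \gamma P^\pi \Phi) \big)^{-1} \Phi^\top D R^\pi .
```

"Subspace-invariant" features in the sense of the quote are, roughly, bases $\Phi$ for which $P^\pi \Phi$ stays inside $\mathrm{span}(\Phi)$, so that propagating reward back in time introduces no projection error and LSTD coincides with least-squares regression onto the true value function.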

Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning. Guo, Singh, Lee, Lewis, Wang. NIPS 2014

  1. Model-based Atari planners are better than the famous DQN, but very slow
  2. Goal here is to build a real-time MCTS agent better than DQN
  3. Use UCT to train NN, 300 max depth, 10k trajectories (minor modifications for a couple of games)
  4. A few particular approaches:
    1. Do 800 episodes in the game under UCT’s policy, and use the values computed by UCT as regression targets to train the NN
    2. Use UCT as above, but train the NN as a classifier that picks the action UCT chose
      1. One issue is that the policy generated by these approaches will in general be different from that found by UCT during search, so the values will not be for the policy used
    3. So they propose an “interleaved” approach: UCT is still run at each visited state to compute the values/action labels used for training, but the actions actually taken to generate the data come from the NN (sketched after this list)
  5. Learn in 84×84 greyscale
  6. Use same NN arch as DQN
  7. UCT with classification is as good as or better than DQN; UCT with regression is worse
  8. To be consistent with the DQN paper, they leave in 5% exploration when recording performance, although it doesn’t make any sense to do that when evaluating performance
  9. UCT without any realtime constraints really beats up on everything (roughly double score of best)
  10. Also, the emulator runs at 60 fps, not 30 that might be expected <the 60 is consistent with atari hardware, I checked>
  11. The interleaved method’s performance jumped 10% when the number of trajectories was doubled
  12. <I thought they were going to try a DYNA 2 approach, wonder how it would work>
  13. They observed some horizon effects in UCT where some good behavior requires really long timescales that couldn’t be reasoned about
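
A rough sketch of the interleaved scheme (the third approach under item 4) as I understand it; the helper names and the round/episode counts are hypothetical, not the authors’ code. The point is that UCT still produces the training targets at every visited state, but the emulator is driven by the partially trained NN, so the data matches the policy that will actually be executed.

```python
# Rough sketch of interleaved UCT-to-classification data collection (item 4.3).
# `emulator`, `uct_plan`, `train_classifier`, and the counts are hypothetical stand-ins.
def interleaved_training(emulator, uct_plan, train_classifier,
                         n_rounds=3, episodes_per_round=200):
    dataset, policy = [], None
    for _ in range(n_rounds):
        for _ in range(episodes_per_round):
            frame, done = emulator.reset(), False
            while not done:
                # UCT search from the current emulator state gives the action label...
                uct_action = uct_plan(emulator.clone_state())
                dataset.append((frame, uct_action))
                # ...but the action actually executed comes from the current NN,
                # so later data is gathered (approximately) on the NN's own policy.
                a = uct_action if policy is None else policy(frame)
                frame, done = emulator.step(a)
        policy = train_classifier(dataset)   # refit the NN on all UCT labels so far
    return policy
```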

Linear Complementarity for Regularized Policy Evaluation and Improvement. Johns, Painter-Wakefield, Parr. NIPS 2010

Based on a very rapid read-through before I go home

  1. Also for doing L1 regularization for feature selection
  2. Here, the problem is formulated as “… a linear complementarity problem (LCP).” <I don’t know what that means off the top of my head; a standard definition is written out after this list>
  3. Similar to LARS-TD, but has a few advantages
    1. Easier to apply off-the shelf tools
    2. Can be warm started
  4. LCPs can describe any convex quadratic problem, even though the basic setup is pretty simple
  5. Turns out this approach is mathematically very similar to LARS but this method is computationally cheaper
    1. <Confusingly, they also say that> their approach is more expensive than LARS
  6. Also, LARQ (their proposal) can get “stuck” and its difficult to correct. “This occurs when the greedy policy for a particular β is not well defined.  In such cases, the algorithm attempts to switch to a new policy immediately following a policy change.  This problem is not unique to LARQ.  Looping is possible with most approximate policy iteration algorithms.   What makes it particularly troublesome for LARQ is that there are few satisfying ways of addressing this issue without sacrificing the invariants.”
  7. They present some changes that attempt to work around these issues (for example it can’t get stuck), and call this algorithm LC-MPI
  8. <Only got to here>
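
Since item 2 flags the term: a linear complementarity problem in standard form (textbook definition, not specific to this paper) is, given a matrix $M \in \mathbb{R}^{n \times n}$ and a vector $q \in \mathbb{R}^n$, to find $z, w \in \mathbb{R}^n$ such that

```latex
w = q + Mz, \qquad w \ge 0, \qquad z \ge 0, \qquad z^\top w = 0 ,
```

i.e., $w$ and $z$ are nonnegative and complementary (for each $i$, at most one of $w_i$, $z_i$ is nonzero). The KKT conditions of any convex quadratic program can be written in this form, which is the sense of item 4.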

Automated Derivation of Behavior Vocabularies for Autonomous Humanoid Motion. Jenkins, Mataric. AAMAS 2003

  1. Automatic derivation of motion primitives and means of combining them from human motion data using isomap
  2. For robot control, useful to have a restricted set of primitives to search over to do motion instead of planning in full space which is huge.  Sometimes this set can be defined by hand, but this has obvious limitations (time-intensive, requires domain expertise, even then may be incorrect)
  3. Paper extends isomap to extract spatio-temporal structure
    1. Finding motion primitives relies on additional clustering and interpolation
  4. Here they use the method for motion synthesis
  6. “In Motion Textures, <another approach> motion segments are learned so they can be accurately reproduced by a linear dynamical system within some error threshold.”
    1. Here the goal is “… to produce primitive behaviors that have an observable theme or meaning.”
    2. Method here requires some domain knowledge
  7. A number of references, but this work and the motion texture seem the most relevant
  8. “Each  extracted feature represents a group of motions with the same underlying theme and can then be used to construct a primitive motion module realizing the theme.”
  9. Straight PCA has been used for this, but the features that come out of it aren’t really useful
    1. Can also do PCA and then clustering on top of that, but unless the # of clusters is known a priori, the results also come out pretty meaningless.  Also, the clustering only gives meaningful results on data that falls within the corpus used; it doesn’t extrapolate well
  10. Using vanilla Isomap, locally linear embedding, or kernel PCA will all have the same qualitative drawbacks as vanilla PCA
  11. Here they do Isomap, but add temporal component
  12. <screenshot omitted>
  13. “The Isomap embedding unravels the spatial structure of the S-curve removing the ‘s’ nonlinearity, producing the flattened data indicative of the 2-manifold structure of the S-curve.  However, the model that has generated the data are a 1-manifold with an S-curve nonlinearity and multiple translated instances.  Spatio-temporal Isomap produces an embedding indicative of this 1-manifold structure.  This embedding both unravels the S-curve nonlinearity and collapses corresponding points from multiple instances of the S-curve to a single point.”
  14. The idea behind Isomap (as well as kernel PCA) is to do an eigendecomposition of a similarity matrix in feature space, as opposed to the covariance matrix used in PCA (a toy sketch of this step appears after this list)
    1. “The feature space is a higher dimensional space in which a linear operation can be performed that corresponds to a nonlinear operation in the input space.  The caveat is that we cannot transform the data directly to feature space.  For performing PCA in a feature space, however, we only require the dot-product (or similarity) between every pair of data points in feature space.  By replacing the covariance matrix C with the similarity matrix D, we fit an ellipsoid to our data in feature space that produce nonlinear PCs in the input space.”
  15. Spatio-temporal isomap works the same way, but an extra step is added to deal with temporal dependencies.  As compared to vanilla isomap “If the data have temporal dependency (i.e., a sequential ordering), spatial similarity alone will not accurately reflect the actual structure of the data.”
  16. They tried two methods of dealing with temporal dependencies
  17. Spatial neighbors are replaced by adjacent temporal neighbors, which introduces a first-order Markov dependency into the embedding
  18. Creates connected temporal neighbors (CTNs), as well as CTN connected components
    1. “Points not in this CTN connected component will be relatively distal.  Thus, CTN connected components will be separable in the embedding through simple clustering.”
  19. Processing kinematic motion:
    1. 1st iteration makes clusterable groups of motion, primitive feature groups
    2. Interpolation is done on primitive feature groups to make parameterized primitive motion modules
    3. Spatio-temporal isomap is again applied to first embedding to make more clusterable groups of motion called behavior feature groups
    4. From a behavior feature group, meta-level behavior encapsulates component primitives and links them with transition probabilities
  20. Feels like it’s doing things pretty similar to slow feature analysis; “Incremental Slow Feature Analysis: Adaptive Low-Complexity Slow Feature Updating from High-Dimensional Input Streams” does mention this work as similar but a little different.
  21. First step requires motion segmenting.  They use both ground-truth segmentation and an algorithm called kinematic centroid segmentation
    1. SFA doesn’t require this, the segmenting falls out by itself
    2. <I think this is a bit of a deal breaker for us>
  22. Seems like they use a pretty simple clustering method <they call it “sweep and prune”, never heard of it>
  23. Set of motion segments in each feature group are treated as exemplars.  Variations on that can be constructed by interpolating (many different ways of doing the interpolation) between exemplars
  24. Processing sequence does:
    1. primitive features -> behavior features -> meta-level behaviors
  25. <skimming the rest because the segmentation requirement for initial processing probably makes this unusable for us>
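
A toy numpy sketch of the eigendecomposition step referenced in item 14, the machinery Isomap and kernel PCA share (this is the textbook classical-MDS step, not the paper’s spatio-temporal variant): embed points from a matrix of pairwise (for Isomap, geodesic) distances.

```python
# Classical MDS: the embedding step shared by Isomap / kernel PCA (item 14).
# D is an (n, n) matrix of pairwise distances (geodesic distances in Isomap);
# returns an (n, n_components) embedding. Textbook procedure, not the paper's code.
import numpy as np

def classical_mds(D, n_components=2):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                      # double-centered "similarity" matrix
    evals, evecs = np.linalg.eigh(B)                 # B is symmetric, so eigh is safe
    order = np.argsort(evals)[::-1][:n_components]   # keep the largest eigenvalues
    return evecs[:, order] * np.sqrt(np.maximum(evals[order], 0.0))
```

Spatio-temporal Isomap, as described in items 15-18, changes how the distance/similarity matrix D is built (temporal neighbors, CTNs) but reuses this same embedding step.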

Learning Representation and Control in Markov Decision Processes: New Frontiers. Mahadevan

  1. Approach is to solve MDPs by iteratively considering low-D reps and approx optimal policies
  2. “Exact solutions of discounted and average-reward MDPs are expressed in terms of a generalized spectral inverse of the Laplacian called the Drazin inverse.”
  3. Algorithm is called representation policy iteration (RPI)
    1. Model-free and model-based versions are given
  4. “Two approaches for dimensionality reduction of MDPs are described based on geometric and reward-sensitive regularization, whereby low-dimensional representations are formed by diagonalization or dilation of Laplacian operators.”
  5. “… the aim is to solve MDPs by automatically finding ‘low-dimensional’ descriptions of ‘high-dimensional’ functions on a state (action) space.”
    1. The relevant functions are the policy, transitions, and value
  6. Method tries to find representations based on the underlying manifold dynamics
    1. Means method is usable in continuous (state) problems
  7. In the Laplacian, off-diagonal elements are nonpositive and row sums are 0
    1. Means they are singular and therefore in general don’t have a direct inverse
    2. Moore-Penrose pseudoinverse is the most common method when direct inverse isn’t possible
    3. But something called the Drazin inverse is for spectral cases and is very important in study of Markov chains
  8. In particular, for an MDP M with transition matrix T, care about
    1. A=I-T
    2. As well as the Drazin inverse of A
    3. But then they go on to call A “L” and T “P”, so L = I - P (the defining conditions of the Drazin and Moore-Penrose inverses are written out after this list)
  9. Getting a good low-dimensional representation depends on finding the right basis functions that span the underlying manifold
  10. Deciding what basis to use is fundamentally about making tradeoffs – for example, in a discrete domain, using tabular methods will yield optimal results but will be expensive and noncompact; using a sparser representation is cheaper in some ways <although with my critical CS hat on, I think the big-O costs are really the same, as there are order-3 space and computational costs in the linear algebra that don’t go away even when using fewer bases>
  11. “… the basis construction problem may require as much or more computational effort than required to solve the original MDP.”
    1. On the other hand, in the case where transfer can be utilized, this extra cost can be amortized across instances
  12. Two main classes of methods are considered for basis construction in MDPs:
    1. Diagonalization (of the graph Laplacian)
    2. Dilation: “For example, a dilation operator on the space of functions on real numbers is Tf(x) = f(2x).  Several dilation-based approaches will be compared, including methods based on Krylov spaces, a standard approach of solving systems of linear equations […].  Applied to MDPs, this approach results in the reward function being ‘dilated’ by powers of some operator, such as the transition matrix […].  A novel basis construction method called Drazin bases is described in this paper, which uses the Drazin inverse of the Laplacian LD… the discounted value function of a MDP can be written in terms of a series of powers of the Drazin inverse.”
    3. Another option is wavelets, or more specifically for this case, diffusion wavelets are also based on dilation.
  13. Considers average and discounted reward cases (for average reward cases, value is called gain)
  14. Mentions Laplacians for directed graphs <but I’m pretty sure this isn’t an option for arbitrary directed graphs, only special cases.>
  15. In general, need to consider “generalized inverses” of the Laplacian – it may be noninvertible if it is not of full rank
    1. There are 6 properties the actual inverse needs to satisfy; the Moore-Penrose pseudo inverse satisfies 4 and will be used for inverses related to value functions
    2. The Drazin inverse is important in the study of Markov chains
  16. <Paper is quite long – skimming most of it>
  17. The graph Laplacian induces a reproducing kernel Hilbert space (RKHS)
  18. Approximate methods for solving MDPs are discussed: Least-squares, linear programming, Hilbert space
    1. “All these approaches depend crucially on a choice of basis functions for constructing a low-dimensional representation of the MDP, and assume these are provided by the human designer. <I think topic of basis construction is addressed in a later chapter>”
  19. “An elegant and general way to formalize value function approximation builds on the framework of approximation in a type of vector space called a Hilbert space […].”
  20. A Hilbert space is a vector space where lengths and angles are defined by an inner product between any two vectors (for example, Euclidean space is a Hilbert space, with the inner product simply being f^T g).
  21. For MDPs, distance has to be weighted according to frequency of visitation of states
  22. If the bases are orthonormal, the projection to find the most similar elements can be defined by an abstract Fourier series
  23. Least-squares approaches consist of those based on Bellman residual, and that of fixed points
    1. <This stuff doesn’t work, moving along>
  24. LSTD
  25. LSPI
  26. VI via RKHS is related to support vector regression
  27. “… it is possible to show that a least-squares approximation of the value function on a basis matrix Φ is equivalent to solving an exact ‘low-dimensional’ MDP…”
  28. Given bases and policy, an approximate low-D reward and transition model are defined
  29. By Parr: Given bases, the exact solution to approx policy evaluation defined by low-D rewards and transitions is the same as the fixed point solution of exact policy evaluation on the full MDP projected onto the bases.
    1. The computational complexity is reduced from cubic in n <the number of states I assume> to cubic in k <the number of bases I assume>
  30. “For continuous <state> MDPs, the basis functions are represented on a set of ‘sampled’ states.”
  31. Considers cases of reward-aware and reward-agnostic basis functions
  32. Construction of bases according to policies is done during RPI
    1. “It is possible to make the policy-specific bases additionally customized to a particular reward, or indeed, to have them invariant to the reward function.  At every new round of policy iteration, a new set of bases will have to be generated.  Consequently, in this formulation, the bases constructed have a fairly short life span, and multiple sets of bases have to be constructed for solving just a single MDP. On the other hand, by tuning the bases to a specific policy and reward function, it is possible to develop some theoretical guarantees on how well they can approximate Vπ.”
  33. There are also incremental (one basis at a time) and batch methods (all at once) of basis construction
    1. Bases may also be constructed incrementally from samples taken from the MDP
  34. One method of basis construction is state aggregation
    1. Discusses clustering according to Bellman residual – error bounds are given
  35. Onto “Finding Invariant Subspaces of a MDP”
    1. “Unfortunately, transition matrices P_a are often not reversible, in which case, diagonalization leads to complex eigenvectors. One solution is to replace P_a with a related stochastic matrix, i.e., reversible and whose eigenspaces provide a suitable space to compress a MDP.”
    2. On the other hand “For the sub-class of diagonalizable transition matrices represented by reversible Markov chains, the transition matrix is not only diagonalizable, but there is also an orthonormal basis.”
  36. <Timing out on this one – think I basically get the idea from his other works>
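
For items 7, 8, and 15, the defining conditions of the two generalized inverses (standard definitions, not specific to this monograph): the Moore-Penrose pseudoinverse $A^+$ satisfies $A A^+ A = A$, $A^+ A A^+ = A^+$, with $A A^+$ and $A^+ A$ symmetric, while the Drazin inverse $A^D$ of $A = I - P$ (with index $k$, the smallest $k$ such that $\operatorname{rank}(A^{k+1}) = \operatorname{rank}(A^k)$) is the unique matrix satisfying

```latex
A^D A A^D = A^D, \qquad A A^D = A^D A, \qquad A^{k+1} A^D = A^k .
```

Because $L = I - P$ is singular for a stochastic $P$ (its row sums are zero), it is these generalized inverses, rather than $L^{-1}$, that appear in the exact solutions mentioned in item 2.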

Basis Function Construction for Hierarchical Reinforcement Learning. Osentoski, Mahadevan. AAMAS 2010.

  1. For semi-MDPs
  2. “By combining methods for automatic basis construction with HRL methods, such as MAXQ, we show how a hierarchical function approximator can be automatically formed based on a specific task hierarchy.”
  3. Spectral (graph Laplacian) approach that works from trajectory information (nodes are states and edges are actions – assumes deterministic?)
  4. In Taxi, VFs can be constructed by conditioning on particular elements of the feature vector (like a decision tree)
  5. A task hierarchy decomposes an SMDP into smaller, simpler subtasks
  6.  “Our approach is divided into three steps.  The first step constructs a graph for each subtask from the agent’s experience.  We show how graph construction can leverage the abstractions provided with the task hierarchy.  The second step constructs a reduced graph based upon the graph structure.  The third step recursively constructs basis functions using basis functions from child subtasks as well as using the eigenvectors of the graph Laplacian from the reduced graph.”
  7. Because these spectral methods work on undirected graphs, they just keep track of state transitions and basically throw out the actions
  8. There is a simple method of merging vertices (called a reduced graph).
  9. Then eigenvectors of this <?>
  10. Consider 3 types of abstraction:
    1. Subtask irrelevance: eliminate state variables that are irrelevant to subtask.  This occurs from the reduced graph construction: “If the state variables in Y_i have no bearing on the probability transition function, they will be irrelevant in terms of connectivity on the graph and only X_i will be used to represent the state variables.”
    2. Shielding: If there is some state that causes a path to some option to terminate <prematurely?>, it won’t be represented
    3. Result distribution irrelevance: If the outcome of an option depends only on a particular feature <?>
  11. Empirical results in taxi and a manufacturing domain

Constructing Basis Functions from Directed Graphs for Value Function Approximation. Johns, Mahadevan. ICML 2007.

<I discussed this with Dr. Professor Edi and he confirmed that although the formulation allows for the representation of some directed graphs, it cannot represent arbitrary directed graphs, so it won’t properly represent dynamics in general… although it’s probably better than trying to shoehorn an undirected approach onto a directed graph.  Fundamentally, for the general case, you get complex results from the eigendecomposition.>

  1. Graph Laplacian for directed graphs, where adjacent (in one direction) states may have very different value functions.
  2. “We provide an analysis using the Dirichlet sum of the graph Laplacian to show how the smoothness of the basis functions is affected by the graph’s invariant distribution.”
  3. The standard graph Laplacian only works when the weights are symmetric
  4. Graph must be strongly connected and aperiodic – pretty mild assumptions
  5. Equations for the combinatorial and normalized directed Laplacians are not terribly complex
    1. <I take that back – it needs something called the Perron vector of the transition matrix, but it requires this weird manipulation that pushes the actual transition distribution in the direction of the uniform distribution.  It’s not totally clear if this always has to be done, or just in special-case graphs.  Need to learn more before I make a decision about it.  Here it’s computed by an iterative update.  The relevant equations are written out after this list.>
  6. <It’s not yet clear to me how much this actually fixes the problem – would have to spend more time, because in the end you still create an undirected graph, which seems like it shouldn’t be able to represent everything in a directed graph?>
  7. “The undirected Laplacian is a special case of the directed Laplacian.”
  8. Test on 2-room gridworld (where rooms are connected by directional doors), inverted pendulum (which is done by basically keeping track of balls around sampled points)
  9. <Results aren’t very impressive – performance in pendulum is flat for 30-50 basis functions, and is about 20% below optimal.>
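
My reconstruction of the directed Laplacians referenced in items 5 and 5.1 (following Chung’s formulation, which this paper builds on; details should be checked against the paper): with transition matrix $P$, Perron vector $\psi$ satisfying $\psi^\top P = \psi^\top$, and $\Psi = \operatorname{diag}(\psi)$,

```latex
L = \Psi - \frac{\Psi P + P^\top \Psi}{2},
\qquad
\mathcal{L} = I - \frac{\Psi^{1/2} P\, \Psi^{-1/2} + \Psi^{-1/2} P^\top \Psi^{1/2}}{2}.
```

The “weird manipulation” in item 5.1 is, as I read it, a PageRank-style teleportation $P_\eta = (1-\eta)P + \tfrac{\eta}{n}\mathbf{1}\mathbf{1}^\top$, which makes the chain strongly connected and aperiodic so that $\psi$ exists, is unique, and can be computed by power iteration. Note that both $L$ and $\mathcal{L}$ are symmetric, which is why (item 6) the construction still yields a real eigendecomposition, at the price of symmetrizing the dynamics.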