## Learning Representation and Control in Markov Decision Processes: New Frontiers. Mahadevan

1. Approach is to solve MDPs by iteratively considering low-D reps and approx optimal policies
2. “Exact solutions of discounted and average-reward MDPs are expressed in terms of a generalized spectral inverse of the Laplacian called the Drazin inverse.”
3. Algorithm is called representation policy iteration (RPI)
1. Model-free and model-based versions are given
4. “Two approaches for dimensionality reduction of MDPs are described based on geometric and reward-sensitive regularization, whereby low-dimensional representations are formed by diagonalization or dilation of Laplacian operators.”
5. “… the aim is to solve MDPs by automatically finding ‘lowdimensional’ descriptions of ‘high-dimensional’ functions on a state (action) space.”
1. The relevant functions are the policy, transitions, and value
6. Method tries to find representations based on the underlying manifold dynamics
1. Means method is usable in continuous (state) problems
7. In the Laplacian off-diagonal elements are nonpositive, and row sums are 0
1. Means they are singular and therefore in general don’t have a direct inverse
2. Moore-Penrose pseudoinverse is the most common method when direct inverse isn’t possible
3. But something called the Drazin inverse is for spectral cases and is very important in study of Markov chains
8. In particular, for an MDP M with transition matrix T, care about
1. A=I-T
2. As well as the Drazin inverse of A
3. But then the go on to call AL, and T, P, so L=I-P
9. Getting a good low-dimensional representation depends on finding the right basis functions that span the underlying manifold
10. Deciding what basis to use is fundamentally about making tradeoffs – for example, in a discrete domain, using tabular methods will yield optimal results but will be expensive and noncompact; using a sparser representation is cheaper in some ways <although with my critical CS hat on, I think the big-O costs are really the same, as there are order-3 space and computational costs in the linear algebra that don’t go away even when using fewer bases>
11. “… the basis construction problem may require as much or more computational effort than required to solve the original MDP.”
1. On the other hand, in the case where transfer can be utilized, this extra cost can me amortized across instances
12. Two main classes of methods are considered for basis construction in MDPs:
1. Diagonalization (of the graph Laplacian)
2. Dilation: “For example, a dilation operator on the space of functions on real numbers is Tf(x) = f(2x).  Several dilation-based approaches will be compared, including methods based on Krylov spaces, a standard approach of solving systems of linear equations […].  Applied to MDPs, this approach results in the reward function being ‘dilated’ by powers of some operator, such as the transition matrix […].  A novel basis construction method called Drazin bases is described in this paper, which uses the Drazin inverse of the Laplacian LD… the discounted value function of a MDP can be written in terms of a series of powers of the Drazin inverse.”
3. Another option is wavelets, or more specifically for this case, diffusion wavelets are also based on dilation.
13. Considers average and discounted reward cases (for average reward cases, value is called gain)
14. Mentions Laplacians for directed graphs <but I’m pretty sure this isn’t an option for arbitrary directed graphs, only special cases.>
15. In general, need to consider “generalized inverses” of the Laplacian – it may be noninvertible if it is not of full rank
1. There are 6 properties the actual inverse needs to satisfy; the Moore-Penrose pseudo inverse satisfies 4 and will be used for inverses related to value functions
2. The Drazin inverse is important in the study of Markov chains
16. <Paper is quite long – skimming most of it>
17. The graph Laplacian induces a reproducing kernel hilbert space (RKHS)
18. Approximate methods for solving MDPs are discussed: Least-squares, linear programming, Hilbert space
1. “All these approaches depend crucially on a choice of basis functions for constructing a low-dimensional representation of the MDP, and assume these are provided by the human designer. <I think topic of basis construction is addressed in a later chapter>”
19. “An elegant and general way to formalize value function approximation builds on the framework of aproximation in a type of vector space called a Hilbert space […].”
20. A Hilbert space is a vector space where length is defined as an inner product between any two vectors (for example, Euclidian space is a Hilbert space, with inner product simply being fTg.
21. For MDPs, distance has to be weighted according to frequency of visitation of states
22. If bases are orthonormal, projection to find most similar elements can be defines by abstract Fourier series
23. Least-squares approaches consist of those based on Bellman residual, and that of fixed points
1. <This stuff doesn’t work, moving along>
24. LSTD
25. LSPI
26. VI via RKHS is related to support vector regression
27. “… it is possible to show that a least-squares approximation of the value function on a basis matrix Φ is equivalent to solving an exact ‘low-dimensional’ MDP…”
28. Given bases and policy, an approximate low-D reward and transition model are defined
29. By Parr: Given bases, the exact solution to approx policy evaluation defined by low-D rewards and transitions is the same as the fixed point solution of exact policy evaluation on the full MDP projected onto the bases.
1. The computational complexity is reduced from cubic in n <the number of states I assume> to cubic in k <the number of bases I assume>
30. “For continuous <state> MDPs, the basis functions are represented on a set of ‘sampled’ states.”
31. Considers cases of reward-aware and reward-agnostic basis functions
32. Construction of bases according to policies is done during RPI
1. “It is possible to make the policy-specific bases additionally customized to a particular reward, or indeed, to have them invariant to the reward function.  At every new round of policy iteration, a new set of bases will have to be generated.  Consequently, in this formulation, the bases constructed have a fairly short life span, and multiple sets of bases have to be constructed for solving just a single MDP. On the other hand, by tuning the bases to a specific policy and reward function, it is possible to develop some theoretical guarantees on how well they can approximate Vπ.”
33. There are also incremental (one basis at a time) and batch methods (all at once) of basis construction
1. Bases may also be constructed incrementally from samples taken from the MDP
34. One method of basis construction is state aggregation
1. Discusses clustering according to Bellman residual – error bounds are given
35. Onto “Finding Invariant Subspaces of a MDP”
1. “Unfortunately, transition matrices Pa are often not reversible, in which case, diagonalization leads to complex eigenvectors. One solution is to replace Pa with a related stochastic matrix, i.e., reversible and whose eigenspaces provide a suitable space to compress a MDP.”
1. On the other hand “For the sub-class of diagonalizable transition matrices represented by reversible Markov chains, the transition matrix is not only diagonalizable, but there is also an orthonormal basis.”
36. <Timing out on this one – think I basically get the idea from his other works>

## Grid Cells, Place Cells, and Geodesic Generalization for Spatial Reinforcement Learning. Gustafson, Daw. PLOS Computational Biology. 2011.

1. How does representation make learning possible in spatial navigation.
1. In particular, considering (activity of pyramidal) hippocampal place and grid cells can contribute to estimates of a value function
2. As opposed to localization, here transfer between earlier and present experience is considered
3. “Accordingly, although neural populations collectively offer a precise representation of position, our simulations of navigational tasks verify the suggestion that RL gains efficiency from the more diffuse tuning of individual neurons, which allows learning about rewards to generalize over longer distances given fewer training experiences.”
4. As opposed to Euclidian distances that may not respect the manifold of the domain (and cause problems in the case of boundaries), a geodesic distance is what is useful
5. The brain has at least 2 representations of location:
1. “Hippocampal place cells fire when a rat passes through a confined roughly concentric, region of space…”
2. “… grid cells of the dorsomedial enthrohinal cortex (dMEC) discharge at vertices of regular triangular latices […].”
6. Most studies based on place and grid cells consider what type of stimulus/environment makes them active.  Here, the consideration is how does the brain use that activity to organize behavior
7. “Importantly, this exercise views the brain’s spatial codes less as a representation for location per se, and instead as basis sets for approximating other functions across space.”
1. That is, activity of place cells alone, would be capable of producing value functions and policies, but is there something about the combination of place and grid cells that makes path planning even easier (or in particular, producing value functions that respect topography, as discontinuities in dynamics lead to discontinuities in value)
8. Activity of place and grid cells are dependent somewhat on characteristics of domain
9. In simulated experiments, there were grid worlds with different numbers of boundaries.  Start positions were randomized, but goal position always remained the same
1. To simulate place cells, Gaussian basis functions were used
2. To simulate grid cells, sine waves were used
10. Naturally, the tabular approach was the worst (no generalization).  In the simplest domain the place cells are clearly better than grid cells, which are clearly better than tabular.  In the hardest domain, the performance of grid+place cells are equivalent and still better than tabular
11. But, looking at the value functions, it is clear that goodness from the goal is “bleeding” across boundaries in a way that is not appropriate
12. Because of the overgeneralization, in another set of more complex tasks, the tabular representation does better
13. To fix this, and so that basis functions would respect topology, points were assigned new x-y coordinates, basically by running connectivity through ISOMAP
1. After doing this, there wasn’t spillage across boundaries.
14. <Although they use shortest geodesic distance, there is no reason why that would be the only method that would produce these results.  Basically, you just need something that respects the fact that you cant cross walls (for example a random walk also respects this>
15. <I’m not sure why, but they keep comparing tabular vs grid vs place cells.  The brain has the latter two together, so why not show the results of their combined activity?  Maybe there is also some prevention of spillage across boundaries in the naive case when used together, or something else interesting….>
16. There are results from place cell activity that respect domain topology w.r.t. doors/boundaries
1. This is true also when the domain is nonstationary/changes during activity.  This can even cause new place cell activity
17. “One of the hallmarks of model-based planning (and the behavioral phenomena that Tolman  used to argue for it, albeit not subsequently reliably demonstrated in the spatial domain), is the ability to plan novel routes without relearning, e.g. to make appropriate choices immediately when favored routes are blocked or new shortcuts are opened. Interestingly, rather than by explicit replanning, some such behaviors could instead be produced more implicitly by updating the basis functions to reflect the new maze, while maintaining the weights connecting them to value. This is easy to demonstrate in the successor representation , a model closely related to ours.”

## Constructing Basis Functions from Directed Graphs for Value Function Approximation. Johns, Mahadevan. ICML 2007.

<I discussed this with Dr. Professor Edi and he confirmed that although the formulation allows for the representation of some directed graphs, it cannot represent arbitrary directed graphs, so it won’t properly represent dynamics in general… although its probably better than trying to shoehorn an undirected approach for a directed graph.  Fundamentally, for the general case, you get complex results from the eigen decomposition.>

1. Graph Laplacian for directed graphs, where adjacent (in one direction) states may have very different value functions.
2. “We provide an analysis using the Dirichlet sum of the graph Laplacian to show how the smoothness of the basis functions is affected by the graph’s invariant distribution.”
3. Standard graph laplacian only works when weights are symmetric
4. Graph must be strongly connected and aperiodic – pretty mild assumptions
5. Equations for the combinatorial and normalized directed Laplacians are not terribly complex
1. <I take that back – it needs something called the Perron vector of the transition matrix, but it requires this weird manipulation that pushes the actual transition distribution in the direction of the uniform distribution.  Its not totally clear if this always has to be done, or just in special-case graphs.  Need to learn more before I make a decision about it.  Here its computed by an iterative update.>
6. <Its not yet clear to me how much this actually fixes the problem – would have to spend more time, because in the end you still create an undirected graph, which seems like it shouldn’t be able to represent everything in a directed graph?>
7. “The undirected Laplacian is a special case of the directed Laplacian.”
8. Test on 2-room gridworld (where rooms are connected by directional doors), inverted pendulum (which is done by basically keeping track of balls around sampled points)
9. <Results aren’t very impressive – performance in pendulum is flat for 30-50 basis functions, and is about 20% below optimal.>

## Analyzing Feature Generation fro Value-Function Approximation. Parr, Painter-Wakefield, Li, Littman. ICML 2007

1. Deals with constructing functions for doing VFA based on adding dimensions that minimize the Bellman residual of the current representation
2. Gives a nod to the proto-value function paper
3. These basis functions are called Bellman Error Basis Functions (BEBFs)
4. “… BEBFs form an orthonormal basis with guaranteed improvement in approximation quality at each iteration.”
5. Representing the Bellman error can be just as hard as representing the value function, so this makes it seem like one hard problem is just being replaced with another one, but they show “… approximation quality can still improve even if there is significant error in the estimate of the Bellman error.”
6. Paper deals with either the setting where the MDP is known a-priori, or when the algorithm has to run off of live samples.  Naturally the first case is easier, and error is monotonically reducing
7. Much of the paper assumes Markov chains and not MDPs
8. Consider case where looking for a weighing of some set of basis functions φ
1. Matrix Φ is formed with φ spanning columns and S spanning the rows
9. The basic idea is to add another φ to the current set of k basis functions is to set φ k+1 = TV-V

1. In general this will be difficult to represent so some approximation must be used
2. There is a trick to getting something close to this, is impacted by γ
10. “Graph-based methods have the advantage of being less directly connected to the reward and value function.  The basis functions produced by these methods are derived from connectivity properties of the state space and could, potentially, be more generally applicable to a wide class of  problems with similar state space connectivity but different reward functions.  Graph-based methods have the disadvantage of being less directly connected to the reward and value functions.  The disconnect from the reward and value function makes it difficult to guarantee a specific amount of progress with each iteration.”
11. Mentions another similar approach called matching pursuits
12. In the first batch of results they just use the full model, which is admittedly not so useful as you can just solve the MDP for basically the same cost
13. In the second set of results they build on LSPI
1. LSPI normally passes on its basis functions to LSTD, which is run |A|  times.  Here, new basis functions are recomputed at each policy evaluation step in LSPI
14. <End up needing a pretty large number of basis functions, but they probably could have gotten by with less if they accepted more error>

## Low Complexity Proto-Value Function Updating with Incremental Slow Feature Analysis. Luciw, Schmidhuber. EWRL 2012.

1. Paper uses incremental (as opposed to batch) slow feature analysis (IncSFA) for learning proto-value functions
2. POMDP paper
3. “Since they [proto-value functions] provide a good basis for learning different reward functions, they are good building blocks for general reinforcement learning.”
4. Uses TD to do VFA on top of the features developed by ISFA
5. A previous work (Sprekeler 2011, in my reading list) “… showed how SFA can be considered a function approximation to learning PVFs [proto-value functions] (…), so slow features (SFs) can have the same set of beneficial properties for representation learning for general RL .”
6. “In this paper, we show how IncSFA can be used as a low complexity method for an autonomous RL agent to learn features of a spectral (graph)-embedding of its environment, which it experiences indirectly through high-dimensional observations (images).”
7. Goal is to find a set of basis functions that can represent the value function well over a wide class of reward functions
8. <I am confused as it says phi makes up the basis functions for estimating value, but then t says phi maps the observation vector x to y – I guess there is another dot product somewhere that turns y into a scalar?>
9. Proto-value functions are intended to “… capture global dynamic characteristics of the MDP in a low dimensional space.”  The idea is to find some basis that preserves distance between states.
10. In Proto-reinforcement learning, the basis comes from the graph Laplacian
11. SFA has a slightly different approach which is to look for ways of weighing the basis functions so that the output signal changes slowly
12. Equivalence of PVFs and SFs. [their emphasis] Sprekeler (2011) showed that two objectives, of SFA (slowness) and LEM [Laplacian EigenMap] (nearness), are mathematically equivalent under some reasonable (i.e., when we collect data from a random walk on an MDP) assumptions.  For intuition, note that saying two observations have a high probability of being successive in time is another way of saying that the two underlying states have a high probability of being connected  In the LEM formulation, the neighbor relationships are explicit (through the adjacency matrix), but in SFA’s they are implicit (through temporal succession).”
13. [Immediately next paragraph] “Note that SFA additionally depends on function space F.  The main reason the slow features (SFs) are approximations of the PVFs is due to this function space, and what information it captures and what information it lacks.”
14. IncSFA doesn’t build a covariance matrix, even for temporary use between data points
15. The math ends up approximating the Laplacian according to a random walk
16. IncSFA is much more computationally efficient than Laplacian EigenMap or batch SFA – in these two the computation is dominated by the Eigen decomposition, which is cubic
17. On the other hand, as a batch method it uses each data point less efficiently
18. The first experiment is a 20-state chain where the state representation is a zero-vector of length 20, with the current state changing value to a 1.
1. Features extracted after training is very similar to Laplacian EigenMap (basically sinusoids that are slightly time-shifted but of very similar wavelength)
2. The learning times, are very long, however – on the order of 10,000 or 20,000 samples
3. <I don’t think they mention the basis functions used here>
19. The second experiment is a 2D task with nonstationary reward.
20. <In both experiments actions are taken randomly, so would exponential experience to get adequate coverage in some domains.>
21. The third experiment is another 2d navigation task. Again embedding of Laplacian EigenMap looks essentially the same as IncSFA
22. “The slow features are approximately proto-value functions, therefore graphical embedding can be done with linear complexity via IncSFA without constructing a graph (or a covariance matrix).”
23. Linear time and space requirements, and that its online add to biological plausibility of the method.