- <Ok I’m basically copying the abstract verbatim here – thats how you know its a good abstract>
- “This paper introduces a novel
*spectral*framework for solving Markov decision processes (MDPs) by jointly learning representations and optimal policies.” - The components introduced are:
- “A general scheme for constructing representations or
*basis functions*by diagonalizing symmetric diffusion operators” - “A specific instantiation of this approach where global basis functions called
*proto-value functions*(PVFs) are formed using the eigenvectors of the*graph Laplacian*on an undirected graph formed from state transitions induced by the MDP” - “A three-phased procedure called
*representation policy iteration*comprising of a sample collection phase, a representation learning phase that constructs basis functions from samples, and a final parameter estimation phase that determines an (approximately) optimal policy within the (linear) subspace spanned by the (current) basis functions.” - “A specific instantiation of the RPI framework using least-squares policy iteration (LSPI) as the parametric estimation method”
- Methods for scaling to large discrete and continuous spaces

- “A general scheme for constructing representations or
- The experiments are conducted in discrete and continuous domains
*Diffusion model**s*are a core part of the method presented based on the graph Laplacian- This is done by finding the eigenvectors that correspond to the smallest eigenvalues
- These eigenvectors are supposed to capture the long term temporal properties of a process. They are similar to value functions. <I think the relationship between the two can be quite loose>

- Laplacian bases can be used with algorithms like Q-Learning and LSPI
- “The fundamental idea is to construct basis functions for solving MDPs by diagonalizing symmetric diffusion operators on an empirically learned graph representing the underlying state space. A diffusion model is intended to capture information flow on a graph or a
*manifold*. A simple diffusion model is a random walk on an undirected graph, where the probability of transitioning from a vertex (state) to its neighbor is proportional to its degree…” - “… it can be significantly easier to estimate a ‘weak’ diffusion model, such as the undirected random walk
*P_r*or the combinatorial Laplacian*L*, than to learn the true underlying transition matrix*P*^π of a policy π.” <I think “can” is the operative word here – sometimes things like a random walk will not accurately learn dynamics of the domain.> - [Immediately following] “The proposed framework can be viewed as automatically generating subspaces on which to project the value function using spectral analysis of operators on graphs.” This is different from other methods that simply try to find weights for pre-supplied basis vectors, or even other more sophisticated approaches of learning basis functions based on trajectories or Bellman residual.
- Basis functions are based on spectral analysis/diffusion operators that don’t take reward into account
- Mentions approaches that
*do*take reward into account

- Mentions approaches that
- |
*S*| eigenvectors of course can represent the exact value function – the trick is to try and get a set of eigenvectors that is much smaller than that but can still lead to little error in the value function - “By projecting a given value function on the space spanned by the eigenvectors of the graph Laplacian, the ‘spatial’ content of a value function is mapped into a ‘frequency’ basis, a hallmark of classical ‘Fourier’ analysis (Mallat, 1989).”
- The approach here (Representation Policy Iteration) attempts to correct weaknesses of more traditional approaches that use VFA, by interleaving representation and policy learning
- The method is supposed to recover underlying manifolds – this occurs frequently in physics based domains (such as the inverted pendulum).
- “… the problem [of finding basis functions] is one of dimensionality reduction not in the data space, but on the space of functions on the data.” <Not exactly sure what that means>
- The method here models only reachable space (this is what the manifold does) “… it also models local
*non-uniformity*of a given region.” By modeling density across regions - “The additional power obtained from knowledge of the underlying state graph or manifold comes at a potentially significant cost: the manifold representation needs to be learned, and furthermore, basis functions need to be computed from it. Although our paper demonstrates that eigenvector-type basis functions resulting from a diffusion analysis of graph-based manifolds can solve standard benchmark discrete and continuous MDPs, the problem of efficiently learning manifold representation of
*arbitrary*MDPs is beyond the scope of this introductory paper. We discuss a number of outstanding research questions in Section 9 that need to be addressed…”- <
**I thought one of the stated benefits of the approach is that it learns how do to this itself**>

- <
- In Fourier analysis, bases are based on frequency but not time or space. Similarly, eigenfunctions of the graph Laplacian are also “localized in frequency by … λ, but their support is in general the whole graph.”
- The question, though is whether Laplacian bases can be dealt with in various types of domains.
- “…large factored spaces, such as grids, hypercubes, and tori, lead naturally to
*product spaces*for which the Laplacian bases can be constructed efficiently using*tensor products*.” - There are also ways of working in continuous domains, although it involves stuff I’m not familiar with: “low-rank approximations and the Nystrom interpolation method…”

- “…large factored spaces, such as grids, hypercubes, and tori, lead naturally to
- The basic structure of representation policy iteration (RPI) is a repeated cycle of sample collection, basis construction, and policy learning
- From the samples collected during that phase, a (potentially directed) graph is constructed. In continuous spaces this can be done according to some distance metric (knn is mentioned). Samples can be collected on or off policy
- Based on this, a proto-value function is computed using some graph operator, such as “the combinatorial or normalized Laplacian” <whats the difference?>

- Action encoding is an extra step (can just do |A| copies of the representation produced by the graph operation)
- This is necessary because of the Markov chain representation used and not MDP for the Laplacian

- Stressed that what they outline here is one instantiation of a more general approach they are advocating
- They are able to very accurately reproduce the value function for the 2-room problem, <
**although they need 20 eigenvectors and the full problem is only 100 individual states, so you aren’t really saving much**>- They show the first 4 eigenvectors/Laplacian basis functions, the first two are basically piecewise constant for the first room or the second, and the second 2 carve each of those individual rooms into another half
- Its neat, seems like a Fourier decomposition that also respects the topology

- Use LSPI
- <gamma is lower than what is commonly used at 0.8, although in this case it doesn’t impact the optimal policy. This value may have been selected to get a steeper slope of the value function in the graphs, but there is another worrying possibility that a lower value was used because they lead to more stable behavior when VFA is used,>

- “[my emphasis] Value functions generally exhibit two key properties: they are typically
*smooth*, and they reflect the underlying state geometry.”- <Although it is certainly not true that value functions are smooth in terms of normal representations, they claim> smoothness exists in terms of the state space graph – based on the “Sobolev Norm”
- The trick is to set up basis functions that defines closeness as closeness in terms of the optimal value function

- Each basis function (there are usually many) is a mapping from states to reals.
- “The basis function matrix Φ is an |
*S*|x*k*matrix, where each column is a particular basis function evaluated over the state space, and each row is the set of all possible basis functions evaluated on a particular state.” Then of course the value function is approximated by*V*=Φ*W*

- “The basis function matrix Φ is an |
- This paper actually goes a little bit into Value and Hilbert spaces. I’ve seen stuff built on this before, but not really the math behind it
- VFA is a problem of best approximation in Hilbert space
- Distance is basically an inner product between value functions
- If the bases are orthonormal, “… the best approximation of the value function
*V*^π can be expressed by its projection onto the space spanned by the basis functions…”

- “Our approach to the problem of control learning involves finding a suitable set of basis functions by diagonalizing a learned diffusion model for policy evaluation and improvement. We first introduce the
*Fourier*approach of finding basis functions by diagonalization, and then describe how diffusion models are used as a substitute for transition models.” - In some cases the reward function can interact with the eigenvectors poorly (if they are orthonormal), but in they domains considered here this issue did not arise
- Since the Markov chain depends on a policy that is changing, they advocate for diffusion models <based on random walks>
- A random walk matrix
*P_r*can be decomposed s.t.*P_r = D*^{-1}*W*- In the example here it is not symmetric but it is reversible <need to pay attention to how this works in the non-reversible case>
*“… P_r = D*^{-1}*W*is called a diffusion model because given any function*f*on the underlying graph*G*, the powers of*P^t_rf*determine how quickly the random walk will ‘mix’ and converge to the long term distribution.” Ah so its the same as the mixing time that I’m familiar with

- The more general case of directed graphs is discussed later
- “A fundamental property of the graph Laplacian is that projections of functions on the eigenspace of the Laplacian produce the smoothest global approximation respecting the underlying graph topology…”
- “The difference between the combinatorial and normalized Laplacian is that the latter models the degree of a vertex as a local measure.” <what does that mean>
- VFA via LSPI
- Experimental results on small discrete MDPs
- Theres a pretty strong dichotomy in the quality of policies – when 10 or 15 basis functions are used the policy is esentially equivalently bad; for 20 or 35 it is equivalently good
- <This means that you need a certain number of eigenvectors to represent a good policy, and if you have fewer than that, looks like you are in trouble. Would be more comforting to see a smoother transition in terms of policy quality as the number of bases increases>

- Then move on to larger discrete and continuous spaces
- Consider domains with factored state
- In a
*d*-dimensional space where each dimension has*k*options, go from an O(*k^d*) representation to some other representation that is O(*dk*^2) <not sure how that works yet – they must be throwing something out> - There are a number of ways of building complex graphs from simpler ones. Here focus is on Kronecker (or tensor) sum of graphs
- It sort of allows you to compose graphs that cover higher dimensional spaces by using lower dimensional ones. For examples, 2 chains can be used to make a grid (and furthermore into hypercubes/grids), or a chain and a loop can be used to compose a cylinder.

- “A standard result that follows from the above theorem shows that the combinatorial graph Laplacian of a ronecker sum of two graphs can be computed from the Laplacian of each subgraph.”
- Derive a update rule for factored RPI/LSPI in “structured domains when the basis functions can be represented as Kronecker products of elementary basis functions on simpler state spaces. Basis functions are
*column*eigenvectors of the diagonalized representation of a graph operator, whereas embeddings φ(s) are*row*vectors representing the first*k*basis functions evaluated on state*s.*By exploiting the property that (*A*xor*B*)^*T*=*A^T*xor*B^T*, it follows that embeddings for structured domains can be computed as the Kronecker products of embeddings for the constituent state components. As a concrete example, a grid world domain of size*m*x*n*can be represented as a graph*G*=*G*⊕_{m}*G*where_{n}*G*_{m}*G*_{n}*path graphs*of size*m*and*n*, respectively.”- The basis functions can then be written in terms of eigenvectors of
*m*and*n*

- The basis functions can then be written in terms of eigenvectors of
- They compare this against RBFs in a cooperative multiagent domain, where RBFs learn more quickly (about 3x as fast) but converge to a poorer result (takes 2x as long to reach goal)
- <
**The obvious question then is, if you put in more RBFs, the learning will slow down but will converge to something better – what happens when more are used?**>

- <
- Now onto continuous domains
- “The eigenfunctions of the Laplacian can only be computed and stored on sampled real-valued states, and hence must be interpolated to novel [unexperienced] stated. We apply the Nystrom interpolation method.”
- It has been used before for kernel methods and spectral clustering, but here the aplication to control is novel

- “The Nystrom method interpolates the value of eigenvectors computed on sample states to novel states, and is an application of a classical method used in the numerical solution of integral equations (…).”
- Starts out with an integral but at the end uses a summation

- “Note that they Nystrom method does
*not*require recalculating eigenvectors-inessence, the embedding of a new state is computed by averaging over the already computed embeddings of ‘nearby’ states.” - Now onto continuous stuff
- They attach nodes together in knn fashion, where edge weights are based on a distance metric
- This results in an asymmetric weight matrix, so an extra step is taken to make it symmetric
- And extra steps must be taken in some cases depending on how the data step is constructed
- They test in mountain car, acrobot, and inverted pendulum
- Results are better when data comes from on-policy trajectories than random trajectories
- For mountain car they use the same start state but randomize starting trajectories
- <This was probably done through necessity – random policies take a very long time to reach the goal when initialized with 0 velocity>

- “In the Acrobot task, the best results were obtained for 25 and 100 PVFs, and significantly poorer results for 50 PVFs.”
- Similar nonlinartites in terms of # of performance as a function of # of bases in other domains
- <
**This is bad and should not happen- I wonder if they will give an interpretation of whats going on.**Its not like one way learns faster but converges to a poorer result either – some are just better than others. This is most pronounced in acrobot, although mountain car has a weird spike (in the bad direction) in mountain car for the mid-number of PVFs early in training.> - Sensitivity to k in knn is not bad

- In some tasks the combinatorial Laplacian is better and others the normalized Laplacian is better
- Bunch of references for spectral learnning
- “… these methods have largely (but not exclusively) been applied to nonlinear dimensionality reduction and semi-supervised learning on graphs, whereas our work focuses on approximating (real-valued) value functions on graphs.” And VFAs are much more complex than the normal sort of function approximation

- Future work:
- Simultaneously learning basis functions and policies
- Optimality and convergence guarantees?
- Are there cases where on-policy sampling results in some sort of oscilation?

**Directed graphs are left as future work**, <although oddly it doesn’t seem to make computation that much worse – maybe there is more to it than the one equation they give, or why wouldn’t they just use that here when it makes so much more sense>- “In this paper, proto-value functions were constructed by diagonalization, that is by finding eigenvectors, of a symmetrized diffusion operator such as the Laplacian on an undirected graph. Formally, such eigenvectors are essentially global Fourier bases and their properties have been extensively studied in Euclidean spaces (Mallat, 1989). One well-known limitation of global Laplacian bases is that they are poor at representing piecewise linear (value) functions. We have extended the approach presented in this paper to construct multiscale diffusion bases, using the recently proposed diffusion wavelet framework (Coifman and Maggioni, 2006; Bremer et al., 2006). Diffusion wavelets provide an interesting alternative to global Fourier eigenfunctions for value function approximation, since they encapsulate all the traditional advantages of wavelets (Mallat, 1989): basis functions have compact support, and the representation is inherently hierarchical since it is based on multi-resolution modeling of processes at different spatial and temporal scales.”
- Mention possibility of reward-sensitive PVF and some related work.
- And then “A more direct way to incorporate reward-sensitive information into PVFs is to modify the weight matrix W to take into account the gradient of the value function to be approximated. Formally , this approach is similar to estimating a function by knowing not only its values at sample points, but also its gradient. Of course, any errors in the estimation of such gradients will then be reflected in the weight matrix, and such an approach is not also without some drawbacks. While making bases sensitive to rewards can lead to superior results, if the reward function or policy is modified, reward-sensitive basis functions would need to be re-learned.“
- Possibility of learning state-action graphs, instead of doing it uniformly across states from a random policy and then copying one of these for each action.
**Semi-MDPs get a mention**- <Overall, the methods in the paper seem extremely complex when considering the entire system, and empirical results aren’t outstanding when considering other methods such as VFA with RBFs, which is usually pretty terrible. There is a ton of information, and the methods touch on a huge number of different areas, but thats more of a problem than a good thing. If you can get similar results with methods where you don’t need detailed knowledge from so many different areas, you should use the simpler method. I could spend months mining this paper for references and ideas and learning good stuff though.>