## Basis Function Construction for Hierarchical Reinforcement Learning. Osentoski, Mahadevan. AAMAS 2010.

1. For semi-MDPs
2. “By combining methods for automatic basis construction with HRL methods, such as MAXQ, we show how a hierarchical function approximator can be automatically formed based on a specific task hierarchy.”
3. Spectral (graph Laplacian) approach that works from trajectory information (nodes are states and edges are actions – assumes deterministic?)
4. Taxi, VFs can be constructed by conditioning on particular elements of feature vector (like a decision tree)
6.  “Our approach is divided into three steps.  The first step constructs a graph for each subtask from the agent’s experience.  We show how graph construction can leverage the abstractions provided with the task hierarchy.  The second step constructs a reduced graph based upon the graph structure.  The third step recursively constructs basis functions using basis functions from child subtasks as well as using the eigenvectors of the graph Laplacian from the reduced graph.”
7. Because spectral methods are undirected, they just keep track of state transitions and basically throw out the actions
8. There is a simple method of merging vertices (called a reduced graph).
9. Then eigenvectors of this <?>
10. Consider 3 types of abstraction:
1. Subtask irrelevance: eliminate state variables that are irrelevant to subtask.  This occurs from the reduced graph construction: “If the state variables in Y_i have no bearing on the probability transition function, they will be irrelevant in terms of connectivity on the graph and only X_i will be used to represent the state variables.”
2. Shielding: If there is some state that causes a path to some option to terminate <prematurely?> it wont be represented
3. Result distribution irrelevance: If the output depends of an option depends only on a particular feature <?>
11. Empirical results in taxi and a manufacturing domain

## Constructing Basis Functions from Directed Graphs for Value Function Approximation. Johns, Mahadevan. ICML 2007.

<I discussed this with Dr. Professor Edi and he confirmed that although the formulation allows for the representation of some directed graphs, it cannot represent arbitrary directed graphs, so it won’t properly represent dynamics in general… although its probably better than trying to shoehorn an undirected approach for a directed graph.  Fundamentally, for the general case, you get complex results from the eigen decomposition.>

1. Graph Laplacian for directed graphs, where adjacent (in one direction) states may have very different value functions.
2. “We provide an analysis using the Dirichlet sum of the graph Laplacian to show how the smoothness of the basis functions is affected by the graph’s invariant distribution.”
3. Standard graph laplacian only works when weights are symmetric
4. Graph must be strongly connected and aperiodic – pretty mild assumptions
5. Equations for the combinatorial and normalized directed Laplacians are not terribly complex
1. <I take that back – it needs something called the Perron vector of the transition matrix, but it requires this weird manipulation that pushes the actual transition distribution in the direction of the uniform distribution.  Its not totally clear if this always has to be done, or just in special-case graphs.  Need to learn more before I make a decision about it.  Here its computed by an iterative update.>
6. <Its not yet clear to me how much this actually fixes the problem – would have to spend more time, because in the end you still create an undirected graph, which seems like it shouldn’t be able to represent everything in a directed graph?>
7. “The undirected Laplacian is a special case of the directed Laplacian.”
8. Test on 2-room gridworld (where rooms are connected by directional doors), inverted pendulum (which is done by basically keeping track of balls around sampled points)
9. <Results aren’t very impressive – performance in pendulum is flat for 30-50 basis functions, and is about 20% below optimal.>

## Diffusion Kernels and Other Discrete Input Spaces. Kondor, Lafferty. ICML 2002.

1. Extends use of kernels from real-valued data to discrete data
2. Particular focus is on graphs which gets a particular type of kernel called the diffusion kernel “… which are based on the heat equation and can be regarded as the discretization of the familiar Gaussian kernel of Euclidian space.”
3. Kernels allow for a mapping into Hilbert space where the kernel is an inner product.  With respect to a basis a datapoint is mapped to many (potentially infinite) number of features – this expansion makes a number of operation simpler
4. “… adjacency graphs are sometimes used when data is expected to be confined to a manifold of lower dimensionality that the original space.”
5. Kernels must be symmetric and positive semi-definite
1. “Constructing appropriate positive definite kernels is not a simple task, and this has largely been the reason why, with a few exceptions, kernel methods have mostly been confined to Euclidian spaces…”
6. When working in discrete spaces, the most common approach with kernels has been to find some mapping from the discrete to continuous spaces (for example, transforming a text to a bag/count of words)
7. The kernels they propose for graphs they call diffusion kernels
8. Start out with a kernel for discrete data called exponential kernels
9. In discrete case, positive semi-definiteness means:
1. x ∈ Xx’ ∈ X  fx fx’ K(x,x’) ≥ 0
10. In the discrete case, for ﬁnite X , the kernel can be uniquely represented by an |X| × |X| matrix (which we shall denote by the same letter K) with rows and columns indexed by the elements of X , and related to the kernel by Kxx’ = K(x,x’).”
11. Defines the exponential of a square matrix – the exponential of any symmetric matrix is symmetric positive semi-definite so it may be used as a kernel
12. Tensor products and conjugacy <that passed me by>
13. Onto diffusion kernels on graphs
14. For a graph Γ, the heat equation H <is a matrix> s.t. Hij = :
1. 1 if i is connected by an edge to j
2. di if i=j, where di is the degree of vertex i
3. 0 otherwise
15. The negative of this is the graph Laplacian of Γ
16. H can be regarded as an approximator, and H/h2 approaches the continuous laplacian as h → 0
1. In physics this equation is used to approximate the diffusion of heat (and other substances) through a continuous media.
17. The resulting kernels HKβ are called diffusion kernels or heat kernels
18. “To the best of our knowledge, diffusion kernels have not been proposed previously for direct use in kernel-based learning algorithms.”
19. This extends simply to unweighted graphs as well as symmetric graphs
1. <But what about assymetric graphs?>
20. Equations for unweighted <I suppose undirected> graphs, closed chains, trees
1. These can be used “… as building blocks for tensor product kernels.”
2. For example, bit (or over any discrete alphabet) strings can be composed in this way to generate a Hamming distance
3. Other methods of string distances for looking for  contiguous substrings.  The kernel used approaches a result obtained by dynamic programming
21. In the first set of experiments diffusion kernels are used for classification “For such problems, it is often quite unnatural to encode the data as vectors in Euclidean space to allow the use of standard kernels. However as our experiments show, even simple diffusion kernels on the hypercube, as described in Section 4.4, can result in good performance for such data.”
1. “Data sets “having a majority of categorical variables were chose; any continuous features were ignored.”
2. Used “voted perceptron” for training <don’t know what that is>
3. Diffusion and Hamming kernels were used.  “The reduction in error rate varies, but the simple diffusion kernel generally performs well.”
4. <I’m not sure what conclusion I’m supposed to draw from these results – where are the comparisons to other algorithms?  Compares well to Hamming distance?  I would have liked to see results from other more impressive approaches.>
22. “We showed how the explicit calculation of diffusion kernels is possible for specific families of graphs, and how the kernels correspond to standard Gaussian kernels in a continuous limit.”
23. “While the tensor product construction allows one to incrementally build up more powerful kernels from simple components, explicit formulas will be difficult to come by in general.”

## Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes. Mahadevan, Maggioni. JMLR 2007.

1. <Ok I’m basically copying the abstract verbatim here – thats how you know its a good abstract>
2. “This paper introduces a novel spectral framework for solving Markov decision processes (MDPs) by jointly learning representations and optimal policies.”
3. The components introduced are:
1. “A general scheme for constructing representations or basis functions by diagonalizing symmetric diffusion operators”
2. “A specific instantiation of this approach where global basis functions called proto-value functions (PVFs) are formed using the eigenvectors of the graph Laplacian on an undirected graph formed from state transitions induced by the MDP”
3. “A three-phased procedure called representation policy iteration comprising of a sample collection phase, a representation learning phase that constructs basis functions from samples, and a final parameter estimation phase that determines an (approximately) optimal policy within the (linear) subspace spanned by the (current) basis functions.”
4. “A specific instantiation of the RPI framework using least-squares policy iteration (LSPI) as the parametric estimation method”
5. Methods for scaling to large discrete and continuous spaces
4. The experiments are conducted in discrete and continuous domains
5. Diffusion models are a core part of the method presented based on the graph Laplacian
1. This is done by finding the eigenvectors that correspond to the smallest eigenvalues
2. These eigenvectors are supposed to capture the long term temporal properties of a process.  They are similar to value functions. <I think the relationship between the two can be quite loose>
6. Laplacian bases can be used with algorithms like Q-Learning and LSPI
7. “The fundamental idea is to construct basis functions for solving MDPs by diagonalizing symmetric diffusion operators on an empirically learned graph representing the underlying state space.  A diffusion model is intended to capture information flow on a graph or a manifold.  A simple diffusion model is a random walk on an undirected graph, where the probability of transitioning from a vertex (state) to its neighbor is proportional to its degree…”
8. “… it can be significantly easier to estimate a ‘weak’ diffusion model, such as the undirected random walk P_r or the combinatorial Laplacian L, than to learn the true underlying transition matrix P^π of a policy π.”  <I think “can” is the operative word here – sometimes things like a random walk will not accurately learn dynamics of the domain.>
9. [Immediately following] “The proposed framework can be viewed as automatically generating subspaces on which to project the value function using spectral analysis of operators on graphs.”  This is different from other methods that simply try to find weights for pre-supplied basis vectors, or even other more sophisticated approaches of learning basis functions based on trajectories or Bellman residual.
10. Basis functions are based on spectral analysis/diffusion operators that don’t take reward into account
1. Mentions approaches that do take reward into account
11. |S| eigenvectors of course can represent the exact value function – the trick is to try and get a set of eigenvectors that is much smaller than that but can still lead to little error in the value function
12. “By projecting a given value function on the space spanned by the eigenvectors of the graph Laplacian, the ‘spatial’ content of a value function is mapped into a ‘frequency’ basis, a hallmark of classical ‘Fourier’ analysis (Mallat, 1989).”
13. The approach here (Representation Policy Iteration) attempts to correct weaknesses of more traditional approaches that use VFA, by interleaving representation and policy learning
14. The method is supposed to recover underlying manifolds – this occurs frequently in physics based domains (such as the inverted pendulum).
15. “… the problem [of finding basis functions] is one of dimensionality reduction not in the data space, but on the space of functions on the data.” <Not exactly sure what that means>
16. The method here models only reachable space (this is what the manifold does) “… it also models local non-uniformity of a given region.” By modeling density across regions
17. “The additional power obtained from knowledge of the underlying state graph or manifold comes at a potentially significant cost: the manifold representation needs to be learned, and furthermore, basis functions need to be computed from it.  Although our paper demonstrates that eigenvector-type basis functions resulting from a diffusion analysis of graph-based manifolds can solve standard benchmark discrete and continuous MDPs, the problem of efficiently learning manifold representation of arbitrary MDPs is beyond the scope of this introductory paper.  We discuss  a number of outstanding research questions in Section 9 that need to be addressed…”
1. <I thought one of the stated benefits of the approach is that it learns how do to this itself>
18. In Fourier analysis, bases are based on frequency but not time or space.  Similarly, eigenfunctions of the graph Laplacian are also “localized in frequency by … λ, but their support is in general the whole graph.”
19. The question, though is whether Laplacian bases can be dealt with in various types of domains.
1. “…large factored spaces, such as grids, hypercubes, and tori, lead naturally to product spaces for which the Laplacian bases can be constructed efficiently using tensor products.”
2. There are also ways of working in continuous domains, although it involves stuff I’m not familiar with: “low-rank approximations and the Nystrom interpolation method…”
20. The basic structure of representation policy iteration (RPI) is a repeated cycle of sample collection, basis construction, and policy learning
1. From the samples collected during that phase, a (potentially directed) graph is constructed.  In continuous spaces this can be done according to some distance metric (knn is mentioned).  Samples can be collected on or off policy
2. Based on this, a proto-value function is computed using some graph operator, such as “the combinatorial or normalized Laplacian” <whats the difference?>
21. Action encoding is an extra step (can just do |A| copies of the representation produced by the graph operation)
1. This is necessary because of the Markov chain representation used and not MDP for the Laplacian
22. Stressed that what they outline here is one instantiation of a more general approach they are advocating
23. They are able to very accurately reproduce the value function for the 2-room problem, <although they need 20 eigenvectors and the full problem is only 100 individual states, so you aren’t really saving much>
1. They show the first 4 eigenvectors/Laplacian basis functions, the first two are basically piecewise constant for the first room or the second, and the second 2 carve each of those individual rooms into another half
2. Its neat, seems like a Fourier decomposition that also respects the topology
24. Use LSPI
1. <gamma is lower than what is commonly used at 0.8, although in this case it doesn’t impact the optimal policy.  This value may have been selected to get a steeper slope of the value function in the graphs, but there is another worrying possibility that a lower value was used because they lead to more stable behavior when VFA is used,>
25. “[my emphasis] Value functions generally exhibit two key properties: they are typically smooth, and they reflect the underlying state geometry.”
1. <Although it is certainly not true that value functions are smooth in terms of normal representations, they claim> smoothness exists in terms of the state space graph – based on the “Sobolev Norm”
2. The trick is to set up basis functions that defines closeness as closeness in terms of the optimal value function
26. Each basis function (there are usually many) is a mapping from states to reals.
1. “The basis function matrix Φ is an |S|x k matrix, where each column is a particular basis function evaluated over the state space, and each row is the set of all possible basis functions evaluated on a particular state.”  Then of course the value function is approximated by VW
27. This paper actually goes a little bit into Value and Hilbert spaces.  I’ve seen stuff built on this before, but not really the math behind it
1. VFA is a problem of best approximation in Hilbert space
2. Distance is basically an inner product between value functions
3. If the bases are orthonormal, “… the best approximation of the value function V^π can be expressed by its projection onto the space spanned by the basis functions…”
28. “Our approach to the problem of control learning involves finding a suitable set of basis functions by diagonalizing a learned diffusion model for policy evaluation and improvement.  We first introduce the Fourier approach of finding basis functions by diagonalization, and then describe how diffusion models are used as a substitute for transition models.”
29. In some cases the reward function can interact with the eigenvectors poorly (if they are orthonormal), but in they domains considered here this issue did not arise
30. Since the Markov chain depends on a policy that is changing, they advocate for diffusion models <based on random walks>
31. A random walk matrix P_r can be decomposed s.t. P_r = D^{-1}W
1. In the example here it is not symmetric but it is reversible <need to pay attention to how this works in the non-reversible case>
2. “… P_r = D^{-1}is called a diffusion model because given any function f on the underlying graph G, the powers of P^t_rf determine how quickly the random walk will ‘mix’ and converge to the long term distribution.”  Ah so its the same as the mixing time that I’m familiar with
32. The more general case of directed graphs is discussed later
33. “A fundamental property of the graph Laplacian is that projections of functions on the eigenspace of the Laplacian produce the smoothest global approximation respecting the underlying graph topology…”
34. “The difference between the combinatorial and normalized Laplacian is that the latter models the degree of a vertex as a local measure.” <what does that mean>
35. VFA via LSPI
36. Experimental results on small discrete MDPs
37. Theres a pretty strong dichotomy in the quality of policies – when 10 or 15 basis functions are used the policy is esentially equivalently bad; for 20 or 35 it is equivalently good
1. <This means that you need a certain number of eigenvectors to represent a good policy, and if you have fewer than that, looks like you are in trouble.  Would be more comforting to see a smoother transition in terms of policy quality as the number of bases increases>
38. Then move on to larger discrete and continuous spaces
39. Consider domains with factored state
40. In a d-dimensional space where each dimension has k options, go from an O(k^d) representation to some other representation that is O(dk^2) <not sure how that works yet – they must be throwing something out>
41. There are a number of ways of building complex graphs from simpler ones.  Here focus is on Kronecker (or tensor) sum of graphs
1. It sort of allows you to compose graphs that cover higher dimensional spaces by using lower dimensional ones. For examples, 2 chains can be used to make a grid (and furthermore into hypercubes/grids), or a chain and a loop can be used to compose a cylinder.
42. “A standard result that follows from the above theorem shows that the combinatorial graph Laplacian of a ronecker sum of two graphs can be computed from the Laplacian of each subgraph.”
43. Derive a update rule for factored RPI/LSPI in “structured domains when the basis functions can be represented as Kronecker products of elementary basis functions on simpler state spaces.  Basis functions are column eigenvectors of the diagonalized representation of a graph operator, whereas embeddings φ(s) are row vectors representing the first k basis functions evaluated on state s.  By exploiting the property that (A xor B)^T = A^T xor B^T, it follows that embeddings for structured domains can be computed as the Kronecker products of embeddings for the constituent state components.  As a concrete example, a grid world domain of size mn can be represented as a graph GG m ⊕ Gn where m and Gn are path graphs of size m and n, respectively.”
1. The basis functions can then be written in terms of eigenvectors of m and n
44. They compare this against RBFs in a cooperative multiagent domain, where RBFs learn more quickly (about 3x as fast) but converge to a poorer result (takes 2x as long to reach goal)
1. <The obvious question then is, if you put in more RBFs, the learning will slow down but will converge to something better – what happens when more are used?>
45. Now onto continuous domains
46. “The eigenfunctions of the Laplacian can only be computed and stored on sampled real-valued states, and hence must be interpolated to novel [unexperienced] stated.  We apply the Nystrom interpolation method.”
1. It has been used before for kernel methods and spectral clustering, but here the aplication to control is novel
47. “The Nystrom method interpolates the value of eigenvectors computed on sample states to novel states, and is an application of a classical method used in the numerical solution of integral equations (…).”
1. Starts out with an integral but at the end uses a summation
48. “Note that they Nystrom method does not require recalculating eigenvectors-inessence, the embedding of a new state is computed by averaging over the already computed embeddings of ‘nearby’ states.”
49. Now onto continuous stuff
50. They attach nodes together in knn fashion, where edge weights are based on a distance metric
51. This results in an asymmetric weight matrix, so an extra step is taken to make it symmetric
52. And extra steps must be taken in some cases depending on how the data step is constructed
53. They test in mountain car, acrobot, and inverted pendulum
54. Results are better when data comes from on-policy trajectories than random trajectories
55. For mountain car they use the same start state but randomize starting trajectories
1. <This was probably done through necessity – random policies take a very long time to reach the goal when initialized with 0 velocity>
56. “In the Acrobot task, the best results were obtained for 25 and 100 PVFs, and significantly poorer results for 50 PVFs.”
1. Similar nonlinartites in terms of # of performance as a function of # of bases in other domains
2. <This is bad and should not happen- I wonder if they will give an interpretation of whats going on. Its not like one way learns faster but converges to a poorer result either – some are just better than others.  This is most pronounced in acrobot, although mountain car has a weird spike (in the bad direction) in mountain car for the mid-number of PVFs early in training.>
3. Sensitivity to k in knn is not bad
57. In some tasks the combinatorial Laplacian is better and others the normalized Laplacian is better
58. Bunch of references for spectral learnning
1. “… these methods have largely (but not exclusively) been applied to nonlinear dimensionality reduction and semi-supervised learning on graphs, whereas our work focuses on approximating (real-valued) value functions on graphs.”  And VFAs are much more complex than the normal sort of function approximation
59. Future work:
1. Simultaneously learning basis functions and policies
2. Optimality and convergence guarantees?
3. Are there cases where on-policy sampling results in some sort of oscilation?
60. Directed graphs are left as future work, <although oddly it doesn’t seem to make computation that much worse – maybe there is more to it than the one equation they give, or why wouldn’t they just use that here when it makes so much more sense>
61. In this paper, proto-value functions were constructed by diagonalization, that is by finding eigenvectors, of a symmetrized diffusion operator such as the Laplacian on an undirected graph. Formally, such eigenvectors are essentially global Fourier bases and their properties have been extensively studied in Euclidean spaces (Mallat, 1989). One well-known limitation of global Laplacian bases is that they are poor at representing piecewise linear (value) functions. We have extended the approach presented in this paper to construct multiscale diffusion bases, using the recently proposed diffusion wavelet framework (Coifman and Maggioni, 2006; Bremer et al., 2006). Diffusion wavelets provide an interesting alternative to global Fourier eigenfunctions for value function approximation, since they encapsulate all the traditional advantages of wavelets (Mallat, 1989): basis functions have compact support, and the representation is inherently hierarchical since it is based on multi-resolution modeling of processes at different spatial and temporal scales.”
62. Mention possibility of reward-sensitive PVF and some related work.
63. And then “A more direct way to incorporate reward-sensitive information into PVFs is to modify the weight matrix W to take into account the gradient of the value function to be approximated. Formally , this approach is similar to estimating a function by knowing not only its values at sample points, but also its gradient. Of course, any errors in the estimation of such gradients will then be reflected in the weight matrix, and such an approach is not also without some drawbacks. While making bases sensitive to rewards can lead to superior results, if the reward function or policy is modified, reward-sensitive basis functions would need to be re-learned.
64. Possibility of learning state-action graphs, instead of doing it uniformly across states from a random policy and then copying one of these for each action.
65. Semi-MDPs get a mention
66. <Overall, the methods in the paper seem extremely complex when considering the entire system, and empirical results aren’t outstanding when considering other methods such as VFA with RBFs, which is usually pretty terrible.  There is a ton of information, and the methods touch on a huge number of different areas, but thats more of a problem than a good thing.  If you can get similar results with methods where you don’t need detailed knowledge from so many different areas, you should use the simpler method.  I could spend months mining this paper for references and ideas and learning good stuff though.>