Learning Representation and Control in Markov Decision Processes: New Frontiers. Mahadevan

1. Approach is to solve MDPs by iteratively considering low-D reps and approx optimal policies
2. “Exact solutions of discounted and average-reward MDPs are expressed in terms of a generalized spectral inverse of the Laplacian called the Drazin inverse.”
3. Algorithm is called representation policy iteration (RPI)
1. Model-free and model-based versions are given
4. “Two approaches for dimensionality reduction of MDPs are described based on geometric and reward-sensitive regularization, whereby low-dimensional representations are formed by diagonalization or dilation of Laplacian operators.”
5. “… the aim is to solve MDPs by automatically finding ‘lowdimensional’ descriptions of ‘high-dimensional’ functions on a state (action) space.”
1. The relevant functions are the policy, transitions, and value
6. Method tries to find representations based on the underlying manifold dynamics
1. Means method is usable in continuous (state) problems
7. In the Laplacian off-diagonal elements are nonpositive, and row sums are 0
1. Means they are singular and therefore in general don’t have a direct inverse
2. Moore-Penrose pseudoinverse is the most common method when direct inverse isn’t possible
3. But something called the Drazin inverse is for spectral cases and is very important in study of Markov chains
8. In particular, for an MDP M with transition matrix T, care about
1. A=I-T
2. As well as the Drazin inverse of A
3. But then the go on to call AL, and T, P, so L=I-P
9. Getting a good low-dimensional representation depends on finding the right basis functions that span the underlying manifold
10. Deciding what basis to use is fundamentally about making tradeoffs – for example, in a discrete domain, using tabular methods will yield optimal results but will be expensive and noncompact; using a sparser representation is cheaper in some ways <although with my critical CS hat on, I think the big-O costs are really the same, as there are order-3 space and computational costs in the linear algebra that don’t go away even when using fewer bases>
11. “… the basis construction problem may require as much or more computational effort than required to solve the original MDP.”
1. On the other hand, in the case where transfer can be utilized, this extra cost can me amortized across instances
12. Two main classes of methods are considered for basis construction in MDPs:
1. Diagonalization (of the graph Laplacian)
2. Dilation: “For example, a dilation operator on the space of functions on real numbers is Tf(x) = f(2x).  Several dilation-based approaches will be compared, including methods based on Krylov spaces, a standard approach of solving systems of linear equations […].  Applied to MDPs, this approach results in the reward function being ‘dilated’ by powers of some operator, such as the transition matrix […].  A novel basis construction method called Drazin bases is described in this paper, which uses the Drazin inverse of the Laplacian LD… the discounted value function of a MDP can be written in terms of a series of powers of the Drazin inverse.”
3. Another option is wavelets, or more specifically for this case, diffusion wavelets are also based on dilation.
13. Considers average and discounted reward cases (for average reward cases, value is called gain)
14. Mentions Laplacians for directed graphs <but I’m pretty sure this isn’t an option for arbitrary directed graphs, only special cases.>
15. In general, need to consider “generalized inverses” of the Laplacian – it may be noninvertible if it is not of full rank
1. There are 6 properties the actual inverse needs to satisfy; the Moore-Penrose pseudo inverse satisfies 4 and will be used for inverses related to value functions
2. The Drazin inverse is important in the study of Markov chains
16. <Paper is quite long – skimming most of it>
17. The graph Laplacian induces a reproducing kernel hilbert space (RKHS)
18. Approximate methods for solving MDPs are discussed: Least-squares, linear programming, Hilbert space
1. “All these approaches depend crucially on a choice of basis functions for constructing a low-dimensional representation of the MDP, and assume these are provided by the human designer. <I think topic of basis construction is addressed in a later chapter>”
19. “An elegant and general way to formalize value function approximation builds on the framework of aproximation in a type of vector space called a Hilbert space […].”
20. A Hilbert space is a vector space where length is defined as an inner product between any two vectors (for example, Euclidian space is a Hilbert space, with inner product simply being fTg.
21. For MDPs, distance has to be weighted according to frequency of visitation of states
22. If bases are orthonormal, projection to find most similar elements can be defines by abstract Fourier series
23. Least-squares approaches consist of those based on Bellman residual, and that of fixed points
1. <This stuff doesn’t work, moving along>
24. LSTD
25. LSPI
26. VI via RKHS is related to support vector regression
27. “… it is possible to show that a least-squares approximation of the value function on a basis matrix Φ is equivalent to solving an exact ‘low-dimensional’ MDP…”
28. Given bases and policy, an approximate low-D reward and transition model are defined
29. By Parr: Given bases, the exact solution to approx policy evaluation defined by low-D rewards and transitions is the same as the fixed point solution of exact policy evaluation on the full MDP projected onto the bases.
1. The computational complexity is reduced from cubic in n <the number of states I assume> to cubic in k <the number of bases I assume>
30. “For continuous <state> MDPs, the basis functions are represented on a set of ‘sampled’ states.”
31. Considers cases of reward-aware and reward-agnostic basis functions
32. Construction of bases according to policies is done during RPI
1. “It is possible to make the policy-specific bases additionally customized to a particular reward, or indeed, to have them invariant to the reward function.  At every new round of policy iteration, a new set of bases will have to be generated.  Consequently, in this formulation, the bases constructed have a fairly short life span, and multiple sets of bases have to be constructed for solving just a single MDP. On the other hand, by tuning the bases to a specific policy and reward function, it is possible to develop some theoretical guarantees on how well they can approximate Vπ.”
33. There are also incremental (one basis at a time) and batch methods (all at once) of basis construction
1. Bases may also be constructed incrementally from samples taken from the MDP
34. One method of basis construction is state aggregation
1. Discusses clustering according to Bellman residual – error bounds are given
35. Onto “Finding Invariant Subspaces of a MDP”
1. “Unfortunately, transition matrices Pa are often not reversible, in which case, diagonalization leads to complex eigenvectors. One solution is to replace Pa with a related stochastic matrix, i.e., reversible and whose eigenspaces provide a suitable space to compress a MDP.”
1. On the other hand “For the sub-class of diagonalizable transition matrices represented by reversible Markov chains, the transition matrix is not only diagonalizable, but there is also an orthonormal basis.”
36. <Timing out on this one – think I basically get the idea from his other works>