- Approach is to solve MDPs by iteratively considering low-D reps and approx optimal policies
- “Exact solutions of discounted and average-reward MDPs are expressed in terms of a generalized spectral inverse of the Laplacian called the Drazin inverse.”
- Algorithm is called representation policy iteration (RPI)
- Model-free and model-based versions are given
- “Two approaches for dimensionality reduction of MDPs are described based on geometric and reward-sensitive regularization, whereby low-dimensional representations are formed by diagonalization or dilation of Laplacian operators.”
- “… the aim is to solve MDPs by automatically finding ‘low-dimensional’ descriptions of ‘high-dimensional’ functions on a state (action) space.”
- The relevant functions are the policy, transitions, and value
- Method tries to find representations based on the underlying manifold dynamics
- Means method is usable in continuous (state) problems
- In the Laplacian, the off-diagonal elements are nonpositive and the row sums are 0
- Means Laplacians are singular, so they never have an ordinary (direct) inverse
- Moore-Penrose pseudoinverse is the most common method when direct inverse isn’t possible
- But there is another generalized inverse, the Drazin inverse, which is spectral in nature and is very important in the study of Markov chains
- In particular, for an MDP M with transition matrix T, we care about A = I - T
- As well as the Drazin inverse of A
- But then they go on to call A “L” and T “P”, so L = I - P
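- <My own numpy sketch, not from the paper: for an ergodic chain, L = I - P has index 1, so its Drazin inverse is the group inverse, and Meyer’s identity gives L^D = (I - P + Π)^{-1} - Π with Π = 1π^T the limiting matrix. The toy chain below is mine:>

```python
import numpy as np

# My own toy chain, not from the paper. For an ergodic chain, L = I - P has
# index 1, so its Drazin inverse is the group inverse, and Meyer's identity
# gives  L^D = (I - P + Pi)^{-1} - Pi,  where Pi = 1 pi^T is the limiting matrix.

P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])

# Stationary distribution pi: normalized left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()

n = P.shape[0]
Pi = np.outer(np.ones(n), pi)           # limiting matrix 1 pi^T
L = np.eye(n) - P                       # Laplacian of the chain
L_drazin = np.linalg.inv(L + Pi) - Pi   # Drazin (group) inverse via Meyer's identity

# Sanity checks on defining properties: L L^D L = L and L L^D = L^D L.
assert np.allclose(L @ L_drazin @ L, L)
assert np.allclose(L @ L_drazin, L_drazin @ L)
```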
- Getting a good low-dimensional representation depends on finding the right basis functions that span the underlying manifold
- Deciding what basis to use is fundamentally about making tradeoffs – for example, in a discrete domain, a tabular representation (one basis per state) yields exact results but is expensive and noncompact; a sparser representation is cheaper in some ways <although with my critical CS hat on, I think the big-O costs are really the same, as there are cubic space and computational costs in the linear algebra that don’t go away even when using fewer bases>
- “… the basis construction problem may require as much or more computational effort than required to solve the original MDP.”
- On the other hand, in the case where transfer can be utilized, this extra cost can be amortized across instances
- Two main classes of methods are considered for basis construction in MDPs:
- Diagonalization (of the graph Laplacian)
- Dilation: “For example, a dilation operator on the space of functions on real numbers is Tf(x) = f(2x). Several dilation-based approaches will be compared, including methods based on Krylov spaces, a standard approach of solving systems of linear equations […]. Applied to MDPs, this approach results in the reward function being ‘dilated’ by powers of some operator, such as the transition matrix […]. A novel basis construction method called Drazin bases is described in this paper, which uses the Drazin inverse of the Laplacian LD… the discounted value function of a MDP can be written in terms of a series of powers of the Drazin inverse.”
- Another dilation-based option is wavelets; more specifically, for this setting, diffusion wavelets (see the sketch of both basis families below).
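- <My own toy sketch contrasting the two basis families: Laplacian eigenvectors (reward-agnostic diagonalization) vs. a Krylov basis that “dilates” the reward by powers of the transition matrix (reward-sensitive). Function names and the toy graph are mine:>

```python
import numpy as np

# My own toy setup, not the paper's experiments.

def laplacian_eigenbasis(W, k):
    """Diagonalization: the k smoothest eigenvectors of the normalized graph
    Laplacian L = I - D^{-1/2} W D^{-1/2} built from a symmetric weight
    matrix W (reward-agnostic basis)."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(d)) - D_inv_sqrt @ W @ D_inv_sqrt
    evals, evecs = np.linalg.eigh(L)
    return evecs[:, :k]                 # eigenvectors with the smallest eigenvalues

def krylov_basis(P, R, k):
    """Dilation: orthonormalized Krylov space span{R, PR, ..., P^(k-1) R}
    (reward-sensitive basis)."""
    vecs = [R]
    for _ in range(k - 1):
        vecs.append(P @ vecs[-1])       # "dilate" the reward by the transition matrix
    Q, _ = np.linalg.qr(np.stack(vecs, axis=1))
    return Q

# Example: 5-state chain graph with reward only at the last state.
W = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
P = W / W.sum(axis=1, keepdims=True)
R = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
Phi_spectral = laplacian_eigenbasis(W, k=3)
Phi_krylov = krylov_basis(P, R, k=3)
```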
- Considers average and discounted reward cases (for average reward cases, value is called gain)
- Mentions Laplacians for directed graphs <but I’m pretty sure this isn’t an option for arbitrary directed graphs, only special cases.>
- In general, need to consider “generalized inverses” of the Laplacian – it is noninvertible whenever it is not of full rank (and here it never is, since the row sums are zero)
- There are 6 properties that a true inverse satisfies; the Moore-Penrose pseudoinverse satisfies 4 of them and will be used for inverses related to value functions
- The Drazin inverse is important in the study of Markov chains
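- <My own quick check of the four Penrose conditions for the pseudoinverse of a singular Laplacian, and of the commutation property that the Drazin inverse satisfies but the Moore-Penrose pseudoinverse generally does not:>

```python
import numpy as np

# My own illustration: the Moore-Penrose pseudoinverse of a singular Laplacian
# satisfies the four Penrose conditions, but generally does not commute with L;
# the Drazin inverse satisfies XLX = X, LX = XL, and L^(k+1) X = L^k instead.

P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
L = np.eye(2) - P
X = np.linalg.pinv(L)

print(np.allclose(L @ X @ L, L))        # (1) L X L = L        -> True
print(np.allclose(X @ L @ X, X))        # (2) X L X = X        -> True
print(np.allclose((L @ X).T, L @ X))    # (3) L X symmetric    -> True
print(np.allclose((X @ L).T, X @ L))    # (4) X L symmetric    -> True
print(np.allclose(L @ X, X @ L))        # commutation          -> False here
```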
- <Paper is quite long – skimming most of it>
- The graph Laplacian induces a reproducing kernel Hilbert space (RKHS)
- Approximate methods for solving MDPs are discussed: Least-squares, linear programming, Hilbert space
- “All these approaches depend crucially on a choice of basis functions for constructing a low-dimensional representation of the MDP, and assume these are provided by the human designer. <I think topic of basis construction is addressed in a later chapter>”
- “An elegant and general way to formalize value function approximation builds on the framework of approximation in a type of vector space called a Hilbert space […].”
- A Hilbert space is a (complete) vector space equipped with an inner product between any two vectors, which induces a notion of length (for example, Euclidean space is a Hilbert space, with inner product simply f^T g).
- For MDPs, distance has to be weighted according to frequency of visitation of states
- If the bases are orthonormal, the projection onto them (the best approximation) can be defined by an abstract Fourier series
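- <Small sketch in my own notation: projection under the visitation-weighted inner product ⟨f, g⟩_ρ = f^T D_ρ g; when the basis is ρ-orthonormal the coefficients reduce to abstract Fourier coefficients:>

```python
import numpy as np

# My notation: the visitation-weighted inner product is <f, g>_rho = f^T D_rho g,
# where D_rho = diag(rho) weights states by how often the policy visits them.

def weighted_projection(Phi, V, rho):
    """Project V onto span(Phi) under the rho-weighted inner product."""
    D = np.diag(rho)
    w = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ V)  # weighted least-squares coefficients
    return Phi @ w

# If the columns of Phi are orthonormal w.r.t. <.,.>_rho, the coefficients
# reduce to the abstract Fourier coefficients w_i = <phi_i, V>_rho.
```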
- Least-squares approaches come in two flavors: those based on the Bellman residual, and those based on the (projected) fixed point
- <This stuff doesn’t work, moving along>
- LSTD (least-squares temporal difference)
- LSPI (least-squares policy iteration)
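- <Model-based matrix form of the LSTD fixed-point solution in my own notation; LSPI wraps this kind of solve around state-action features with greedy policy improvement:>

```python
import numpy as np

# My notation: model-based matrix form of the LSTD fixed-point solution for
# evaluating a policy with basis Phi, transition matrix P_pi, reward R,
# discount gamma, and state weighting rho:
#   solve  Phi^T D (Phi - gamma P_pi Phi) w = Phi^T D R,  then  V_hat = Phi w.

def lstd(Phi, P_pi, R, rho, gamma=0.95):
    D = np.diag(rho)
    A = Phi.T @ D @ (Phi - gamma * P_pi @ Phi)
    b = Phi.T @ D @ R
    w = np.linalg.solve(A, b)
    return Phi @ w
```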
- VI via RKHS is related to support vector regression
- “… it is possible to show that a least-squares approximation of the value function on a basis matrix Φ is equivalent to solving an exact ‘low-dimensional’ MDP…”
- Given bases and policy, an approximate low-D reward and transition model are defined
- Result by Parr: given the bases, the exact solution of the approximate policy-evaluation problem defined by the low-D rewards and transitions is the same as the fixed-point solution of exact policy evaluation on the full MDP, projected onto the bases.
- The computational complexity is reduced from cubic in n <the number of states I assume> to cubic in k <the number of bases I assume>
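- <My own sketch of the compressed model and Parr’s equivalence: the weights from solving the exact k-dimensional model match the projected (LSTD) fixed point, and the solve is cubic in k rather than in n:>

```python
import numpy as np

# My notation: compress the MDP onto the basis, then solve the exact
# k-dimensional model. By Parr's result the weights coincide with the
# projected (LSTD) fixed point on the full MDP, and the solve is cubic
# in k rather than in n.

def compressed_model(Phi, P_pi, R, rho):
    D = np.diag(rho)
    G = np.linalg.inv(Phi.T @ D @ Phi)    # k x k Gram-matrix inverse
    P_hat = G @ Phi.T @ D @ P_pi @ Phi    # compressed transition model (k x k)
    R_hat = G @ Phi.T @ D @ R             # compressed reward (k-dimensional)
    return P_hat, R_hat

def low_dim_policy_evaluation(Phi, P_pi, R, rho, gamma=0.95):
    P_hat, R_hat = compressed_model(Phi, P_pi, R, rho)
    k = Phi.shape[1]
    w = np.linalg.solve(np.eye(k) - gamma * P_hat, R_hat)
    return Phi @ w                        # equals the LSTD solution on the full MDP
```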
- “For continuous <state> MDPs, the basis functions are represented on a set of ‘sampled’ states.”
- Considers cases of reward-aware and reward-agnostic basis functions
- Construction of bases according to policies is done during RPI
- “It is possible to make the policy-specific bases additionally customized to a particular reward, or indeed, to have them invariant to the reward function. At every new round of policy iteration, a new set of bases will have to be generated. Consequently, in this formulation, the bases constructed have a fairly short life span, and multiple sets of bases have to be constructed for solving just a single MDP. On the other hand, by tuning the bases to a specific policy and reward function, it is possible to develop some theoretical guarantees on how well they can approximate Vπ.”
- There are also incremental (one basis at a time) and batch methods (all k at once) of basis construction
- Bases may also be constructed incrementally from samples taken from the MDP
- One method of basis construction is state aggregation
- Discusses clustering according to Bellman residual – error bounds are given
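- <Tiny sketch of my own: state aggregation expressed as an indicator basis matrix, so the approximate value function is piecewise constant over clusters:>

```python
import numpy as np

# My own sketch: state aggregation as an indicator basis matrix, so any value
# function represented in this basis is piecewise constant over clusters.

def aggregation_basis(cluster_of_state, num_clusters):
    n = len(cluster_of_state)
    Phi = np.zeros((n, num_clusters))
    Phi[np.arange(n), cluster_of_state] = 1.0
    return Phi

# e.g. 6 states aggregated into 3 clusters
Phi = aggregation_basis(np.array([0, 0, 1, 1, 2, 2]), 3)
```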
- On to “Finding Invariant Subspaces of a MDP”
- “Unfortunately, transition matrices Pa are often not reversible, in which case, diagonalization leads to complex eigenvectors. One solution is to replace Pa with a related stochastic matrix, i.e., reversible and whose eigenspaces provide a suitable space to compress a MDP.”
- On the other hand “For the sub-class of diagonalizable transition matrices represented by reversible Markov chains, the transition matrix is not only diagonalizable, but there is also an orthonormal basis.”
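- <One standard construction, not necessarily the paper’s exact choice: the additive reversibilization of P with respect to its stationary distribution π is reversible and stochastic, hence has real eigenvalues and a π-orthonormal eigenbasis:>

```python
import numpy as np

# A standard construction (not necessarily the paper's exact choice): the
# additive reversibilization of P with respect to its stationary distribution
# pi,  P_rev = (P + Pi^{-1} P^T Pi) / 2  with  Pi = diag(pi),  is a reversible
# stochastic matrix, so it has real eigenvalues and an eigenbasis that is
# orthonormal in the pi-weighted inner product.

def reversibilize(P, pi):
    Pi = np.diag(pi)
    Pi_inv = np.diag(1.0 / pi)
    P_star = Pi_inv @ P.T @ Pi      # time-reversed chain
    return 0.5 * (P + P_star)
```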
- <Timing out on this one – think I basically get the idea from his other works>