- Approach is to solve MDPs by iteratively considering low-D reps and approx optimal policies
- “Exact solutions of discounted and average-reward MDPs are expressed in terms of a generalized spectral inverse of the Laplacian called the
*Drazin inverse*.” - Algorithm is called representation policy iteration (RPI)
- Model-free and model-based versions are given

- “Two approaches for dimensionality reduction of MDPs are described based on geometric and reward-sensitive regularization, whereby low-dimensional representations are formed by diagonalization or dilation of Laplacian operators.”
- “… the aim is to solve MDPs by automatically finding ‘lowdimensional’ descriptions of ‘high-dimensional’ functions on a state (action) space.”
- The relevant functions are the policy, transitions, and value

- Method tries to find representations based on the underlying manifold dynamics
- Means method is usable in continuous (state) problems

- In the Laplacian off-diagonal elements are nonpositive, and row sums are 0
- Means they are singular and therefore in general don’t have a direct inverse
- Moore-Penrose pseudoinverse is the most common method when direct inverse isn’t possible
- But something called the Drazin inverse is for spectral cases and is very important in study of Markov chains

- In particular, for an MDP
*M*with transition matrix*T*, care about*A*=*I**-T*- As well as the Drazin inverse of
*A* - But then the go on to call
*A*,*L*, and*T*, P, so*L=I-P*

- Getting a good low-dimensional representation depends on finding the right basis functions that span the underlying manifold
- Deciding what basis to use is fundamentally about making tradeoffs – for example, in a discrete domain, using tabular methods will yield optimal results but will be expensive and noncompact; using a sparser representation is cheaper in some ways <although with my critical CS hat on, I think the big-O costs are really the same, as there are order-3 space and computational costs in the linear algebra that don’t go away even when using fewer bases>
- “… the basis construction problem may require as much or more computational effort than required to solve the original MDP.”
- On the other hand, in the case where transfer can be utilized, this extra cost can me amortized across instances

- Two main classes of methods are considered for basis construction in MDPs:
- Diagonalization (of the graph Laplacian)
- Dilation: “For example, a dilation operator on the space of functions on real numbers is
*Tf*(*x*) =*f*(2*x*). Several dilation-based approaches will be compared, including methods based on*Krylov*spaces, a standard approach of solving systems of linear equations […]. Applied to MDPs, this approach results in the reward function being ‘dilated’ by powers of some operator, such as the transition matrix […]. A novel basis construction method called*Drazin**bases*is described in this paper, which uses the Drazin inverse of the Laplacian*L*… the discounted value function of a MDP can be written in terms of a series of powers of the Drazin inverse.”^{D} - Another option is wavelets, or more specifically for this case, diffusion wavelets are also based on dilation.

- Considers average and discounted reward cases (for average reward cases, value is called
*gain*) - Mentions Laplacians for directed graphs <but I’m pretty sure this isn’t an option for
*arbitrary*directed graphs, only special cases.> - In general, need to consider “generalized inverses” of the Laplacian – it may be noninvertible if it is not of full rank
- There are 6 properties the actual inverse needs to satisfy; the Moore-Penrose pseudo inverse satisfies 4 and will be used for inverses related to value functions
- The Drazin inverse is important in the study of Markov chains

- <Paper is quite long – skimming most of it>
- The graph Laplacian induces a reproducing kernel hilbert space (RKHS)
- Approximate methods for solving MDPs are discussed: Least-squares, linear programming, Hilbert space
- “All these approaches depend crucially on a choice of basis functions for constructing a low-dimensional representation of the MDP, and assume these are provided by the human designer. <I think topic of basis construction is addressed in a later chapter>”

- “An elegant and general way to formalize value function approximation builds on the framework of aproximation in a type of vector space called a Hilbert space […].”
- A Hilbert space is a vector space where length is defined as an inner product between any two vectors (for example, Euclidian space is a Hilbert space, with inner product simply being
*f*^{T}*g*. - For MDPs, distance has to be weighted according to frequency of visitation of states
- If bases are orthonormal, projection to find most similar elements can be defines by
*abstract Fourier series* - Least-squares approaches consist of those based on Bellman residual, and that of fixed points
- <This stuff doesn’t work, moving along>

- LSTD
- LSPI
- VI via RKHS is related to support vector regression
- “… it is possible to show that a least-squares approximation of the value function on a basis matrix Φ is equivalent to solving an exact ‘low-dimensional’ MDP…”
- Given bases and policy, an approximate low-D reward and transition model are defined
- By Parr: Given bases, the exact solution to approx policy evaluation defined by low-D rewards and transitions is the same as the fixed point solution of exact policy evaluation on the full MDP projected onto the bases.
- The computational complexity is reduced from cubic in
*n*<the number of states I assume> to cubic in*k*<the number of bases I assume>

- The computational complexity is reduced from cubic in
- “For continuous <state> MDPs, the basis functions are represented on a set of ‘sampled’ states.”
- Considers cases of reward-aware and reward-agnostic basis functions
- Construction of bases according to policies is done during RPI
- “It is possible to make the policy-specific bases additionally customized to a particular reward, or indeed, to have them invariant to the reward function. At every new round of policy iteration, a new set of bases will have to be generated. Consequently, in this formulation, the bases constructed have a fairly short life span, and multiple sets of bases have to be constructed for solving just a single MDP. On the other hand, by tuning the bases to a specific policy and reward function, it is possible to develop some theoretical guarantees on how well they can approximate Vπ.”

- There are also incremental (one basis at a time) and batch methods (all
*k*at once) of basis construction- Bases may also be constructed incrementally from samples taken from the MDP

- One method of basis construction is state aggregation
- Discusses clustering according to Bellman residual – error bounds are given

- Onto “Finding Invariant Subspaces of a MDP”
- “Unfortunately, transition matrices P
^{a}are often not reversible, in which case, diagonalization leads to complex eigenvectors. One solution is to replace P^{a}with a related stochastic matrix, i.e., reversible and whose eigenspaces provide a suitable space to compress a MDP.”

- On the other hand “For the sub-class of diagonalizable transition matrices represented by reversible Markov chains, the transition matrix is not only diagonalizable, but there is also an orthonormal basis.”

- “Unfortunately, transition matrices P
- <Timing out on this one – think I basically get the idea from his other works>

Advertisements