Linear Complementarity for Regularized Policy Evaluation and Improvement. Johns, Painter-Wakefield, Parr. NIPS 2010

Based on a very rapid read-through before I go home

  1. Also for doing L1 regularization for feature selection
  2. Here, the problem is formulated as “… a linear complementarity problem (LCP).” <I don’t know what that means off the top of my head>
  3. Similar to LARS-TD, but has a few advantages
    1. Easier to apply off-the shelf tools
    2. Can be warm started
  4. LCPs can describe any convex quadratic problem, even though the basic setup is pretty simple
  5. Turns out this approach is mathematically very similar to LARS but this method is computationally cheaper
    1. <Confusing they say that> Their approach is more expensive than LARS
  6. Also, LARQ (their proposal) can get “stuck” and its difficult to correct. “This occurs when the greedy policy for a particular β is not well defined.  In such cases, the algorithm attempts to switch to a new policy immediately following a policy change.  This problem is not unique to LARQ.  Looping is possible with most approximate policy iteration algorithms.   What makes it particularly troublesome for LARQ is that there are few satisfying ways of addressing this issue without sacrificing the invariants.”
  7. They present some changes that attempt to work around these issues (for example it can’t get stuck), and call this algorithm LC-MPI
  8. In experiments they use LARS-TD and LC-TD within policy iteration
    1. Experiments are used “… a single value of the L1 regularization parameter…”
  9. Experiments on 20-state chain, as well as mountain car
  10. In the chain “For features, we used 1000 Gaussian random noise features along with five equally spaced radial basis functions … and a constant function”
    1. So there is a ton of garbage and a small amount of actual feature
    2. <On the other hand, Ali’s stuff shows it is very simple to get rid of features that are complete garbage, like white noise.  Its much harder to throw out features that carry information, but from other tasks>
  11. <They just rely on very short episodes with random state initialization to do the data collection/exploration>
  12. “When LARS-TD and LC-TD are used as subroutines within policy iteration, the process ends at a single value of the L1 regularization parameter β.  The policy iteration loop must be rerun to consider different values of β.  In this section, we show how much computation can be saved by running LC-MPI once (to produce m greedy policies, each at a different value of β) versus running policy iteration m separate times.”
    1. Warm starting helps a bunch

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: