## Regularization and Feature Selection in LSPI. Kolter, Ng. ICML 2009

1. This is the paper that utilized L1 regularization for feature selection in LSPI
2. “Our framework differs from typical applications of L1 regularization in that solving the L1 regularized LSTD problem cannot be formulated as a convex optimization problem; despite this, we show that a procedure similar to the Least Angle Regression (LARS) algorithm is able to efficiently compute the optimal solution”
3. LSTD has a few problems: if the number of basis functions k is large, “a prohibitively” large number of samples is needed to get a good estimate for the weights.  Similarly, too little data will cause overfitting.  Also, costs related to the number of basis functions is large because matrix inversions have to be performed on a kxk matrix which can be expensive
4. Using L1 regularization with LSTD is difficult because L1 is usually used in algorithms based on convex optimization, but LSTD is based on a fixed-point solution
5. Modifications require adding a regularization term to the regular LSTD equation
6. L2 regularization is easier because it allows for closed-form solutions to the problem, but this doesn’t fix the problems brought up in 3.
7. LARS seems to do a piecewise optimizations of weights (one weight at a time)
8. The optimization from LARS involves a couple of steps. First, the optimality conditions of the problem must be determined, then the weight is solved for and updated
9. Each iteration of the algorithm has cost O(mkp^3), memory requirements of O(k+p^2) where p is the number of non-zero coefficients in w.  Says in practice the total cost of the algorithm (when efficiently implemented) is O(mkp^3).  The assumption is also that p < k, so the polynomial cost on p is not that bad (in the case where p=k, the algorithm is O(mk^4), which is quite bad)
10. The algorithm up to now was still concerned with an adaptation of the normal TD alg. to us L1 regularization; can be applied to LSTD(lambda) and LSPI
11. There is a theorem that shows LARS-TD finds an L1 regularized fixed-point under certain conditions.  Part of the algorithm (related to beta-values, a regularization parameter does not have such a guarantee).  They give  a simple example of where the algorithm fails, but say in most cases it is not an issue.
1. It is possible to identify when these cases are hit, at least
12. Experiments show L1 regularization is very robust to (thousands of) irrelevant features, while L2 is not.
1. Also show that Method with L1 regularization does not have  a cost that increases quickly as compared to the method with L2 regularization where there are many irrelevant features, due to p being small in those cases.
2. The extra features used in these cases are really just garbage noise, which is very easy to deal with by a number of methods
3. These synthetic experiments are really just to show why L1 regularization is good
13. The experiment with mountain car is more along the lines of why it would useful in practice; the experimental setup is more representative of how it would actually be useful
14. The data used for the mountain car experiment is very small; just 500 sars’ samples consisting of short trajectories from random states.  Although the method of data collection is batch-mode as opposed to regular RL, that is impressive.  The policy found is not optimal, but it is pretty good.