Standing Balance Control Using a Trajectory Library. Liu, Atkeson. IROS 2009

1. The paper is deals with a balance controller that explicitly handles pushes
2. Similar to previous trajectory library work by the authors, the method uses a library of optimal trajectories, and a method called neighboring optimal control to generate near optimal policies
3. Two optimizations are used in sequence: first SNOPT, a parametric nonlinear optimization method, and then DDP to further refine the policy
4. They use a form of pruning to keep the trajectory library small
5. Compare against a couple of other methods: LQR, and an optimal one
6. This paper deals only with balancing, not walking, controls a torque at the foot (which is symmetric), and knee
7. “DDP is a second order gradient method and it can converge to a better solution if the starting trajectory is good, otherwise the convergence is slow or it may even fail.  Parametric nonlinear programming methods have been used to solve trajectory optimization problems.  We find they are generally more robust in terms of finding a solution than DDP.”
8. “SNOPT is a general-purpose system for constrained optimization using a sparse sequential quadratic  programming (SQP) method.
1. This is used to generate starting trajectories (not clear what this means to me)
9. For constant pushes, initial joint angles and velocities are all zero
10. For constant pushes, the robot eventually leans into the pushes and attains zero joint torque
11. In order to return to standing after the push terminates, initial conditions with nonzero joint angles are also considered; they make a library by dividing the 2 push variables (location and magnitude) over a grid.
1. Seems pushes are always horizontally oriented
12. DDP is used to refine the trajectories produced by SNOPT
13. There is an adaptive method that defines the boundaries on the cells of trajectory libraries so performance is guaranteed to a certain bound
1. At this point its not clear how they make that computation
14. They compare the performance to an optimal controller – I’m not sure what that is though (is it just a policy computed live instead of from the library?)
15. Unfortunately, the paper really lacks details (its an IROS paper and is 6 pages with references, really a 5 page paper).  Hard to really do anything with what is outlined here unless there is an ARXIV or some other extended version.

Policies Based on Trajectory Libraries. Stolle, Atkeson. ICRA 2006

Just writing a quick note to myself after a cursory read yesterday.  Paper at a very high level discusses the idea that trajectory libraries that are generated offline can be used for rapid decision making online.  The concept here is used by the impressive work by Tassa and Erez for their different forms of real time (or near real time) planning.\

As I said, there isn’t really a particular algorithm that is discussed, and the experimental results are pretty weak.  If I read it myself without reading the previous work I mentioned I would probably just dismiss it.  But the proof is in the pudding.  I should try it out on my stuff.

Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms. Lihong Li, Wei Chu, John Langford, Xuanhui Wang. WSDM 2011.

On videolectures:http://videolectures.net/wsdm2011_li_uoe/

1. Clearly, basically about contextual bandits
2. Introduce a replay methodology for contextual bandit algorithm evaluation.
1. Totally data driven (another method is to construct a simulator)
3. Method is provably unbiased
4. Motivation is similar to what we did in Pitfall.  It can be costly to run each potential method in real life, so the goal is to generate enough data that we can effectively simulate each method from the logged data
5. “…data in bandit-style applications only contain user feedback for recommendations that were actually displayed to the user, but not all candidates.  This ‘partial-label’ nature raises a difficulty that is the key difference between evaluation of bandit algorithms and supervised learning ones.”
6. The treat RL from batch data as an “off-policy evaluation problem”
7. If I read this again skip everything before section 3.
8. I don’t get why, but the claim is that each history has an identical probability in the real world as in the policy evaluator.  Therefore,
9. Basically the process is:
1. Go through data corpus sample by sample
2. For a logged sample <s_i,a_i,r_i>, give the algorithm the context/state, s_i
3. The algorithm returns desired action a
4. If a_i =a, give the sample to the program to update its history.  Else just ignore this sample and move on
10. Estimation error approaches 0 with increasing data.
1. Error is O(sqrt(K/|D|)), |D| is size of data set; drops off quickly
11. Empirical results of results on corpus consisting of Yahoo click through, actual results fit very nicely with bounds

Model-Free Monte Carlo–like Policy Evaluation. Raphael Fonteneau, Susan A. Murphy, Louis Wehenkel, Damien Ernst. AISTATS 2010

(I accidentally had the wrong title here before, fixed)

1. Seems similar in spirit to (same 1st author): Inferring bounds on the performance of a control policy from a sample of trajectories.
2. Attempts to do value approximation by using corpus of SARSes, which is made of a number of “broken trajectories”
3. Assumption everything is Lipshitz
4. Strength is method doesn’t use FAs
5. Stochasticity is tolerated
6. Gives equations for bias and variance of monte carlo estimators
7. Algorithm is setup to minimize discrepancy between rollout resulting from corpus vs that from true model if it was available
8. Each sample in the corpus is used at most once (not sure why thats necessary)
9. Sample used for the transition is one that minimizes a particular distance metric in the SxA space
10. Gives bias and variance of their evaluator
1. “When the sample sparsity becomes small, the bias of the estimator decreases to zero and its variance converges to the variance of the Monte Carlo estimator.”

Approximately Optimal Approximate RL. Kakade, Langford

• Its basically about a safer version of policy search
• Whats wrong with current methods of policy search?  They present a couple of very simple/easy MDPs where policy gradient methods take an extremely long (exponential) amount of time to solve.
• Requires what they call a restart distribution, which is weaker than a generative model, gives samples of next states from a chosen distribution, which allows the algorithm to get data from states that normally would not be visited
• Requires a black box greedy policy chooser which picks what it thinks is the best action based on the current policy.  The trick here is that the greedy policy chooser isn’t always followed
• Algorithm is called conservative policy iteration which has the following properties:
• Policy is improved in a more uniform manner over the state space, by incorporating exploration allowed by restart distribution
• A more conservative policy update is done which keeps some of the current policy and improves the policy of the other states, which is supposed to alleviate problems of policy degredation that arise from approximate methods
• Results are for an algorithm that is near-optimal that requires a small number of steps independent of the state space size
• Policy generated is stochastic
• Seems to be for discrete spaces?
• They want answers to the following questions:
• Is there a performance measure that is guaranteed to improve at each step?
• How difficult is it to verify if a particular update improves this measure? (This seems most related to Qpi)
• After a reasonable # of policy updates, what is the performance level?
• Argue that greedy DP and policy gradient methods dont have good answers to these questions
• Their arguments against greedy DP aren’t very convincing, but they do say there isn’t enough theoretical results on the approach to say that it is good w.r.t. to the three issues above
• At least policy gradient is always guaranteed to improve (I’m not sure this is so imporant, thats why we care about explore-exploit).  Somewhat related to this, they say the problem with gradient methods is that because of lack of exploration, a very large number of samples is required to estimate gradient direction.
• Give argument against good performance of any on-policy method (gradient is one of these) in simple hallworld-domain.  Issue is that the gradient is almost always zero unless the goal is found, which leads to random (undirected) wandering, which makes it hard to find the goal, so the gradient seems to be zero everywhere unless a really huge amount of wandering is done (say importance sampling exists, but isnt very helpful)
• Have an even worse example in a trivial 2-state MDP
• Basic argument is that policy gradient methods may require data from hard to reach states based on current policy to improve policy
• Trick is to optimize the policy wrt a more global distribution over start states as opposed to the real distribution over start states, it makes sure you optimize all states
• Discuss policy improvement w.r.t. gradient, give a bound in expected benefit of doing policy improvement, the greater the policy advantage the greater the performance increase
• Use an epsilon greedy policy chooser, policy improvement no longer occurs when the greedy chooser no longer grants an advantage
• Conservative Policy Iteration Algorithm:
• Get epsilon-greedy policy
• Estimate the advantage (if this is a required step, I don’t think this paper is relevant to Qpi.  Nevermind, they talk about how to estimate this)
• If advantage is small stop
• Else update policy and loop
• Says optimal policy advantage of pi is less than 2 epsilon (but this variable is overloaded, so I’m not sure which they mean)
• Algorithm terminates in polynomial time
• In order to make an epsilon-optimal accurate estimate of the advantage, a certian number of restarts are done and I guess the result is the mean of the trajectories, use Hoeffding bound
• I think the meat of where this is relevant to Qpi is on page 6
• Show with proof that using a uniform starting distribution is more helpful than the actual starting distribution to compute the advantage
• Quick discussion of algorithm used with function approximators

Inferring bounds on the performance of a control policy from a sample of trajectories. Fonteneau, Murphy, Wehenkel, Ernst

1. Interested in inferring bounds on finite horizon return of a policy from an off-policy sample of trajectories
1. Is this relevant to qPi stuff?
2. Assumes dynamics, policy, and reward are deterministic and Lipshitz
3. Does not require trajectories; only requires SARSes
4. Discuss an algorithm very similar to Viterbi to identify the the sequence of SARSes leading to the best bound, tightness of bound is also covered
1. So is qPi solved then, no qPi is the other way around.
2. Bound is related to alpha*C, where C is some constant and alpha is the maximum distance between any element of the state-action space and its closest SARS sample
5. Policies considered can be time-dependent but the domain itself is time-invariant
6. Everything in domain is Lipshitz (with known constant), but dynamics are unknown
7. Goal is to find a lower bound on the return over T steps in an MDP for any policy from any start state
8. This paper finally proves that the value function is Lipshitz if the MDP is as well.  I’ve seen many many papers make this claim, but this is the first I’ve seen that proves it.
9. Flipping equations can give upper bounds instead of lower bounds.
10. “…the lower (and upper) bound converges at least linearly towards the true value of the return with the density of the sample”
11. “The proposed approach could also be used in combination with batch-mode reinforcement learning algorithms for identifying the pieces of trajectories that influence the most the lower bounds of the RL policy and, from there, for selecting a concise set of four-tuples from which it is possible to extract a good policy.”

QPi Acrobot data generated, logged

Just a quick post, but finished up logging QPi data for Acrobot. Experimental settings were essentially the same as in hillcar (50 trajectories for pi, and pi_e, same distribution of time for qPi, all variables perturbed by N~(0, 0.01) noise).

A visualization from one of the logs:

X:base joint angle, Y:joint angle at next step, color coded by action

I have to say, visualizing the data with Weka is very valuable; it gives a much better understanding of what is going on than you could ever get from just looking at log data. From the figure above, for example, its easy to tell that blue and green correspond to loading torque in one direction or the other and that red is the no-torque action.

So the question from here is what next?  I could keep logging more data, or perhaps more of the math/theory part should be next?

Hillcar with qPI and Weka

So I used my fitted Q-iteration code to build a policy for hillcar which is not optimal.  When run it took about 250 steps to reach the goal (in this particular implementation, optimal seems to be just under 100 steps).  For the experiments where the data was collected, noise was added to the position (normally distributed with μ=0, σ=0.01) because otherwise everything would be deterministic and all the trajectories that followed the policy alone would be identical.

For the experiments with pi_e (for exploration), pi_e was set to repeat the last action with probability 0.97, and otherwise to switch actions (bang bang control was used).  pi_e was used only at the beginning of a trajectory, and was used for a number of steps uniformly distributed between 20 and 100, and that data was logged as well.

50 trajectories were recorded in each setting, where every trajectory lasted from the start until the terminal state was reached.

Hillcar without Pi_e

Hillcar with Pi_e

The coloring is for the action taken – blue action corresponds to a left movement, red is to the right.

ARFFs are available: Hillcar without pi_e and Hillcar with pi_e.

I think overall the data actually looks pretty similar, except there is more of a blue swirl on the region immediately to the left of center in the data that uses pi_e.  Maybe I should have left pi_e on longer.   You can see part of the suboptimality of the policy is that it gets hung up going left on the leftmost end of the valley.

After that, I wrote fitted Q-iteration algorithm using Weka.  The entire program is just 150 lines, and took just an afternoon to write.  In particular I used their artificial nerual nets.  Since this is not an averager, it isn’t safe to use in algorithms like q-iteration, and in this instance the value function seems to doubling every iteration, so clearly something is wrong.  At any rate, this was meant more as a proof of concept than anything, and it shows that leveraging existing supervised learning frameworks can be a win.  Aside from that, Weka’s visualization tools can be very valuable for quickly inspecting data and results.  Here’s what the results of q-iteration look like for hillcar:

Hillcar value function, blue is low value, yellow is high value

So there it is.  I hope to have data from Acrobot sometime during the next week.