PILCO: A Model-Based and Data-Efficient Approach to Policy Search. Deisenroth, Rasmussen. ICML 2011

  1. Paper uses model-building along with policy search to get better sample complexity
  2. Discusses the problem of model bias.  I think this is a problem only in cases where you aren’t doing smart exploration, though.  The papers they reference for that are from the mid-to-late 90s
  3. Anyway, they propose using a probabilistic model of the dynamics; they use non-parametric Gaussian processes for this
  4. The uncertainty is also incorporated into planning and policy evaluation
  5. PILCO stands for probabilistic inference for learning control
  6. There are some other references for how to deal with model uncertainty that are mentioned
  7. They build a separate model for each dimension (of the next state being predicted)
  8. Does policy gradients based on the stochastic model
  9. The policy gradient is computed analytically, which is much better than estimating it by sampling
  10. It’s very surprising this actually works on real robots
    1. (it’s a little strange that they say the data they report is “typical”, not best or worst case – they should just present all the data and let the reader decide)
  11. Although the models are separate for each dimension, they covary.  PILCO also takes the correlation of all control and state dimensions into account during planning and control
  12. They don’t mention how they came up with the basis functions for the controller?
  13. Definitely not grokking the math right now, but it’s a neat paper.
  14. Learns swing up in about 90 seconds of interaction
  15. Mentions the problems discussed by Erez: convergence is only locally optimal, and in domains with plateaus in the policy-parameter landscape the algorithm can have trouble
    1. The algorithm also benefits from shaping
  16. When they used the same approach but threw out the uncertainty of the GP it failed
  17. Experiments here did not use expert demonstration; the first few trajectories are randomized
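A minimal numpy sketch of the per-dimension GP idea from notes 3, 7, and 11: one GP regressor with an RBF kernel per predicted state dimension, trained on (state, action) inputs, with predictive uncertainty at query points. The kernel hyperparameters and the toy dynamics below are illustrative assumptions; PILCO actually learns hyperparameters by evidence maximization and propagates the uncertainty through multi-step rollouts via moment matching.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, signal_var=1.0):
    """Squared-exponential kernel matrix between row-sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return signal_var * np.exp(-0.5 * d2 / lengthscale**2)

class GPDynamicsModel:
    """One independent GP per output dimension (note 7); the outputs still
    covary through the shared (state, action) input, as note 11 points out.
    Hyperparameters are fixed here for brevity."""

    def __init__(self, noise_var=1e-2):
        self.noise_var = noise_var

    def fit(self, X, Y):
        # X: (n, state_dim + action_dim) inputs, Y: (n, state_dim) targets
        self.X = X
        K = rbf_kernel(X, X) + self.noise_var * np.eye(len(X))
        self.K_inv = np.linalg.inv(K)
        self.alphas = self.K_inv @ Y  # one weight column per output dimension

    def predict(self, x_star):
        """Predictive mean and variance at a single query point."""
        k_star = rbf_kernel(self.X, x_star[None, :])  # (n, 1)
        mean = (k_star.T @ self.alphas).ravel()
        var = 1.0 - (k_star.T @ self.K_inv @ k_star).item() + self.noise_var
        return mean, var

# Toy usage: hypothetical linear 1-D dynamics, learned from random inputs
# (mirroring note 17: no demonstrations, just randomized initial data).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))        # columns: (position, action)
Y = 0.9 * X[:, :1] + 0.1 * X[:, 1:]         # next position
model = GPDynamicsModel()
model.fit(X, Y)
mean, var = model.predict(np.array([0.2, 0.5]))
```

The predictive variance is what the ablation in note 16 throws away; planning on the mean alone is what made the approach fail there.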

One thought on “PILCO: A Model-Based and Data-Efficient Approach to Policy Search. Deisenroth, Rasmussen. ICML 2011”

  1. mpd37 says:

    A comment to 12.: The locations of the Gaussian basis functions (and a shared width) are learned. That’s the reason why the number of policy parameters is so big.

    Correction to 14.: Swing-up is learned in <20 seconds, pretty reliably.
