PILCO: A Model-Based and Data-Efficient Approach to Policy Search. Deisenroth, Rasmussen. ICML 2011

The paper uses model building along with policy search to get better sample complexity

Discusses the problem of model bias. I think this is a problem only in cases where you aren't doing smart exploration, though. The papers they reference for that are from the mid-to-late '90s

Anyway, they propose using a probabilistic model of the dynamics; they use non-parametric Gaussian processes for this

The uncertainty is also incorporated into planning and policy evaluation

PILCO stands for Probabilistic Inference for Learning Control

Some other references for dealing with model uncertainty are also mentioned

They build a separate model for each dimension (of the next state being predicted)
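A minimal numpy sketch of that per-dimension setup (my own toy dynamics and fixed kernel hyperparameters for illustration; the paper learns hyperparameters by evidence maximization, which is omitted here):

```python
import numpy as np

def rbf(A, B, ls=1.0, sf=1.0):
    # Squared-exponential kernel between rows of A and rows of B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf ** 2 * np.exp(-0.5 * d2 / ls ** 2)

class PerDimGP:
    """One independent GP per predicted state dimension.

    A sketch of the idea only: solving K^{-1} Y once gives one weight
    column (alpha) per output dimension, i.e. a separate GP posterior
    mean for each dimension of the next state."""
    def __init__(self, X, Y, noise=1e-2):
        self.X = X
        K = rbf(X, X) + noise * np.eye(len(X))
        self.alpha = np.linalg.solve(K, Y)  # shape (N, output_dims)

    def predict_mean(self, Xs):
        return rbf(Xs, self.X) @ self.alpha

# Toy "dynamics": inputs are (state, action), outputs are 2 state dims
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(50, 2))
Y = np.stack([np.sin(X[:, 0]) + 0.1 * X[:, 1],
              np.cos(X[:, 1])], axis=1)
gp = PerDimGP(X, Y)
pred = gp.predict_mean(X[:5])  # close to Y[:5] on training inputs
```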

Does policy gradients based on the stochastic model

The gradient is computed analytically, which is much better than estimating it by sampling
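What makes this possible is that PILCO's saturating cost has a closed-form expectation under a Gaussian state distribution, so the expected cost and its gradient can be written down exactly. A 1-D sketch (the paper works with full multivariate Gaussians; the target `t` and width `a` here are just illustrative values), checking the closed form against a Monte Carlo estimate:

```python
import numpy as np

def expected_sat_cost(m, s2, t=0.0, a=1.0):
    """Closed-form E[1 - exp(-(x-t)^2 / (2 a^2))] for x ~ N(m, s2),
    plus the analytic gradient of that expectation w.r.t. the mean m."""
    w = a ** 2 + s2                       # widths add for Gaussian integrals
    e = (a / np.sqrt(w)) * np.exp(-(m - t) ** 2 / (2 * w))
    cost = 1.0 - e
    dcost_dm = e * (m - t) / w            # exact, no sampling needed
    return cost, dcost_dm

# Monte Carlo sanity check of the closed form
rng = np.random.default_rng(0)
m, s2 = 0.7, 0.3
x = rng.normal(m, np.sqrt(s2), size=200_000)
mc = float(np.mean(1.0 - np.exp(-x ** 2 / 2.0)))
cf, grad = expected_sat_cost(m, s2)
print(cf, mc)  # the two estimates should agree closely
```

The same trick applied at every step of a rollout is what lets the policy gradient be propagated analytically instead of being estimated from noisy sampled trajectories.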

It's very surprising this actually works on real robots

(It's a little strange that they say the data they report is “typical,” not best or worst case; they should just present all the data and let the reader decide)

Although the models are separate for each dimension, they covary. PILCO also takes the correlation of all control and state dimensions into account during planning and control

They don't mention how they came up with the basis functions for the controller?

Definitely not grokking the math now, but it's a neat paper.

Learns swing up in about 90 seconds of interaction

Mentions problems also discussed by Erez: convergence is only locally optimal, and in domains with plateaus in the parameter/policy space the algorithm can have trouble

The algorithm also benefits from shaping

When they used the same approach but threw out the uncertainty of the GP, it failed

Experiments here did not use expert demonstration; the first few trajectories are randomized

One thought on “PILCO: A Model-Based and Data-Efficient Approach to Policy Search. Deisenroth, Rasmussen. ICML 2011”

A comment to 12.: The locations of the Gaussian basis functions (and a shared width) are learned. That’s the reason why the number of policy parameters is so big.

Correction to 14.: Swing-up is learned in <20 seconds, pretty reliably.
