- The paper combines model learning with policy search to get better sample complexity
- Discusses the problem of model bias. I think this is only a problem when you aren't doing smart exploration, though; the papers referenced for that are from the mid-to-late 90s
- Anyway, they propose using a probabilistic model of the dynamics; they use non-parametric Gaussian processes for this
- The uncertainty is also incorporated into planning and policy evaluation
- PILCO stands for probabilistic inference for learning control
- There are some other references for how to deal with model uncertainty that are mentioned
- They build a separate model for each dimension (of the next state being predicted)
- Does policy gradient based on the probabilistic model
- The gradient is estimated analytically, which is much better than estimating it by sampling
- It's very surprising this actually works on real robots
- (It's a little strange that they describe the reported data as "typical", not best or worst case – they should just present all the data and let the reader decide)
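As a rough sketch of the one-GP-per-output-dimension idea: scikit-learn's `GaussianProcessRegressor` stands in for the paper's own GP machinery here, and all data/shapes are made up for illustration. The key point is that each GP gives a predictive mean and a standard deviation, and the latter is the model uncertainty PILCO keeps around.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Toy transition data: inputs are (state, action) pairs, targets are next states.
X = rng.normal(size=(50, 3))                              # 2 state dims + 1 action dim
Y = np.column_stack([np.sin(X[:, 0]), np.cos(X[:, 1])])   # 2 next-state dims

# One independent GP per predicted state dimension.
models = []
for d in range(Y.shape[1]):
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    gp.fit(X, Y[:, d])
    models.append(gp)

# Each GP returns a predictive mean AND a standard deviation -- the
# model uncertainty that gets propagated through planning.
x_query = rng.normal(size=(1, 3))
means, stds = zip(*(gp.predict(x_query, return_std=True) for gp in models))
```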

- Although the models are separate for each dimension, they covary. PILCO also takes the correlation of all control and state dimensions into account during planning and control
- They don't mention how they came up with the basis functions for the controller?
- Definitely not grokking the math right now, but it's a neat paper.
- Learns swing up in about 90 seconds of interaction
- Mentions problems discussed by Erez: convergence is only locally optimal, and the algorithm can have trouble in domains with plateaus in the policy-parameter landscape
- The algorithm also benefits from shaping
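On the point about keeping the correlations of all state and control dimensions: PILCO does this via moment matching through the GP posterior, which I won't attempt here. A linear-Gaussian toy version (linear policy u = Kx, linear dynamics x' = Ax + Bu, all matrices invented for illustration) at least shows what "keeping the correlation" buys, since the closed-loop map carries the state–control correlation automatically.

```python
import numpy as np

# Toy linear-Gaussian stand-in for PILCO's uncertainty propagation.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])            # dynamics
B = np.array([[0.0],
              [0.1]])                 # control input matrix
K = np.array([[-0.5, -0.2]])          # linear state-feedback policy u = K x

mu = np.array([1.0, 0.0])             # current state mean
Sigma = 0.1 * np.eye(2)               # current state covariance

# Closed-loop map x' = (A + B K) x keeps the state-control correlation;
# treating u as independent noise would lose the cross terms.
M = A + B @ K
mu_next = M @ mu
Sigma_next = M @ Sigma @ M.T
```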

- When they used the same approach but threw out the GP's uncertainty, it failed
- Experiments here did not use expert demonstration; the first few trajectories are randomized
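A scalar toy version of the "analytic gradient beats sampled gradient" point (my own example, not the paper's moment-matching computation): if the predicted state is x ~ N(θ, σ²) and the cost is c(x) = x², then E[c] = θ² + σ², so the gradient w.r.t. the policy parameter θ is exactly 2θ. Note the σ² term is also why dropping the uncertainty changes the objective being optimized.

```python
import numpy as np

rng = np.random.default_rng(0)

theta, sigma = 1.5, 0.5   # policy parameter -> predicted-state mean; model std

# Expected cost of x ~ N(theta, sigma^2) under c(x) = x^2 is theta^2 + sigma^2,
# so the exact analytic gradient w.r.t. theta is 2 * theta.
grad_analytic = 2.0 * theta

# Sampling-based (reparameterized) estimate of the same gradient: noisy,
# and only approaches the analytic value for many samples.
eps = rng.normal(size=10_000)
grad_sampled = np.mean(2.0 * (theta + sigma * eps))
```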

A comment on 12.: The locations of the Gaussian basis functions (and a shared width) are learned. That's the reason why the number of policy parameters is so big.
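A minimal sketch of such an RBF-network controller, just to make the parameter count concrete (names and shapes are illustrative, not the paper's): the centers, the shared width, and the weights are all policy parameters.

```python
import numpy as np

# RBF-network controller: pi(x) = sum_i w_i * exp(-||x - c_i||^2 / (2 * width^2)).
def rbf_policy(x, centers, log_width, weights):
    # x: (state_dim,), centers: (n_basis, state_dim), weights: (n_basis,)
    sq_dist = np.sum((centers - x) ** 2, axis=1)
    phi = np.exp(-0.5 * sq_dist / np.exp(2.0 * log_width))
    return weights @ phi

n_basis, state_dim = 10, 4
rng = np.random.default_rng(0)
params = {
    "centers": rng.normal(size=(n_basis, state_dim)),  # learned locations
    "log_width": 0.0,                                  # learned shared width
    "weights": rng.normal(size=n_basis),               # learned weights
}
u = rbf_policy(np.zeros(state_dim), **params)

# Learned parameters: n_basis*state_dim centers + n_basis weights + 1 width.
n_params = n_basis * state_dim + n_basis + 1
```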

Correction to 14.: Swing-up is learned in <20 seconds, pretty reliably.