Path Integral Policy Improvement with Covariance Matrix Adaptation. Stulp, Sigaud. ICML 2012


  1. This looks right up my alley.  Discusses policy search methods.  Compares PI^2 to Cross-Entropy and CMAES.
  2. Then proposes a new algorithm, PI^2-CMA, whose main advantage is that it determines the magnitude of the exploration noise automatically
  3. 2 continuous RL papers they cite are:
    1. Policy Search for Motor Primitives in Robotics.  Kober, Peters.  Machine Learning 2011.
    2. A Generalized Path Integral Control Approach to Reinforcement Learning.  Theodorou, Buchli, Schaal.  JMLR 2010.
  4. Says one of the best current algorithms is Policy Improvement with Path Integrals (PI^2)
    1. The claim is that it outperforms REINFORCE and Natural Actor Critic by an order of magnitude in terms of speed and quality
    2. But at least REINFORCE is total garbage so that is not an impressive claim to be able to make
  5. PI^2 is different from policy gradient methods in that it uses probability-weighted averaging to do the parameter update instead of an estimate of the gradient.
    1. CMAES (Covariance Matrix Adaptation Evolution Strategy) and CEM (the Cross-Entropy Method) also use this basic idea
  6. Although all 3 algorithms share the same basic update rule, each arrived at it from different principles
  7. By reinterpreting CEM as performing probability-weighted averaging, it can be shown that CEM is a special case of CMAES
  8. For the Cross-Entropy Method, a particular form of probability-weighted averaging is performed, where the “good” samples get probability 1/k (if there are k of them), and the “bad” samples get probability 0 (a minimal sketch of this update appears after these notes).
  9. CEM for policy optimization was introduced in:
    1. The Cross-Entropy Method for Fast Policy Search.  Mannor, Rubinstein, Gat.  ICML 2003.  This was primarily for finite MDPs; they mention the extension to continuous spaces
  10. Cross-Entropy in continuous spaces was more thoroughly introduced in:
    1. Cross-Entropy Optimization of Control Policies with Adaptive Basis Functions.  Busoniu, Ernst, De Schutter, Babuska.  IEEE Transactions on Systems, Man, and Cybernetics.  2011.
  11. CEM has also been used with sampling-based motion planning:
    1. Learning Cost-efficient Control Policies with XCSF: Generalization Capabilities and Further Improvement.  Marin, Decock, Rigoux, Sigaud.  Genetic and Evolutionary Computation Conference (GECCO)
  12. CMAES is very similar to CEM, except it uses a more sophisticated method to update the covariance matrix
    1. There is also a separate step-size variable that scales the covariance matrix and therefore controls the magnitude of exploration
    2. It also keeps track of the history of parameter changes and uses this “evolution path” to help speed convergence by exploiting correlations between generations (a simplified sketch of these pieces appears after these notes)
  13. For a description of CMAES, see:
    1. Completely Derandomized Self-adaptation in Evolution Strategies.  Hansen, Ostermeier. Evolutionary Computation 2001.
  14. For an example of CMAES applied to double pole balancing, see:
    1. Evolution Strategies for Direct Policy Search.  Heidrich-Meisner, Igel.  Parallel Problem Solving from Nature 2008
  15. In the empirical results, they use Dynamic Movement Primitives
    1. A Generalized Path Integral Control Approach to Reinforcement Learning.  Theodorou, Buchli, Schaal.  JMLR 2010
  16. “Using parameterized policies avoids the curse of dimensionality associated with (discrete) state-action spaces, and using probability-weighted averaging avoids having to estimate a gradient, which can be difficult for noisy and discontinuous cost functions”
  17. Seems like PI^2 has high computational costs relative to something like Cross-Entropy
  18. Also discusses the PoWER algorithm, but says its performance is basically the same as PI^2’s
  19. “PI^2’s properties follow directly from first principles of stochastic optimal control.”  On the other hand, CMAES and CEM have less formal backing
  20. The evaluation task is a time-dependent 10-DOF arm, previously used in the Theodorou paper
  21. Unlike in CEM/CMAES, the covariance matrix in PI^2 does not vary; only the mean does.  This is done in order to have the derivation of PI^2 go through.
    1. But in this paper they relax that restriction and adapt the covariance as well
    2. Which is what they call PI^2-CMA, Path Integral Policy Improvement with Covariance Matrix Adaptation (a sketch of this update appears after these notes)
    3. Forming PI^2-CMAES is analogous but slightly more complex
  22. PI^2 has much slower convergence than the variants proposed in this paper
  23. They have results for how the initial parameterization impacts performance; it doesn’t seem to matter too much
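
To make note 8 concrete, here is a minimal sketch (mine, not from the paper) of CEM as probability-weighted averaging: the k “elite” samples get probability 1/k, the rest get 0, and the new mean and covariance are the weighted average and weighted scatter of the elites. evaluate_policy is a hypothetical episodic cost function over a parameter vector; everything else is plain NumPy.

```python
import numpy as np

def cem(evaluate_policy, mean, cov, n_samples=20, n_elite=5, n_iters=100):
    """Cross-Entropy Method as probability-weighted averaging (cost minimization)."""
    for _ in range(n_iters):
        # Sample parameter vectors from the current Gaussian search distribution.
        thetas = np.random.multivariate_normal(mean, cov, size=n_samples)
        costs = np.array([evaluate_policy(th) for th in thetas])
        # "Good" samples (the k elites) get probability 1/k, the rest get 0.
        elite = thetas[np.argsort(costs)[:n_elite]]
        weights = np.full(n_elite, 1.0 / n_elite)
        # Probability-weighted averaging gives the new mean and covariance.
        mean = weights @ elite
        diff = elite - mean
        cov = (weights[:, None] * diff).T @ diff
    return mean, cov
```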
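
For note 12, a heavily simplified sketch of the extra CMAES ingredients on top of that: a rank-mu covariance update from probability-weighted averaging, a separate step size sigma that scales the covariance (and so controls the exploration magnitude), and an evolution path that accumulates mean shifts across generations. The constants and the equal elite weights are my own simplifications for illustration, not Hansen & Ostermeier’s tuned defaults, and evaluate_policy is again a hypothetical cost function.

```python
import numpy as np

def cmaes_step(evaluate_policy, mean, cov, sigma, path,
               n_samples=20, n_elite=5, c_path=0.3, c_cov=0.2, c_sigma=0.3):
    """One simplified CMAES-like generation; returns the updated search state."""
    d = mean.size
    # Sample from N(mean, sigma^2 * cov); sigma separately scales exploration.
    thetas = np.random.multivariate_normal(mean, sigma ** 2 * cov, size=n_samples)
    costs = np.array([evaluate_policy(th) for th in thetas])
    elite = thetas[np.argsort(costs)[:n_elite]]
    weights = np.full(n_elite, 1.0 / n_elite)   # equal elite weights, as in CEM
    new_mean = weights @ elite
    # Evolution path: a low-pass filter over mean shifts across generations.
    path = (1 - c_path) * path + c_path * (new_mean - mean) / sigma
    # Covariance: blend the old covariance with a rank-mu term and a rank-one path term.
    diff = (elite - mean) / sigma
    rank_mu = (weights[:, None] * diff).T @ diff
    cov = (1 - c_cov) * cov + 0.5 * c_cov * rank_mu + 0.5 * c_cov * np.outer(path, path)
    # Grow sigma when the path is longer than expected for uncorrelated steps.
    sigma *= np.exp(c_sigma * (np.linalg.norm(path) / np.sqrt(d) - 1.0))
    return new_mean, cov, sigma, path
```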
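
For note 21, a minimal sketch of the PI^2-style update and the PI^2-CMA extension. Rollout costs are exponentiated into probabilities (the h parameter and the min-max cost normalization follow the usual PI^2 convention); plain PI^2 uses those probabilities only for the mean update, while PI^2-CMA reuses them to update the covariance, which is how the exploration magnitude gets determined automatically. This is a single-update sketch under my own simplifications; the actual algorithm applies it per time step and then averages over the trajectory.

```python
import numpy as np

def pi2_cma_update(thetas, costs, mean, h=10.0):
    """thetas: (K, d) sampled parameter vectors; costs: (K,) rollout costs."""
    # Exponentiate normalized costs: low-cost rollouts get high probability.
    s = (costs - costs.min()) / (costs.max() - costs.min() + 1e-12)
    p = np.exp(-h * s)
    p /= p.sum()
    # Plain PI^2 stops here: probability-weighted mean, covariance left fixed.
    new_mean = p @ thetas
    # PI^2-CMA: reuse the same probabilities to adapt the covariance around the
    # old mean, so exploration shrinks or grows without hand tuning.
    diff = thetas - mean
    new_cov = (p[:, None] * diff).T @ diff
    return new_mean, new_cov
```

Swapping these softmax-style probabilities in for the 1/k elite weights in the CEM loop above gives the basic PI^2-CMA parameter update.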