- Designed for use in continuous action, high dimensional spaces
- Math is based on control theory
- Claims to be simple, with low risk of numerical divergence (it performs no matrix inversions or gradient estimations), and there are no learning rates either
- Empirically works better than gradient-based methods; there are empirical results on the LittleDog robot
- Mentions limitations of value-based, rollout, and approximate policy iteration
- In particular, says rollout methods have too many tuning parameters

- “This approach make[s] an appealing theoretical connection between value function approximation using the stochastic HJB equations and direct policy learning by approximating the path integral, i.e., by solving a statistical inference problem from sample rollouts.”
- The method here has no tuning params aside from exploration noise
- Looks like you still need to know a lot about the structure of the problem, and the system still needs to be linear
- The algorithm also requires a noiseless rollout
- “In our case we use a special case of parameterized policies in the form of Dynamic Movement Primitives… Essentially, these policies code a learnable point attractor for a movement from y_t_0 to the goal g… The DMP equations are obviously of the form of our control system, just with a row vector as control transition matrix”
- Compares against other policy-search methods that perform gradient computations.
- Also mentions PoWER, which is not a gradient method, and is based on EM, but since it requires a special property for the reward function, it was not applicable to this setting
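To make the “statistical inference from sample rollouts” idea concrete, here is my rough sketch of the core PI^2-style parameter update (the function name, the simplified flat-parameter setup, and the toy numbers are mine, not the paper's): exponentiate the negative cost-to-go of each noisy rollout, normalize into probabilities, and take the probability-weighted average of the exploration noise as the update.

```python
import numpy as np

def pi2_update(theta, rollout_costs, epsilons, lam=1.0):
    """One simplified PI^2-style update (a sketch, not the paper's full algorithm).

    theta:         current policy parameters, shape (d,)
    rollout_costs: total cost S_i of each of K noisy rollouts, shape (K,)
    epsilons:      exploration noise used in each rollout, shape (K, d)
    lam:           temperature; in the paper this is tied to the noise level
    """
    S = np.asarray(rollout_costs, dtype=float)
    # Softmax over negative costs: low-cost rollouts get high probability.
    # Subtracting min(S) is a standard numerical-stability trick.
    w = np.exp(-(S - S.min()) / lam)
    P = w / w.sum()
    # Probability-weighted average of the exploration noise.
    delta_theta = P @ np.asarray(epsilons, dtype=float)
    return theta + delta_theta

# Toy usage: the low-cost rollout (cost 1.0, noise along dim 1) dominates.
theta_new = pi2_update(np.zeros(2),
                       [10.0, 1.0, 5.0],
                       np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]))
```

No gradients, no matrix inversions, and no learning rate appear anywhere, which matches the paper's claims about why divergence is unlikely.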

- I am not familiar with most of the comparison algorithms aside from REINFORCE. It is, however, strange that the other two comparison algorithms are basically as bad as or worse than REINFORCE, since REINFORCE is commonly considered an *extremely* inefficient algorithm. Comparing to REINFORCE is like comparing to Q-Learning. So while PI^2 does well relative to these (seemingly by an order of magnitude), it makes me wonder how much effort was spent tuning the other algorithms.
- The paper says PI^2 didn’t need parameter tuning while the others did for each domain, so I can be sympathetic if they didn’t tune the others especially well
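For context on why REINFORCE is the weak baseline here: it estimates the policy gradient with the score-function trick, which is unbiased but high-variance, so it needs many rollouts. A toy illustration (the 1-D Gaussian-policy problem, reward, and constants are all my own, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_gradient(theta, n_samples=1000, sigma=1.0):
    """Score-function (REINFORCE) gradient estimate for a toy 1-D problem.

    Policy: a ~ N(theta, sigma^2); reward R(a) = -(a - 3)^2.
    Estimate: mean over samples of R(a) * d/dtheta log pi(a | theta)
            = mean of R(a) * (a - theta) / sigma^2.
    Unbiased, but the variance grows with |R|, hence the inefficiency.
    """
    a = rng.normal(theta, sigma, size=n_samples)
    R = -(a - 3.0) ** 2
    score = (a - theta) / sigma**2
    return np.mean(R * score)

# Plain gradient ascent: theta drifts toward the reward-maximizing mean (3.0),
# but note it burns 1000 samples per step to keep the noise tolerable.
theta = 0.0
for _ in range(200):
    theta += 0.05 * reinforce_gradient(theta)
```

Compare this to the PI^2 update, which reuses the same rollouts without ever forming a gradient estimate.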

- The test domains are, however, extremely high dimensional (50-DOF)
- In the LittleDog experiment they did learning from demonstration
- Says each degree of freedom in the LittleDog experiment was represented by a DMP with 50 basis functions
- I don’t know what a DMP is.
- I guess the basis functions are what the linear parameterization is over?
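To make my basis-function guess concrete: in a standard Ijspeert-style DMP (my reconstruction of the textbook formulation, not code from this paper), the forcing term is a normalized weighted sum of Gaussian basis functions over a phase variable, and the learnable parameters enter *linearly* as the weights — which would be exactly the linear parameterization the path-integral derivation needs.

```python
import numpy as np

def dmp_rollout(w, y0=0.0, g=1.0, tau=1.0, n_basis=10, dt=0.01, T=1.0,
                alpha=25.0, beta=6.25, alpha_x=8.0):
    """Roll out a 1-D Ijspeert-style DMP: a point attractor from y0 to goal g,
    shaped by a forcing term that is linear in the weights w (the learnable
    parameters). All constants are common textbook defaults, not the paper's."""
    c = np.exp(-alpha_x * np.linspace(0, 1, n_basis))     # basis centers in phase
    h = 1.0 / np.diff(c, append=c[-1] * 0.5) ** 2         # basis widths (heuristic)
    y, yd, x = y0, 0.0, 1.0                               # position, velocity, phase
    traj = []
    for _ in range(int(T / dt)):
        psi = np.exp(-h * (x - c) ** 2)                   # Gaussian basis activations
        f = (psi @ w) / (psi.sum() + 1e-10) * x * (g - y0)  # forcing term, linear in w
        ydd = alpha * (beta * (g - y) - yd) + f           # point-attractor dynamics
        yd += ydd * dt / tau
        y += yd * dt / tau
        x += -alpha_x * x * dt / tau                      # canonical (phase) system
        traj.append(y)
    return np.array(traj)

# With zero weights the forcing term vanishes and the DMP is a pure
# point attractor, converging to the goal g = 1.
traj = dmp_rollout(np.zeros(10))
```

So "50 basis functions per DOF" would mean 50 weights like `w` above for each joint, and learning adjusts only those weights.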

- This paper doesn’t introduce the path integral approach, but it presents a more general approach than the original algorithm. It would be interesting to see how different this is from the original.
- I don’t grok what the algorithm is doing yet; I would probably need to invest some time to figure out what is going on. Maybe Videolectures will help.

What’s the title of the paper?!

Sorry, I started posting unfinished notes on things and then re-editing later because WordPress kept deleting my drafts. It’s updated now.