Policy Search for Motor Primitives in Robotics. Kober, Peters. Machine Learning 2011.

  1. Cited from PATH INTEGRAL POLICY IMPROVEMENT WITH COVARIANCE MATRIX ADAPTATION as an example of continuous RL with policy search
  2. Says that these methods have seen their greatest success in large domains where a teacher starts the learner close to the solution and the algorithm then completes learning
  3. This is the PoWER (Policy learning by Weighting Exploration with the Returns) paper
  4. Here they rely on special kinds of parameterized policies that are well suited for robotics problems
    1. These methods generally require some expert assistance.
  5. They use this for swing-up and ball-in-a-cup on a real robot arm.
  6. Uses expert instruction here as well
  7. Since these “tasks are inherently single-stroke movements, we focus on a special class of episodic reinforcement learning”
    1. So again, they are using a lot of domain expertise to make the task learnable (not that this is a bad idea when working on a real robot)
  8. The algorithm is EM-based
  9. The math for the algorithm is based on optimizing a lower bound of performance.  They say that this generally is not the way things are done in RL, although it is more common in supervised learning
    1. They show that policy gradient methods can also be derived from this method of derivation, that idea seems to be introduced here
    2. “We show that natural policy gradients can be seen as an additional constraint regularizing the change in the path distribution resulting from a policy update when improving the policy incrementally”
  10. While gradient-based methods require a learning rate to be provided (which can be very troublesome to get correct), EM algorithms do not require a learning rate and converge more quickly
  11. In general during policy search, exploration is performed by corrupting the action applied at each time step by some Gaussian noise.  They note, however, that this ultimately can lead to too much change from the true policy, “washing out” the performance of the policy constructed during evaluation.  Because of this, they propose using state and time dependent noise, so the exploration can be more precisely controlled
  12. The construct they use for policy representation is well known for robotic arms and results in a policy that is linear in its parameters
  13. To compare against other algorithms, they use a couple of simulations based on a reaching task with a robotic arm
  14. They claim that “kinesthetic teach-in” is required because the test domains are so large, but swing-up is quite small (the other task, however, is difficult enough that I could see it being very tough), so I don’t buy that explanation.
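To make items 11 and 12 concrete, here is a minimal sketch of a policy that is linear in its parameters (Gaussian basis features over a 1-D state, a stand-in for the motor-primitive representation in the paper) and of exploration done by perturbing the parameters rather than the per-timestep action. All names (`rbf_features`, `linear_policy`, the widths and centers) are illustrative, not from the paper.

```python
import numpy as np

def rbf_features(s, centers, width=0.5):
    # Gaussian basis functions evaluated at a scalar state s
    return np.exp(-((s - centers) ** 2) / (2 * width ** 2))

def linear_policy(s, theta, centers):
    # Policy is linear in theta: a(s) = phi(s) . theta
    return rbf_features(s, centers) @ theta

rng = np.random.default_rng(1)
centers = np.linspace(0.0, 1.0, 5)
theta = np.zeros(5)

# Parameter-space exploration: the perturbation eps enters through the
# features, so the injected noise is automatically state-dependent and
# held fixed over a rollout instead of corrupting every action i.i.d.
eps = 0.1 * rng.standard_normal(5)
noisy_action = linear_policy(0.3, theta + eps, centers)
```

The point of the design is that a single draw of `eps` gives a smooth, consistent exploratory policy for a whole episode, rather than the independent per-step action noise that the authors argue "washes out" during evaluation.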
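The EM flavor of the update (items 8 and 10) can be sketched as a reward-weighted average of parameter perturbations: better-scoring perturbations pull the mean toward themselves, and no learning rate appears. This is a simplified, episode-level caricature of the PoWER idea, not the paper's exact update (which weights by state-action values along the trajectory); `rollout` is a hypothetical function returning an episodic return.

```python
import numpy as np

rng = np.random.default_rng(0)

def power_update(theta, rollout, sigma=0.1, n_episodes=20):
    """One EM-style iteration: sample perturbations, weight by return.

    Simplified sketch assuming a single scalar return per episode.
    """
    eps = sigma * rng.standard_normal((n_episodes, theta.size))
    returns = np.array([rollout(theta + e) for e in eps])
    # Shift weights to be non-negative, then take the reward-weighted
    # mean of the perturbations -- note there is no step-size parameter.
    w = returns - returns.min()
    return theta + (w @ eps) / (w.sum() + 1e-12)
```

On a toy quadratic objective this contracts toward the optimum without any learning-rate tuning, which is the contrast with gradient methods the notes highlight.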
