Continuous Control with Deep Reinforcement Learning. Lillicrap, Hunt, Pritzel, Heess, Erez, Tassa, Silver, Wierstra. arXiv 2015

  1. Extension of deep QL to continuous actions
  2. Actor-critic, model-free
  3. Deterministic policy gradient
  4. Show the algorithm running on 20 tasks in a physics simulator
  5. A follow-up to the deterministic policy gradient (DPG) paper
  6. Uses most of the tricks from the Atari paper plus something relatively new called batch normalization
  7. Ran algorithms directly on joint angle data as well as simulated camera images
  8. Alg is called deep deterministic policy gradient
  9. They are able to use same parameters for the direct state information as well as visual data
  10. Method is simple and pretty straightforward actor-critic
  11. They compare results to a planner that has access to a generative model
  12. DDPG can sometimes outperform the planner that accesses the generative model, even in some cases when working only from the visual data
  13. DPG requires:
    1. A parameterized actor function which is a mapping from states to actions
    2. A critic, which learns the Q-function
  14. NFQCA is basically the same as DPG but uses an NN as a function approximator. The issue is that it uses batch learning, which doesn’t scale well.
    1. The minibatch version of this algorithm is equivalent to the original formulation of DPG
  15. Do “soft” updates of the target networks, which makes their weights change more slowly but helps prevent divergence
    1. “This simple change moves the relatively unstable problem of learning the action-value function closer to the case of supervised learning, a problem for which robust solutions exist.”
    2. Did this both for the policy and Q
  16. Batch normalization helps deal with the issue that different parts of the state vector may have different scales and meanings
    1. <From what I can tell here, this looks like it basically does minibatch whitening of the data, is it really such a new idea?  Need to check the paper where it is introduced.>
  17. They just add noise to the actor's output in order to do exploration; the paper uses a temporally correlated Ornstein-Uhlenbeck process rather than independent Gaussian noise
  18. Most of the problems look like they came from MuJoCo, some in 2D and some in 3D, but they also did racing in TORCS
  19. Similar to the Atari papers, they use the last 3 frames of data to represent state
  20. Visual data is downsampled to 64×64 pixels, 8-bit RGB
  21. “Surprisingly, in some simpler tasks, learning policies from pixels is just as fast as learning using the low-dimensional state descriptor. This may be due to the action repeats making the problem simpler. It may also be that the convolutional layers provide an easily separable representation of state space, which is straightforward for the higher layers to learn on quickly.”
  22. The planner they compare against is iLQG which <I think> is a locally-optimal controller
    1. It needs not only the model but also its derivatives
  23. “The original DPG paper evaluated the algorithm with toy problems using tile-coding and linear function approximators. It demonstrated data efficiency advantages for off-policy DPG over both on- and off-policy stochastic actor critic. It also solved one more challenging task in which a multijointed octopus arm had to strike a target with any part of the limb. However, that paper did not demonstrate scaling the approach to large, high-dimensional observation spaces as we have here.”
  24. “It has often been assumed that standard policy search methods such as those explored in the present work are simply too fragile to scale to difficult problems [17]. Standard policy search is thought to be difficult because it deals simultaneously with complex environmental dynamics and a complex policy. Indeed, most past work with actor-critic and policy optimization approaches have had difficulty scaling up to more challenging problems [18]. Typically, this is due to instability in learning wherein progress on a problem is either destroyed by subsequent learning updates, or else learning is too slow to be practical.”
  25. Similar to guided policy search <?>
  26. Looks like Q estimates are close to the returns that the policies generate
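
The “soft” update from note 15 can be sketched in a few lines. This is only an illustration, not the authors' code: the function name soft_update and the flat weight lists are made up here, and real networks would hold tensors rather than Python floats. The rule itself is target ← τ·online + (1 − τ)·target with a small τ; the paper reports τ = 0.001.

```python
def soft_update(target, online, tau):
    """Return target weights moved a fraction tau toward the online weights.

    Implements target <- tau * online + (1 - tau) * target, element-wise.
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must be in [0, 1]")
    if len(target) != len(online):
        raise ValueError("weight vectors must have the same length")
    return [(1.0 - tau) * t + tau * w for t, w in zip(target, online)]


# Hypothetical flat weight vectors standing in for network parameters.
online_weights = [1.0, -2.0, 0.5]
target_weights = [0.0, 0.0, 0.0]

# With the online weights held fixed, the target drifts toward them slowly:
# after n steps it has covered a fraction 1 - (1 - tau)**n of the gap.
for _ in range(1000):
    target_weights = soft_update(target_weights, online_weights, tau=0.001)
```

With τ = 0.001 the target has closed only about 63% of the gap after 1000 updates, which is the slow-moving regression target that the quote in note 15 is describing. The same rule would be applied to both the actor's and the critic's target copies.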
