Human-level control through deep reinforcement learning. A billion authors. Nature 2015

Yet another version of this paper

  1. animals are able to learn to act by combining RL with hierarchical perception
  2. RL has generally only been effective in settings that are either low-D or require handcrafted representations
  3. Train a deep Q-network
  4. Reached a level of a professional human game tester in 49 games, with no change to hyperparameters
  5. They mention that value function divergence with an NN is problematic, but mitigate the issue by using experience replay to spread out samples so it doesn't overfit to recent data, as well as by updating the values used as targets only occasionally
  6. Used SGD to train
  7. The actual score achieved rises much more slowly (pretty linear over the 200 training epochs) than the believed action value (which rises sharply and then plateaus by about 40 epochs).
  8. Plot embeddings of last layer with t-SNE – similar states (from visual level) are nearby, as are states of similar believed value
  9. In some cases it is able to learn non-myopic behavior (such as building a route through the blocks to the top in Breakout), but in other games like Pac-Man or Montezuma's Revenge it isn’t able to do much of anything (Montezuma's Revenge seems to be legitimately a very difficult game to learn, but Pac-Man doesn’t seem bad at all)
  10. “Notably, the successful integration of reinforcement learning with deep network architectures was critically dependent on our incorporation of a replay algorithm involving the storage and representation of recently experienced transitions. Convergent evidence suggests that the hippocampus may support the physical realization of such a process in the mammalian brain, with the time-compressed reactivation of recently experienced trajectories during offline periods (for example, waking rest) providing a putative mechanism by which value functions may be efficiently updated through interactions with the basal ganglia”
  11. Network is actually not terribly deep <4 layers?>
  12. Network only sees rewards as -1, 0, or 1, which helps keep the derivative of the error in check, but means the agent can’t differentiate between different levels of goodness or badness
  13. epsilon greedy exploration with eps starting at 1 and reaching a minimum of 0.1
  14. Train for 50 million frames, equivalent to about 38 days of game playing. Experience replay holds 1 million frames
  15. New actions are selected only every 4 frames
  16. Evaluation was done with eps = 0.05: “This procedure is adopted to minimize the possibility of overfitting during evaluation.”
  17. Use experience replay because “Second, learning directly from consecutive samples is inefficient, owing to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates. Third, when learning on-policy the current parameters determine the next data sample that the parameters are trained on. For example, if the maximizing action is to move left then the training samples will be dominated by samples from the left-hand side; if the maximizing action then switches to the right then the training distribution will also switch. It is easy to see how unwanted feedback loops may arise and the parameters could get stuck in a poor local minimum, or even diverge catastrophically”
  18. They mention that uniformly sampling from history for experience replay is probably not efficient, and that something that prioritizes samples (akin to prioritized sweeping) is probably a better idea
  19. They also do something with cloning networks; it looks like they actively update one network, but the error it is trained on is generated from a cloned network that is only updated periodically <?>
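
To make items 5, 17, and 19 concrete, here is a minimal sketch of uniform experience replay plus a periodically synced target. It is not the paper's code: to stay self-contained it assumes a tabular Q (a NumPy array indexed by integer state and action) instead of a convolutional network, and the names `ReplayBuffer`, `td_targets`, `train_step`, and `sync_target` are made up for illustration.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks the correlation between consecutive frames.
        batch = rng.sample(list(self.buffer), batch_size)
        s, a, r, s_next, done = zip(*batch)
        return (np.array(s), np.array(a), np.array(r, dtype=np.float64),
                np.array(s_next), np.array(done, dtype=bool))

    def __len__(self):
        return len(self.buffer)


def td_targets(q_target, r, s_next, done, gamma=0.99):
    """y = r + gamma * max_a' Q_target(s', a'); no bootstrap at terminal states."""
    next_q = q_target[s_next]  # shape (batch, n_actions)
    return r + gamma * next_q.max(axis=1) * (~done)


def train_step(q_online, q_target, batch, lr=0.1, gamma=0.99):
    """Move the online estimates toward targets computed from the frozen copy."""
    s, a, r, s_next, done = batch
    td_error = td_targets(q_target, r, s_next, done, gamma) - q_online[s, a]
    np.add.at(q_online, (s, a), lr * td_error)  # accumulates repeated (s, a) pairs
    return td_error


def sync_target(q_online, q_target):
    """Clone the online estimates into the target; call this only every C steps."""
    q_target[...] = q_online
```

The point of the split is that `train_step` changes `q_online` on every call, while the targets it chases only move when `sync_target` is called, which is the "only updated periodically" part of item 19 as I understand it.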
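
Items 12, 13, and 16 are simple enough to sketch as well. The following is a rough illustration, not the paper's implementation: the function names are hypothetical, the linear shape of the schedule and the `anneal_steps` default are assumptions on my part (the notes above only record the start and minimum values of eps), and actions are chosen from a plain list of Q-values.

```python
import random


def clip_reward(r):
    """Map any raw game score change to -1, 0, or 1 (its sign)."""
    return (r > 0) - (r < 0)


def epsilon_at(step, eps_start=1.0, eps_end=0.1, anneal_steps=1_000_000):
    """Linearly anneal eps from eps_start to eps_end, then hold it there.

    anneal_steps is an assumed parameter, not taken from the notes above.
    """
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)


def select_action(q_values, eps, rng=random):
    """Epsilon-greedy: random action with probability eps, else the argmax."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

During training one would call `select_action(q, epsilon_at(step))`; for evaluation as in item 16, `select_action(q, 0.05)`. Note how `clip_reward` throws away magnitude: a 10-point and a 500-point pickup both become 1, which is exactly the limitation item 12 points out.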
