<I’m actually reading the arXiv version>

- Use deep learning to learn RL policies directly from pixel input [210 x 160 RGB, 60 Hz, although they downsample to 84 x 84 greyscale], reward and terminal state information
- “The model is a convolutional neural network, trained with a variant of Q-learning…”
- The form of QL uses stochastic gradient descent to update weights
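As a rough illustration of Q-learning with stochastic gradient updates, here is a minimal sketch that uses a linear Q-function in place of the paper's convolutional network. The function name `q_learning_sgd_step`, the linear form `Q(s, .) = W @ s`, and the default `gamma` and `lr` values are assumptions made for this example, not details from the paper.

```python
import numpy as np

def q_learning_sgd_step(W, s, a, r, s_next, terminal, gamma=0.99, lr=0.01):
    """One stochastic-gradient Q-learning step for a linear Q(s, .) = W @ s.

    W: (n_actions, n_features) weights (hypothetical stand-in for network weights).
    s, s_next: (n_features,) feature vectors; a: action index; r: scalar reward.
    Returns the updated weights and the TD error.
    """
    q_sa = W[a] @ s
    # Bootstrapped target; terminal transitions do not bootstrap.
    target = r if terminal else r + gamma * np.max(W @ s_next)
    td_error = target - q_sa
    W = W.copy()
    # Gradient step on 0.5 * (target - Q(s, a))^2, treating the target as fixed.
    W[a] += lr * td_error * s
    return W, td_error
```

Only the row of `W` for the action actually taken changes, which mirrors the way only the chosen action's output contributes to the loss.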

- Apply the algorithm to 7 games included in the benchmark “Arcade Learning Environment.” “…it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.” This is with no algorithm tuning between games
- “Most successful RL applications that operate on these [not referring to Atari] domains have relied on hand-crafted features combined with linear value functions or policy representations. Clearly, the performance of such systems heavily relies on the quality of the feature representation.”
- Deep learning is supposed to solve the problem of feature engineering by finding features automatically

- The application of deep learning to RL is nontrivial because “… most successful deep learning applications to date have required large amounts of hand-labelled training data. RL algorithms, on the other hand, must be able to learn from a scalar reward signal that is frequently sparse, noisy, and delayed [in atari the delay can be on the order of thousands of timesteps].”
- “Furthermore, in RL the data distribution changes as the algorithm learns new behaviours, which can be problematic for deep learning methods that assume a fixed underlying distribution… To alleviate the problems of correlated data and non-stationary distributions, we use an experience replay mechanism […] which randomly samples previous transitions, and thereby smooths the training distribution over many past behaviors.”
- In order to combat POMDPness of the domain they represent the state as a high-order (order 4) MDP
- Dealing with impure policies
- Off-policy
- Epsilon-greedy exploration<?>
- TD-Gammon was an early success but didn’t lead to other big success stories, which dampened enthusiasm for ANNs and VFAs
- Also there is all the research that deals with the many cases where VFA diverges, so a fair amount of research went to VFAs with convergence guarantees
- Most similar previous work is neural fitted Q-learning “However, it [NFQ] uses a batch update that has a computational cost per iteration that is proportional to the size of the data set, whereas we consider stochastic gradient updates that have a low constant cost per iteration and scale to large data-sets.”
**Also, previous work on NFQ used the method on the pure visual input as an autoencoder, which generates features independent of the value function. The approach here generates representations with consideration of the value function.**
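The 84 x 84 greyscale input and the order-4 state mentioned in these notes can be sketched as follows. This is only an illustration: the luminance weights, the nearest-neighbour resize, and the names `to_grey_84` and `FrameStack` are assumptions for this example and are not claimed to match the paper's exact preprocessing.

```python
import numpy as np
from collections import deque

def to_grey_84(frame_rgb):
    """Map a (210, 160, 3) uint8 RGB frame to an (84, 84) greyscale array in [0, 1]."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)  # assumed luminance weights
    grey = frame_rgb.astype(np.float32) @ weights
    # Nearest-neighbour downsample: pick 84 evenly spaced rows and columns.
    rows = np.linspace(0, grey.shape[0] - 1, 84).astype(int)
    cols = np.linspace(0, grey.shape[1] - 1, 84).astype(int)
    return grey[np.ix_(rows, cols)] / 255.0

class FrameStack:
    """Keeps the k most recent preprocessed frames as the agent's state."""

    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def reset(self, frame_rgb):
        # Fill the history with copies of the first frame of the episode.
        first = to_grey_84(frame_rgb)
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first)
        return self.state()

    def push(self, frame_rgb):
        # The oldest frame drops out automatically because of maxlen.
        self.frames.append(to_grey_84(frame_rgb))
        return self.state()

    def state(self):
        return np.stack(list(self.frames), axis=0)  # shape (k, 84, 84)
```

Stacking the last four frames lets the network see motion (for example, ball direction) that a single frame cannot convey.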

- “Recent breakthroughs in computer vision and speech recognition have relied on efficiently training deep neural networks on very large training sets. The most successful approaches are trained directly from the raw inputs, using lightweight feature updates based on stochastic gradient descent. By feeding sufficient data into deep neural networks, it is often possible to learn better representations than handcrafted features […].”
- Whereas TD-Gammon was completely online, here experience replay is used. Another reason experience replay is good is: “…learning directly from consecutive samples is inefficient, due to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates.”
- Also, on-policy updates may cause the algorithm to get stuck in a local minimum or diverge (they have a citation for this <I think I’ve seen it and the proof depends on the form of VFA used>)

- It only stores a fixed number of the most recent history samples <although I think it’s a million samples> and samples from those uniformly
- “A more sophisticated sampling strategy might emphasize transitions from which we can learn the most, similar to prioritized sweeping […].”
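A fixed-capacity buffer with uniform sampling, as described in the notes above, might look like the sketch below. The class name `ReplayBuffer`, the tuple layout of a transition, and the circular-overwrite implementation are assumptions for this example.

```python
import random

class ReplayBuffer:
    """Stores the most recent `capacity` transitions and samples them uniformly."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.storage = []
        self.next_idx = 0              # slot that the next write will use
        self.rng = random.Random(seed)

    def add(self, transition):
        # transition is assumed to be (state, action, reward, next_state, terminal).
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.next_idx] = transition  # overwrite the oldest entry
        self.next_idx = (self.next_idx + 1) % self.capacity

    def sample(self, batch_size):
        # Uniform sampling without replacement within one minibatch.
        return self.rng.sample(self.storage, batch_size)

    def __len__(self):
        return len(self.storage)
```

A prioritized variant would replace the uniform draw in `sample` with one weighted by some measure of how much can be learned from each transition.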

- The network has one output per action; each output is the estimated Q-value of that action given the current 4-step history
- Network has 3 hidden layers
- They also temporally extend actions by repeating each one for several steps (4 steps in most games, but in Space Invaders this introduced an artifact so they used 3, which was the only difference between games)
- In terms of evaluation “The average total reward metric tends to be very noisy because small changes to the weights of a policy can lead to large changes in the distribution of states the policy visits.”
- Instead they plot the agent’s estimated Q-value: the average maximum predicted Q over a fixed set of states collected by a random policy before training
- <Not so happy about this: it’s evaluating the agent based on its own estimate, which can do all sorts of crazy things (like diverge). Better to show average total reward, averaged over many experiments if need be, because at least that is grounded in truth. If you could know the *actual* Q-value (instead of the one estimated by the agent) that would be best, of course. Anyway, later in the comparison to linear SARSA they go back to cumulative reward.>
- On the plus side, showing the change in Q-value shows that it didn’t diverge.
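With one Q-value output per action, epsilon-greedy action selection reduces to an argmax with occasional random picks, and temporally extended actions are a simple loop. The sketch below illustrates both. The names `epsilon_greedy` and `repeat_action`, and the `step_fn(action) -> (obs, reward, terminal)` interface, are hypothetical and not taken from the paper or any emulator API.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def repeat_action(step_fn, action, k=4):
    """Apply `action` for up to k consecutive steps, summing the rewards.

    step_fn is an assumed environment callback returning (obs, reward, terminal).
    Stops early if the episode terminates.
    """
    total_reward = 0.0
    for _ in range(k):
        obs, reward, terminal = step_fn(action)
        total_reward += reward
        if terminal:
            break
    return obs, total_reward, terminal
```

In this sketch the agent would call `epsilon_greedy` once per decision and pass the result to `repeat_action`, so the network is evaluated only once every `k` emulator steps.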

- Have a nice example of how the rolling value estimate of the current state changes during a short sequence of gameplay
- Compare to linear SARSA on hand-engineered features, and something else similar to that but with a little extra domain knowledge; also compare to a human expert and a random policy. Finally, there is also an implementation of evolutionary policy search