- Deterministic policy gradient for **continuous action** MDPs
- Deterministic in that the action selected by the algorithm is deterministic, as opposed to stochastic
- Not a function of stochasticity in the domain <so far>

- “The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more effectively than the usual stochastic policy gradient.”
- Exploration is driven by off-policy actor-critic
- This method is shown to beat stochastic policy gradient in high-D spaces
- “It was previously believed that the deterministic policy gradient did not exist, or could only be obtained when using a model (Peters, 2010). However, we show that the deterministic policy gradient does indeed exist, and furthermore it has a simple model-free form that simply follows the gradient of the action-value function.”
- The deterministic policy gradient is equivalent to the stochastic policy gradient as policy variance approaches 0.
- The basic difference between the deterministic and stochastic policy gradients is that the former only integrates over states, while the latter also integrates over actions (so the stochastic version requires more samples <and similarly is more error-prone>)
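Written out side by side (notation as in the DPG paper; ρ denotes the discounted state distribution), the contrast is just the extra integral over actions:

```latex
% Stochastic policy gradient: integrates over states and actions
\nabla_\theta J(\pi_\theta)
  = \int_{\mathcal{S}} \rho^{\pi}(s) \int_{\mathcal{A}}
    \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi}(s, a)\,\mathrm{d}a\,\mathrm{d}s

% Deterministic policy gradient: integrates over states only
\nabla_\theta J(\mu_\theta)
  = \int_{\mathcal{S}} \rho^{\mu}(s)\,
    \nabla_\theta \mu_\theta(s)\,
    \nabla_a Q^{\mu}(s, a)\big|_{a=\mu_\theta(s)}\,\mathrm{d}s
```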
- The benefit of the stochastic policy gradient is that it has a nice way of doing exploration built in. Here an actor-critic setup is used to drive exploration: “We use the deterministic policy gradient to derive an off-policy actor-critic algorithm that estimates the action-value function using a differentiable function approximator, and then updates the policy parameters in the direction of the approximate action-value gradient. We also introduce a notion of *compatible* function approximation for deterministic policy gradients, to ensure that the approximation does not bias the gradient.”
- Experiments on a high-D bandit, some other stuff, and an **octopus arm**
- “… compatible <in that the critic does not introduce bias> function approximators are linear in ‘features’ of the stochastic policy….”
- Discusses off-policy actor-critic, which combines policy gradient and TD and provides an approximation of the true gradient
- Has to use importance sampling to compensate for the fact that the samples are generated by the behaviour policy rather than the distribution the algorithm would like to generate.
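A minimal sketch of the idea (my own toy setup, not the paper's algorithm): samples are drawn from a behaviour policy β, and each sample is reweighted by π(a|s)/β(a|s) so the average matches an expectation under the target policy π. Both policies here are simple uniform densities, chosen only for illustration.

```python
import random

random.seed(0)

# Behaviour policy beta: uniform density on [0, 2)
def beta_sample():
    return random.uniform(0.0, 2.0)

def beta_pdf(a):
    return 0.5 if 0.0 <= a < 2.0 else 0.0

# Target policy pi: uniform density on [0, 1)
def pi_pdf(a):
    return 1.0 if 0.0 <= a < 1.0 else 0.0

def f(a):
    return a  # quantity whose expectation under pi we want

n = 200_000
total = 0.0
for _ in range(n):
    a = beta_sample()
    w = pi_pdf(a) / beta_pdf(a)  # importance weight corrects the mismatch
    total += w * f(a)

est = total / n
# True value: E_pi[f(a)] = 0.5 for f(a) = a under uniform [0, 1)
print(f"{est:.3f}")
```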

- “In continuous action spaces, greedy policy improvement becomes problematic, requiring a global maximization at every step. Instead, a simple and computationally attractive alternative is to move the policy in the direction of the gradient of *Q*, rather than globally maximizing *Q*.”
- The policy-gradient estimate depends on the distribution of states visited, but it turns out the gradient of the state distribution does not have to be calculated
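A tiny illustration of that alternative (hypothetical one-parameter setup, not from the paper): rather than solving argmax over a continuous action space, repeatedly step the deterministic policy's output along ∇_a Q.

```python
# Q is a known quadratic with its maximum at a = 3.0; the policy is a
# single scalar parameter theta with mu(theta) = theta (state-free for brevity).

def grad_a_Q(a):
    # Q(a) = -(a - 3.0)**2  =>  dQ/da = -2 * (a - 3.0)
    return -2.0 * (a - 3.0)

theta = 0.0   # policy parameter
alpha = 0.1   # step size
for _ in range(100):
    a = theta                     # deterministic action
    theta += alpha * grad_a_Q(a)  # chain rule: d mu / d theta = 1

print(round(theta, 3))  # prints 3.0, the maximiser of Q
```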
- Naturally, as in the stochastic case, a differentiable estimate of the Q function must be used.
- In general, an approximation will not preserve the gradient of the true value function, but there are classes of FAs that do (these are called *compatible*)
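For reference, the paper's two compatibility conditions (restated from memory in standard notation, so treat as a paraphrase): a critic Q^w is compatible with μ_θ when

```latex
% 1. The critic's action-gradient is linear in the policy's "features"
\nabla_a Q^{w}(s, a)\big|_{a=\mu_\theta(s)}
  = \nabla_\theta \mu_\theta(s)^{\top} w

% 2. w minimises the mean-squared error between the critic's and the
%    true action-value gradients
\operatorname{MSE}(\theta, w) = \mathbb{E}\left[
  \big\| \nabla_a Q^{w}(s, a)\big|_{a=\mu_\theta(s)}
       - \nabla_a Q^{\mu}(s, a)\big|_{a=\mu_\theta(s)} \big\|^{2} \right]
```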

- Start with a description of gradient SARSA
- Then move to off-policy deterministic actor critic
- Linear FAs are compatible, and can be effective if they only have to locally dictate how to adjust parameters <but is it used only locally?>
- Seems like it can be linear in any set of basis functions, though

- Minimizing squared-error
- “To summarise, a *compatible off-policy deterministic actor-critic* (COPDAC) algorithm consists of two components. The critic is a linear function approximator that estimates the action-value from features [math]… This may be learnt off-policy from samples of a behaviour policy β(*a*|*s*), for example using Q-learning or gradient Q-learning. The actor then updates its parameters in the direction of the critic’s action-value gradient.”
- Although off-policy Q-learning may diverge when using linear VFA, there are now safer methods, which are what is used here
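The two components can be sketched on a toy 1-D continuous bandit (my own minimal setup and step sizes, not the paper's experiment; a bandit has no successor states, so the Q-learning target is just the reward). The critic is linear in the compatible feature (a − μ_θ), here with ∇_θμ_θ = 1, and the actor follows the critic's action-value gradient, which for this critic is exactly w.

```python
import random

random.seed(1)

theta = -1.0   # actor parameter; deterministic action is mu(theta) = theta
w = 0.0        # compatible critic weight: grad_a Q_w(s, a) = w
v = 0.0        # baseline value estimate for the single bandit state
a_w, a_v, a_th = 0.05, 0.05, 0.01   # step sizes (chosen ad hoc)

def reward(a):
    return -(a - 2.0) ** 2   # hypothetical bandit; optimal action is 2.0

for _ in range(5000):
    a = theta + random.gauss(0.0, 0.5)   # behaviour policy: Gaussian exploration
    phi = a - theta                      # compatible feature (grad_theta mu = 1)
    delta = reward(a) - (phi * w + v)    # TD error; no bootstrap in a bandit
    w += a_w * delta * phi               # critic: SGD on squared TD error
    v += a_v * delta
    theta += a_th * w                    # actor: follow grad_a Q_w = w

print(f"learned action = {theta:.2f}")   # should approach the optimum 2.0
```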
- Computational complexity is *mn*, where *m* is the action dimension and *n* is the number of policy parameters <which may be |*S*||*A*|?>
- On to experimental results
- High-D (D = 10, 25, 50) quadratic bandit. Seems like a pretty trivial problem – performance in the 50-D case converges at around 1,000 samples, while the stochastic version is still almost exactly at its starting performance at that point
- Then work in mountain car, pendulum, puddle world <at least in the first two, exploration isn’t trivial, although not extremely difficult>
- <Because the VFA is being done linearly, this doesn’t solve the problem of engineering features that allow the problem to be solvable, which is a fundamental issue in continuous RL>
- In octopus, reward is the distance from the arm to the target, so there is a nice smooth landscape to optimize on
- State space is simplified to 6-D
- VFA is done by ANN

- <Discussion>
- “Using a stochastic policy gradient algorithm, the policy becomes more deterministic as the algorithm homes in on a good strategy. Unfortunately this makes stochastic policy gradient harder to estimate, because the policy gradient ∇_{θ}π_{θ}(*a*|*s*) changes more rapidly near the mean. Indeed, the variance of the stochastic policy gradient for a Gaussian policy *N*(μ, σ^{2}) is proportional to 1/σ^{2} (…), which grows to infinity as the policy becomes deterministic. This problem is compounded in high dimensions, as illustrated by the continuous bandit task. The stochastic actor-critic estimates the stochastic policy gradient in … The inner integral …[math] is computed by sampling a high-dimensional action space. In contrast, the deterministic policy gradient can be computed immediately in closed form.”
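The 1/σ² claim is visible directly in the score function of a Gaussian policy (standard algebra, not quoted from the paper):

```latex
% Score of pi_theta = N(mu_theta(s), sigma^2) with respect to the mean
\nabla_\theta \log \pi_\theta(a \mid s)
  = \frac{a - \mu_\theta(s)}{\sigma^{2}}\, \nabla_\theta \mu_\theta(s)

% Since a - mu_theta(s) has variance sigma^2, the score's variance scales as
\mathbb{E}\left[ \left( \frac{a - \mu_\theta(s)}{\sigma^{2}} \right)^{2} \right]
  = \frac{1}{\sigma^{2}}
  \;\longrightarrow\; \infty \quad \text{as } \sigma \to 0
```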