Category Archives: Policy Gradient

Deterministic Policy Gradient Algorithms. Silver, Lever, Heess, Degris, Wierstra, Riedmiller. ICML 2014

  1. Deterministic policy gradient for continuous action MDPs
    1. Deterministic in that the action selected by the algorithm is deterministic as opposed to a stochastic
    2. Not a function of stochasticity in the domain <so far>
  2. “The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function.  This simple form means that the deterministic policy gradient can be estimated much more effectively than the usual stochastic policy gradient.”
  3. Exploration is driven by off-policy actor-critic
  4. This method is shown to beat stochastic policy gradient in high-D spaces
  5.  “It was previously believed that the deterministic policy gradient did not exist, or could only be obtained when using a model (Peters, 2010).  However, we show that the deterministic policy gradient does indeed exist, and furthermore it has a simple model-free form that simply follows the gradient of the action-value function.”
  6. The determinstic policy gradient is simply equivalent to the stochastic policy gradient as policy variance approaches 0.
  7. The basic difference between the deterministic vs stochastic policy gradient is that the first only integrates over state, while the latter also integrates over actions (so doing stochastic requires more samples <and similarly is more error-prone>)
  8. The benefit of stochastic policy gradient is that it has a nice way of doing exploration built-in.  Here an actor-critic setup is used to drive exploration: “We use the deterministic policy gradient to derive an off-policy actor-critic algorithm that estimates the action-value function using a differentiable function approximator, and then updates the policy parameters in the direction of the approximate action-value gradient.  We also introduce a notion of compatible function approximation for deterministic policy gradients, to ensure that the approximation does not bias the gradient.”
  9. Experiments on a high-D bandit, some other stuff, and octopus arm
  10. “… compatible <in that the critic does not introduce bias> function approximators are linear in ‘features’ of the stochastic policy….”
  11. Discusses off-policy actor critic, which is a combination of policy gradient and TD, provides an approximation of the true gradient
    1. Has to use importance sampling to compensate for the fact that the state distribution that is being observed and the one the alogrithm would like to generate are different.
  12. “In continuous action spaces, greedy policy improvement becomes problematic, requiring a global maximization at every step.  Instead, a simple and computationally attractive alternative is to move the policy in the direction of the gradient of Q, rather than globally maximizing Q.”
  13. The estimate of the action parameterization-value gradient depends on the distribution of states visited, but it turns out the gradient of the state distribution does not have to be calculated
  14. Naturally, as in the case of stochastic case, a differentiable estimate of the Q function must be used.
    1. In general, this will not preserve the gradient of the true value function (it is called compatible) but there are classes of FAs that will
  15. Start with a description of gradient SARSA
  16. Then move to off-policy deterministic actor critic
  17. Linear FAs are compatible, and can be effective if they only have to locally dictate how to adjust parameters <but is it used only locally?>
    1. Seems like it can be linear in any set of basis functions, though
  18. Minimizing squared-error
  19. “To summarise, a compatible off-policy deterministic actor-critic (COPDAC) algorithm consists of two components.  The critic is a linear function approximator that estimates the action-value from features [math]… This may be learnt off-policy from samples of a behaviour policy β(a|s), for example using Q-learning or gradient Q-learning.  The actor then updates its parameters in the direction of the critic’s action-value gradient.”
  20. Although off-policy QL may diverge when using linear VFA, there are now methods that are safer, which is what is used here
  21. Computational complexity is mn where m = |A| and n is the number of policy parameters <which may be |S||A|?>
  22. On to experimental results
  23. High-D (D = 10, 25, 50) quadratic bandit.  Seems like a pretty trivial problem – perf in the 50D case converges at around 1000 samples.  Stochastic is still almost at exactly the same performance it started with at that point
  24. Then work in mountain car, pendulum, puddle world <at least in the first two, exploration isn’t trivial, although not extremely difficult>
  25. <Because the VFA is being done linearly, this doesn’t solve the problem of engineering features that allow the problem to be solvable, which is a fundamental issue in continuous RL>
  26. In octopus, reward is the distance from the arm to the target, so there is a nice smooth landscape to optimize on
    1. State space is simplified to 6-D
    2. VFA is done by ANN
  27. <Discussion>
  28. “Using a stochastic policy gradient algorithm, the policy becomes more deterministic as the algorithm homes in on a good strategy.  Unfortunately this makes stochastic policy gradient harder to estimate, because the policy gradient ∇θπθ(a|s) changes more rapidly near the mean.  Indeed, the variance of the stochastic policy gradient for a Gaussian policy N(μ, σ2) is proportional to 1/σ2 (…), which grows to infinity as the policy becomes deterministic.  This problem is compounded in high dimensions, as illustrated by the continuous bandit task.  The stochastic actor-critic estimates the stochastic policy gradient in … The inner integral …[math], is computed by sampling a high dimensional action space.  In contrast, the deterministic policy.”

Evolution Strategies for Direct Policy Search. Verena Heidrich-Meisner, Christian Igel. Parallel Problem Solving from Nature, 2008

  1. Paper discusses use of Covariance Matrix Adaptation Evolution Strategy (CMA-ES) for doing policy search.
    1. This algorithm is a more general form of cross entropy optimization (different rules for the weights can be used)
  2. Comparison is made to Natural Actor Critic (said to be the most sophisticated policy gradient method), and pure random search
  3. A couple of references that put evolutionary alogrithms for policy search ahead of alternative methods
    1. A significant difficulty in comparing policy search to value based methods is that they are so different, so selecting particular classes of FAs can bias comparison
    2. Evolutionary Function Approximation for Reinforcement Learning.  Whiteson, Stone.  JMLR 2006.
  4. Says gradient-based methods generally function poorly in the presence of noise, and domains where there are many poor local maxima
    1. Also claim that because evolution based methods only relatively weigh a number of candidates, they are more resilient to noise than gradient methods:
    2. Variable Metric Reinforcement Learning Methods Applied to the Noisy Mountain Car Problem.  Same authors.  EWRL 2008
  5. A bunch of references for CMA-ES optimization (not RL applications)
    1. A Completely Derandomized Self-Adaptation in Evolution Strategies.  Hansen, Ostermeier.  Evolutionary Computation 2001
    2. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation.  2003
  6. CMA-ES was first used for RL in:
    1. Neuroevolution for Reinforcement Learning Using Evolution Strategies.  Congress on Evolutionary Computation.  2003
  7. The authors also have a couple other papers comparing CMA-ES to policy gradient methods, and CMA-ES is more robust
  8. For some settings of CMA-ES, random search performs better, policy gradient is worse
    1. They blame the failure of policy search on the fact that the fitness landscape has many bad plateaus of equally bad quality, so it ends up effectively doing a local random walk in policy space
  9. Say that aside from all the good qualities of CMA-ES policy gradient can outperform it if initialized close to the solution

Cross-Entropy Optimization of Control Policies with Adaptive Basis Functions. Busoniu, Ernst, De Schutter, Babuska. IEEE Transactions on Systems, Man, and Cybernetics. 2011.

  1. Referenced in PATH INTEGRAL POLICY IMPROVEMENT WITH COVARIANCE MATRIX ADAPTATION as an early application of cross-entropy to continuous spaces
  2. Although this paper works in discrete action spaces.
  3. Alg looks for best closed-loop policy that can be represented using a given number of basis functions, where a discrete action is assigned to each basis function (type and # of BFs are specified a priori)
  4. Compare against the optimization algorithm DIRECT (interesting)
  5. Has citations that says value-function approximation generally requires expert design, or leverages basis functions because it has nice theoretic properties (but poor actual performance).  Attempts have been made to adaptively form basis functions for VF approximation, but this can lead to convergence issues.
  6. Also has citations that say for policy search methods, functions that define policy are generally either ad-hoc or require expert design
  7. Funny that the algorithm can decide where to put the BFs and what action to select from them, but only selects from a discrete set.  Seems trivial to move to continuous actions from here (may see why that is tough in a minute)
  8. Their test domains are the double integrator, bicycle, and HIV
  9. It is compared against LSPI and fuzzy Q, as well as DIRECT optimization of the basis functions
  10. Actor-critic methods perform both gradient-based policy optimization as well as value function approximation
  11. Gradient methods also assume reaching the local optimum is good enough, but in some cases there are many local optima which are not good
    1. This is particularly problematic when the policy representation is rich (many RBFs) as opposed to frugal (few linear functions)
  12. There are other cited methods for finding basis functions for VF approximation that they use as inspiration for doing the same for policies
  13. Convergence of Cross-Entropy is not guaranteed, although in practice it is generally convergent
    1. Although for discrete optimization the probability of reaching optimum can be made arbitrarily close to 1 by using an arbitrarily small smoothing parameter
  14. Argue here that the policy is easier to represent than the value function in many cases
  15. Compared to value function approximators with equally spaced basis functions, CE required fewer BFs, but this is natural – did they compare to VFAs that use adaptive basis functions (they cited it)
    1. Adaptive basis functions allows it to work better in high dimensional spaces
  16. They give  a shout to doing continuous action stuff with the algorithm

Policy Search for Motor Primitives in Robotics. Kober, Peters. Machine Learning 2011.

  1. Cited from to PATH INTEGRAL POLICY IMPROVEMENT WITH COVARIANCE MATRIX ADAPTATION, as an example of continuous RL with policy search
  2. Says that these methods have seen greatest success in large domains where a teacher starts learning close to the solution and then the algorithm completes learning
  3. This is the PoWER (Policy learning by Weighing Exploration with the Returns) paper
  4. Here they rely on special kinds of parameterized policies that are well suited for robotics problems
    1. Methods generally requires some expert assistance.
  5. The use this for swing up, ball-in-a-cup on a real robot arm.
  6. Uses expert instruction here as well
  7. Since these “tasks are inherently single-stroke movements, we focus on a special class of episodic reinforcement learning”
    1. So again, they are using a lot of domain expertise to make the task learnable (not that this is a bad idea when working on a real robot)
  8. The algorithm is EM-based
  9. The math for the algorithm is based on optimizing a lower bound of performance.  They say that this generally is not the way things are done in RL, although it is more common in supervised learning
    1. They show that policy gradient methods can also be derived from this method of derivation, that idea seems to be introduced here
    2. “We show that natural policy gradients can be seen as an additional constraint regularizing the change in the path distribution resulting from a policy update when improving the policy incrementally”
  10. While gradient based methods require a learning rate to be provided (which can be very troublesome to get correct), EM algorithms do not require a learning rate and converge more quickly
  11. In general during policy search, exploration is performed by corrupting the action applied at each time step by some Gaussian noise.  They note, however, that this ultimately can lead to too much change from the true policy, “washing out” the performance of the policy constructed during evaluation.  Because of this, they propose using state and time dependent noise, so the exploration can be more precisely controlled
  12. The construct they use for policy representation is well known for robotic arms and results in a linear policy
  13. To compare against other algorithms, they use a couple of simulations based on a reaching task with a robotic arm
  14. They claim that “kinesthetic teach-in” is required because the test domains are so large, but swingup is quite small (the other however is difficult enough I could see it being very tough), so I don’t buy that explanation.

The Cross Entropy Method for Fast Policy Search. Mannor, Rubinstein, Gat. ICML 2003

  1. According to PATH INTEGRAL POLICY IMPROVEMENT WITH COVARIANCE MATRIX ADAPTATION, this is the first paper that uses cross entropy methods for policy search.
  2. Survey of policy gradient methods at:
    1. Experiments with Infinite-horizon, Policy Gradient Estimation.  JMLR 2001
  3. Connection between stochastic optimization and Q-Learning made in:
    1. Asynchronous Stochastic Approximation and Q-learning.  Tsitsiklis.  Machine Learning 1994.
  4. Standard stochastic optimization is slow because of constraints in the way the update can be performed
  5. The approach here (cross-entropy), on the other hand is designed to be fast, and functions differently than standard stochastic optimization
  6. For background on Cross-Ent optimization, see:
    1. A tutorial on the Cross-entropy method.  de-Boer, Kroese, Mannor, Rubinstein.  Annals of OR
    2. The Simulated Entropy Method for Combinatorial and Continuous Optimization.  Rubenstein.  Methodology and Computing in Applied Probability 1999.
  7. “Good” samples in cross ent are usually referred to as elites
  8. Claim the method is generally robust as long as sample sizes are large and spread out at beginning, some additional smoothing may be used (too fast convergence seems to be an issue that comes up in some of the literature)
  9. A variant is proposed that is more adaptive in choice of generation size and how many samples are elites
  10. In this paper, cross ent is used in a finite MDP to generate a stochastic policy
  11. Here they care about finding a global policy and update policy of every state based on the history after that state was visited during a trajectory
  12. In section 4 they discuss parameterized policies (as opposed to ones that just encode p(a|s) in a matrix)
  13. When using cross entropy this (the normal, parametrizing a policy) way, mu(|s, theta) doesn’t have to be differentiable w.r.t. theta, which is not true in policy gradient algorithms
  14. Alg is tested empirically in a gridworld as well as inventory control (in this problem, the policy is at what stock level is each item re-ordered)
  15. Say that other methods of optimization can be used, but many are sensitive to sampling error (by that I think he means noise).  Says that gradient methods as well as simulated annealing and guided local search all have this problem)

Path Integral Policy Improvement with Covariance Matrix Adaptation. Stulp, Sigaud. ICML 2012

  1. This looks right up my alley.  Discusses policy search methods.  Compares PI^2 to Cross-Entropy and CMAES.
  2. Then proposes a new algorithm PI^2-CMA, whose main advantage is that it determines magnitude of exploration noise automatically
  3. 2 continuous RL papers they cite are:
    1. Policy Search for Motor Primitives in Robotics.  Kober, Peters.  Machine Learning 2011.
    2. A Generalized Path Integral Control Approach to Reinforcement Learning.  JMLR 2010.
  4. Says best algorithms now are Policy Improvment with Path Integrals (PI^2)
    1. The claim is that it outperforms REINFORCE and Natural Actor Critic by an order of magnitude in terms of speed and quality
    2. But at least REINFORCE is total garbage so that is not an impressive claim to be able to make
  5. PI2 is different from policy gradient methods in that it uses probability-weighted averaging to do the parameter update instead of an estimate of the gradient.
    1. CMAES (Covariance Matrix Adaptation-Evolutionary Strategy), as well as CEM (Cross-Entropy Methods) also use this basic idea
  6. Although all 3 algorithms have same basic component, each arrived at the rule from different principles
  7. By reinterpreting CEM as performing probability-weighted averaging, it can be shown that CEM is a special case of CMAES
  8. For the Cross-Entropy method, a particular form of probability-weighted averaging is performed, where the “good” samples get probability 1/k (if there are k of them), and the “bad” samples get probability 0.
  9. CEM for policy optimization was introduced in:
    1. The Cross-Entropy Method for Fast Policy Search.  Mannor, Rubenstein, Gat. ICML 2003.  This was primarily for finite MDPs, they mention the extension to continuous spaces
  10. Cross-Entropy in continuous spaces was more thoroughly introduced in:
    1. Cross-Entropy Optimization of Control Policies with Adaptive Basis Functions.  Busoniu, Ernst, Schutter, Be, Babuska.  IEEE Transactions on Systems, Man, and Cybernetics.  2011.
  11. CEM has also been used with sampling-based motion planning:
    1. Learning Cost-efficient Control Policies with XCSF: Generalization capabilities and Further Improvement.  Marin, Deock, Rigoux, Sigaud.  Genetic and Evolutionary Computation
  12. CMAES is very similar to CEM, except it uses a more sophisticated method to update the covariance matrix
    1. There is also an extra variable that scales the covariance matrix and therefore controls exploration
    2. It also keeps track of the history of parameter changes and uses this “evolution path” to help speed convergence based on correlations between generations
  13. For a description of CMAES, see:
    1. Completely Derandomized Self-adaptation in Evolution Strategies.  Hansen, Ostermeier. Evolutionary Computation 2001.
  14. For example of CMAES applied to double pole balancing, see:
    1. Evolution Strategies for Direct Policy Search.  Heidrich-Meisnerm, Igel.  Parallel Problem Solving from Nature 2008
  15. In empirical results, uses Dynamic Movement Primitives
    1. A Generalized Path Integral Control Approach to Reinforcement Learning.  Theodorou, Buchli, Schaal.  JMLR 2010
  16. “Using parameterized policies avoids the curse of dimensionality associated with (discrete) state-action spaces, and using probability-weighted averaging avoids having to estimate a gradient, which can be difficult for noisy and discontinuous cost functions”
  17. Seems like PI^2 has high computational costs relative to something like Cross-Entropy
  18. Also discusses PoWeR algorithm, but says performance is basically the same is PI^2
  19. “PI^2’s properties follow directly from first principles of stochastic optimal control.”  On the other hand, CMAES and CEM have less formal backing
  20. The evaluation task is a time-dependent 10-DOF arm, previously used in the Theodorou paper
  21. Unlike in CEM/CMAES, the covariance matrix in PI^2 does not vary; only the mean does.  This is done in order to have the derivation of PI^2 go through.
    1. But in this paper they ignore that and use adaptive covariance as well
    2. Which is what they call PI^2-CMA, Path Integral Policy Improvement with Covariance Matrix Adaptation.
    3. Forming PI^2-CMAES is analagous but slightly more complex
  22. PI^2 has much slower convergence than the variance proposed in this paper
  23. They have results for how initial parameterizations impact perf, doesn’t seem to matter too much

Learning, Inference, and Control for Robotics and Sustainable Energy. J. Zico Kolter. Talk

  1. Gives example of little dog – easy to model kinematics, but full model is difficult with uncertain terrain, similar example with driving on poor surfaces
  2. Can make the same argument about wind turbines (hard to model) or what happens in te home (hard to control)
  3. But data is easy to get
  4. 2 parts: data driven learning and control for dynamic tasks, and data driven control for sustainable energy


  1. Say its very diificult to accurately make a model of robot dynamics from pure physics, so better to use data to (help) make a model
  2. Say many planning problems can be helped just by looking at the sign of derivative terms. Do gradient descent just on that
  3. Does this form of policy gradient to teach little dog to climb steps in about 5 minutes
  4. Same issue with drift-parking. Cant use a dynamic model based on phystics because it misses particular bits, so you should combine observed data
  5. Idea is dynamics is hard to model, but maneuver is repeatable over short horizons.  So parts can be open-loop
  6. Result is molti-model LQR:
    1. Use predictions errors over data to estimate model variance
    2. Use variance- aware method (new iterative LQR method) to compute optimal controls


  1. Generating energy from wind-turbines with a data-driven control approach (control right pitch of blades)
  2. The models we have for wind dynamics is not accurate, and operates in very restricted conditions.  They really suck
  3. Because of this online optimization is important.  Go about doing stochastic optimization
  4. Care about data efficiency, satisfied with local optimim
  5. “Trust region policy search” – use second order (Hessian) info to optimize.  Update param values by trust-region
    1. Need to estimate Hessian, which is difficult, but can do important sampling on previous results to reduce the # of samples needed
    2. Hessian may be indefinite, so use a trus region solver – fits a polynomial only locally as opposed to globally.  This can be solved exactly
    3. Use something based on variance of gaussian used to sample to pick region
  6. Beats up REINFORCE badly (but even back in the day REINFORCE was known to be a very sample inefficient algo), but indepenedent of that it does quickly climb up to the optimal region
  7. Idea is to use the power consumption of the entire home at the power meter coming in instead of monitorig each outlet independently
  8. Uses HMMs to model whether the state of each device is on or off.  Problem is current algs can’t deal with input sizes as large as what occurs in a home
  9. Do spectral clustering on the data to identify what is actually happening in the house
  10. Need a new alg to do tractable inference.  It is a convex approx inference methods that can be quickly solved for hudreds of thousands of variables

Nested Rollout Policy Adaptation for Monte Carlo Tree Search. Christopher D. Rosin. IJCAI 2011 Best Paper

  • Most MCTS algorithms use a fixed policy, this algorithm adapts the rollout policy during search
  • Based on an earlier Nested Monte Carlo Search
    • But here use gradient ascent on rollout policy at each level of the “nested search” (whats that)
    • Test on crossword puzzle construction and solitare, generates state of the art results for these domains  (the previous record in solitaire was human generated and had stood for 30 years)
    • Concerned with deterministic optimization problems here.
    • Says in general, nested Monte Carlo search generates state of the art solutions in deterministic problems… what is that?
    • Search is initiated at a nesting level n, which is the depth of the recursion for the nested search.
      • At each nesting level, the algorithm searches for a solution sequence (a sequence of states that leads to a leaf)
      • When attempting to refine its policy, nesting level n calls level n-1, bottoming out at 0
      • Says when nesting succeeds, level n is substantially better than n-1
      • Doesn’t require smoothness or evaluation heuristics, but can benefit from them
      • It can be DAG-ified
      • “NRPA may not obtain robust policies that give viable results starting from arbitrary nodes in the tree.  Instead, NRPA’s policies emphasize actions taken by the best solutions so far”
      • The parameterization did not change across application domains
      • Took about a day of processing to beat  a previous result from a different algorithm that took 38 days to run

————From video of talk

  • Effectively “learns” heuristic, so good for case there is no good heuristic
    • Dyna-2 is something that sort-of has the same property?
    • Policy is describable as some combination of parameters (may be enormous, tens of thousands)
    • Don’t care about values of nodes cares about actions.   Identify actions that occur across successful rollouts.
    • Use structure of domain to identify successful rollouts
    • Rollouts always start from root
    • Policies are stochastic sort of like softmax: p(a) = e^policy(code(a)), similar thing has been used for exploration in Go in the past.
    • Level 1 search:
      • Do n iterations
        • Keep track of best policy found
        • Update policy by gradient on log-prb of best policy
        • Gradient exponentially quick moves to best, so iterations in experiments was small (100)
        • Initially policy is totally random, but policy adapts by gradient to best, and does local optimization and then it finishes
        • The point of nesting is to cause more global search to be conducted
        • For level L>1 search:
          • Do n iterations
            • Call algorithm for the level below (L-1)
            • Keep track of best
            • Adapt policy by gradient method
  • Each level takes n times as long as the level below it, so usually can’t do better than 5 levels
  • I’m still not 100% on this
  • The difference between the two domains tested is one is wide, the other is deep (although both seem very large)
  • Seems like level 1 is initialized totally at random, when run by itself, but when run by level-2 the restarts aren’t totally random; they are taken from the distribution (policies are stochastic) found by the policy at level 2
  • Seems like the way this differs from the nested monte carlo search work it is based on is the use of gradient methods to adjust policy
  • Not directly applicable to 2-player games, he was motivated by go but hasn’t applied ti to that yet
  • Doesn’t maintain statistics over the search tree, just the policy. Memory use is probably minimal then
  • There are guarantees for performance in paper

Black Box Policy Search in the Inverse Pendulum

While working on the action sequence search vs. policy parameter search, I noticed a few things:

  1. Action sequence planning with replanning seemed the most robust
  2. Action sequence planning without replanning seemed to be about as bad as random
  3. In the policy parameter search, using the parameter setting that resulted in the single best run (I call this greedy) seemed to work better than asking the HOO algorithm to find the best parameterization (I call this Hoo-greedy, by greedily following the means down the tree)

This left me with a number of questions, one of the bigger ones being what the impact of noise is on the greedy vs. hoo-greedy policies.  I ran all four methods (action sequence and policy parameter search with/out replanning) in the inverted pendulum domain with varying amounts of noise.  The full action range in the domain is from -50 to 50 units, and in the experiments I ran the actions were corrupted by noise uniformly distributed from 0 to +/- 4 units of noise.  The results are below:

Action sequence search with replanning

Action sequence search without replanning

Policy parameter search with replanning

Policy parameter search without replanning

So from here the new basic takeaways are in the inverted pendulum domain:

  1. Action sequence with replanning works close to optimal with the greedy or hoo-greedy results regardless of noise
  2. Parameter search with replanning works close to random performance either way.  This is good because in the double-integrator domain the results between parameter search and action sequence search with replanning were on top of each other.
  3. Parameter search without replanning only seems to work when the greedy parameter settings are chosen, but that only works in domains with no noise at all.  Even very small amounts of noise seem to cause trouble in this domain

Aside from that, I’m not sure to say.  Unfortunately, I can’t really think of anything concrete to pull from the experiments here and in the double integrator so I think I’ll keep running experiments and see what other things seem to come up.

Update: a similar graph rolled in 1 for the double integrator domain.  Its a bit of a mess, but action sequence search with replanning use the Hoo-suggested action still comes out on top.  There are some gnarly error bars because I only ran a few trials of each as they take a bit of time:

Performance of all the approaches in the double integrator

Another update, here are results of a similar experiment in the Ball-Beam domain:

Action sequence search with replanning

Action sequence search without replanning

Policy parameter search with replanning

Policy parameter search without replanning

Another update, here are results of a similar experiment in the Bicycle domain:

Action sequence search with replanning

Action sequence search with replanning

Action sequence search without replanning

Action sequence search without replanning

Policy parameter search with replanning

Policy parameter search with replanning

Policy parameter search without replanning

Policy parameter search without replanning

Different Approaches to Policy Search

Our last email ended with Michael Saying:

Have you (or could you) fill in a 2×2 table that looks something like this?

1a. Behavior = parameterized policy, planning performed = compile time
1b. Behavior = parameterized policy, planning performed = run time (sequential)
2a. Behavior = action sequence, planning performed = compile time
2b. Behavior = action sequence, planning performed = run time (sequential)

I’d call 1a “classical” policy search and 2b HOLOP.  I’d compare them in terms of computation performed beforehand and computation performed during the run and effectiveness (in terms of reward).  I’d hypothesize that the as beat the bs in terms of computational demands during the run, but the bs beat the as in terms of computational demands before the run. I’d also expect to see a smaller difference between the effectiveness of 1b and 2b than of 1a and 2a.  That is, given that you are going to do run-time decision making, the less expensive action sequence representation is sufficient for realistic problems.  Ideally, the run-time resources needed for 2b are not substantially greater than those of 1a, the compile time resources are much better, and the effectiveness is similar.

I just finished up implementing and running these different approaches.  Here’s a picture:

So more of an explanation.  The x-axis represents the number of double integrators (DIs) simultaneously controlled, the y-axis represents the avg. cumulative reward of each agent.  Episodes were 200 steps long, and the planners were given access to 100 roll-outs from a generative model for each step in the episode.  1a and 2a took all 200*100 rollouts upfront and attempted to find an optimal policy, while 1b, 2b replanned from scratch and used 100 rollouts per step.

All algorithms used HOO to search the parameter space.  In 1a and 1b, HOO was used to find weights of a neural net encoding the policy of an agent.  The structure of the net was to have |S|+1 inputs, 2|S| hidden nodes, and |A| output nodes.  Accordingly, the size of the neural network grows polynomially with the number of DIs controlled.  1a and 1b yielded very poor (close to random) performance when the final weights were selected by HOO according to a greedy search down the trees by the mean.  In the graph the performance shown is the result of simply taking the weights from whichever run yielded the maximum reward, which is safe to do in a deterministic domain but not in a stochastic one (which this was).

For 2a, 2b an action sequence was found by HOO/HOLOP and then the “best” policy was selected by greedily following nodes down the HOO tree, which yielded better results in that case.

The most interesting result I find is that classical policy search (1a) starts out with the best performance but quickly degrades to random performance as the number of parameters describing the policy grows.  1b and 2b, which do replanning are much more tolerant of the increasingly difficulty of the task, probably because of the fact that they aren’t trying to find a good representation of the policy globably, and can recover from a few poor policy searches by later ones later on.  2a, which is HOLOP without planning is quite bad, but had very low variance in returns (not graphed here), with close to constant performance, and at least managed to stay above random performance.

I know this isn’t the graph you asked for, as you wanted to compare them between computation performance before and during the run, but I’m not sure how to really show that.  With the a’s,  very close to 100% of the computation is done before and very close to 0% is done during the run, and vice-versa with the b’s, so I don’t think its interesting in really graphing?