Category Archives: Continuous action spaces

Model-Based Reinforcement Learning in Continuous Environments Using Real-Time Constrained Optimization. Andersson, Heintz, Doherty. AAAI 2015

  1. Working on high-D continuous RL
  2. Builds a model with sparse Gaussian processes, and then does local (re)planning “by solving it as a constrained optimization problem”
  3. Use MPC/control related methods that were done back in ’04 but revisited here and can be used for real-time control now
  4. Test in “extended” cart-pole <all this means here is that the start state is randomized> and quadcopter
  5. Don’t try to do MCTS, because it is expensive.  Instead use gradient optimization
  6. Instead of normal O(n^3) costs for GPs, this has O(m^2 n), where m < n
  7. “However, as only the immediately preceding time steps are coupled through the equality constraints induced by the dynamics model, the stage-wise nature of such model-predictive control problems result in a block-diagonal structure in the Karush-Kuhn-Tucker optimality conditions that admit efficient solution. There has recently been several highly optimized convex solvers for such stage-wise problems, on both linear (Wang and Boyd 2010) and linear-time-varying (LTV) (Ferreau et al. 2013; Domahidi et al. 2012) dynamics models.”
  8. Looks like the type of control they use has to linearize the model locally
  9. “For the tasks in this paper we only use quadratic objectives, linear state-action constraints and ignore second order approximations.”
  10. Use an off-the-shelf convex solver for doing the MPC optimization (a rough sketch of the receding-horizon loop is after this list)
  11. Use warm starts for replanning
  12. The optimization converges in a handful of steps
  13. <Say they didn’t need to do exploration at all for the tasks they considered, but it looks like they have a pure random action period at first>
  14. Although the cart-pole is a simple task, they learn it in less than 5 episodes
    1. <But why no error bars, especially when this experiment probably takes a few seconds to run?  This is crazy in a paper from 2015; although it is probably fine, it makes me wonder if it sometimes fails to get a good policy>
  15. Use some domain knowledge to make learning the dynamics for the quadcopter a lower-dimensional problem
    1. 8D state, 2D action
  16. For the quadcopter, training data from a real quadcopter(?) is fed in, and then it is run in simulation
  17. “By combining sparse Gaussian process models with recent efficient stage-wise solvers from approximate optimal control we showed that it is feasible to solve challenging problems in real-time.”
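
Not the paper’s sparse-GP model or its custom stage-wise solver, just a minimal cvxpy sketch of the receding-horizon loop described above, with a quadratic objective and linear action constraints as in item 9; the locally linearized dynamics (A, B), the reference state x_ref, the horizon H, and the action bound u_max are all assumptions on my part.

```python
# Minimal MPC sketch: build the stage-wise QP once, then replan every step,
# letting the solver warm-start from the previous solution.
import numpy as np
import cvxpy as cp

def build_mpc(A, B, x_ref, H=20, u_max=1.0):
    n, m = B.shape
    x = cp.Variable((H + 1, n))
    u = cp.Variable((H, m))
    x0 = cp.Parameter(n)                     # current state, set before each solve
    cost = 0
    constraints = [x[0] == x0]
    for t in range(H):
        cost += cp.sum_squares(x[t + 1] - x_ref) + 0.01 * cp.sum_squares(u[t])
        constraints += [x[t + 1] == A @ x[t] + B @ u[t],   # dynamics couple only adjacent steps
                        cp.norm(u[t], "inf") <= u_max]      # simple linear action constraint
    return cp.Problem(cp.Minimize(cost), constraints), x0, u

def mpc_step(problem, x0_param, u, x_now):
    x0_param.value = x_now
    problem.solve(warm_start=True)           # reuse the previous plan when replanning
    return u.value[0]                        # apply only the first action, then replan
```

The block-diagonal/stage-wise structure the quote in item 7 refers to is what the dedicated solvers they cite exploit; a generic solver here just illustrates the loop.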

Deterministic Policy Gradient Algorithms. Silver, Lever, Heess, Degris, Wierstra, Riedmiller. ICML 2014

  1. Deterministic policy gradient for continuous action MDPs
    1. Deterministic in that the action selected by the algorithm is deterministic, as opposed to stochastic
    2. Not a function of stochasticity in the domain <so far>
  2. “The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function.  This simple form means that the deterministic policy gradient can be estimated much more effectively than the usual stochastic policy gradient.”
  3. Exploration is driven by off-policy actor-critic
  4. This method is shown to beat stochastic policy gradient in high-D spaces
  5.  “It was previously believed that the deterministic policy gradient did not exist, or could only be obtained when using a model (Peters, 2010).  However, we show that the deterministic policy gradient does indeed exist, and furthermore it has a simple model-free form that simply follows the gradient of the action-value function.”
  6. The deterministic policy gradient is equivalent to the stochastic policy gradient in the limit as policy variance approaches 0.
  7. The basic difference between the deterministic vs stochastic policy gradient is that the first only integrates over state, while the latter also integrates over actions (so doing stochastic requires more samples <and similarly is more error-prone>)
  8. The benefit of stochastic policy gradient is that it has a nice way of doing exploration built-in.  Here an actor-critic setup is used to drive exploration: “We use the deterministic policy gradient to derive an off-policy actor-critic algorithm that estimates the action-value function using a differentiable function approximator, and then updates the policy parameters in the direction of the approximate action-value gradient.  We also introduce a notion of compatible function approximation for deterministic policy gradients, to ensure that the approximation does not bias the gradient.”
  9. Experiments on a high-D bandit, some other stuff, and octopus arm
  10. “… compatible <in that the critic does not introduce bias> function approximators are linear in ‘features’ of the stochastic policy….”
  11. Discusses off-policy actor critic, which is a combination of policy gradient and TD, provides an approximation of the true gradient
    1. Has to use importance sampling to compensate for the fact that the state distribution that is being observed and the one the algorithm would like to generate are different.
  12. “In continuous action spaces, greedy policy improvement becomes problematic, requiring a global maximization at every step.  Instead, a simple and computationally attractive alternative is to move the policy in the direction of the gradient of Q, rather than globally maximizing Q.”
  13. The estimate of the action parameterization-value gradient depends on the distribution of states visited, but it turns out the gradient of the state distribution does not have to be calculated
  14. Naturally, as in the stochastic case, a differentiable estimate of the Q function must be used.
    1. In general, this will not preserve the gradient of the true value function, but there are classes of FAs that will (these are called compatible)
  15. Start with a description of gradient SARSA
  16. Then move to off-policy deterministic actor-critic (a minimal sketch of the actor step is after this list)
  17. Linear FAs are compatible, and can be effective if they only have to locally dictate how to adjust parameters <but is it used only locally?>
    1. Seems like it can be linear in any set of basis functions, though
  18. Minimizing squared-error
  19. “To summarise, a compatible off-policy deterministic actor-critic (COPDAC) algorithm consists of two components.  The critic is a linear function approximator that estimates the action-value from features [math]… This may be learnt off-policy from samples of a behaviour policy β(a|s), for example using Q-learning or gradient Q-learning.  The actor then updates its parameters in the direction of the critic’s action-value gradient.”
  20. Although off-policy QL may diverge when using linear VFA, there are now methods that are safer, which are what is used here
  21. Computational complexity is mn where m = |A| and n is the number of policy parameters <which may be |S||A|?>
  22. On to experimental results
  23. High-D (D = 10, 25, 50) quadratic bandit.  Seems like a pretty trivial problem – perf in the 50D case converges at around 1000 samples.  Stochastic is still almost at exactly the same performance it started with at that point
  24. Then work in mountain car, pendulum, puddle world <at least in the first two, exploration isn’t trivial, although not extremely difficult>
  25. <Because the VFA is being done linearly, this doesn’t solve the problem of engineering features that allow the problem to be solvable, which is a fundamental issue in continuous RL>
  26. In octopus, reward is the distance from the arm to the target, so there is a nice smooth landscape to optimize on
    1. State space is simplified to 6-D
    2. VFA is done by ANN
  27. <Discussion>
  28. “Using a stochastic policy gradient algorithm, the policy becomes more deterministic as the algorithm homes in on a good strategy.  Unfortunately this makes stochastic policy gradient harder to estimate, because the policy gradient ∇θπθ(a|s) changes more rapidly near the mean.  Indeed, the variance of the stochastic policy gradient for a Gaussian policy N(μ, σ2) is proportional to 1/σ2 (…), which grows to infinity as the policy becomes deterministic.  This problem is compounded in high dimensions, as illustrated by the continuous bandit task.  The stochastic actor-critic estimates the stochastic policy gradient in … The inner integral …[math], is computed by sampling a high dimensional action space.  In contrast, the deterministic policy.”
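
A minimal numpy sketch of the actor half of what is described above, for a 1-D action and a linear policy mu(s) = theta @ phi(s): with the compatible critic, the action-gradient of Q at a = mu(s) reduces to phi(s) @ w. The names, shapes, and step size are my own; the critic weights w would be learned separately off-policy (e.g. with gradient Q-learning, as the quote in item 19 says).

```python
import numpy as np

def copdac_actor_update(theta, w, phi_s, alpha=1e-2):
    """One deterministic-actor step: move theta along grad_theta mu(s) * grad_a Q(s, a)|a=mu(s)."""
    grad_theta_mu = phi_s          # for mu(s) = theta @ phi(s), d mu / d theta = phi(s)
    grad_a_Q = phi_s @ w           # compatible critic's action-gradient at a = mu(s)
    return theta + alpha * grad_theta_mu * grad_a_Q
```

Note that nothing here integrates over actions, which is the point made in item 7.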

Optimal Learning for Sequential Sampling with Non-parametric Beliefs. Barut, Powell. Journal of Global Optimization?

[this is based on a quick 2nd read through, so less notes than usual]

  1. Aggregates over a set of kernel functions in order to do ranking and selection more effectively 
  2. The weighting scheme used is shown to be optimal under independent kernel estimators (need to find out what that means exactly)
  3. Uses the knowledge gradient to select sampling points (roughly: uses GPs, then estimates the gradient of the bounds and attempts to sample where the gradient is highest, but a separate optimization algorithm is needed to find those locations).  A simplified sketch of the selection rule is after this list.
  4. Proposed policy is shown to be asymptotically optimal
  5. This paper is concerned with searching over a finite number of options (computationally, restricted to some thousands)
  6. Similarity is measured by a kernel.  Although there is no explicit use of Lipschitz smoothness or concavity, kernel metrics must be Hölder
  7. In the offline setting, the problem addressed is called ranking and selection, but online it's called the multi-armed bandit
  8. References for earliest papers on stochastic search
  9. Noise is assumed to be normally distributed
  10. Policy is 1-step optimal, and converges to optimal at the limit
  11. In their empirical results they use zero-mean 1-D Gaussian processes.  They test this method, random sampling, and another GP-based method (SKO?, which attempts to find the length-scale online; theirs just weights a bunch of provided kernels, finding the best one)
    1. For the differing kernels, they used linear fits but with different bandwidths
    2. Their stuff does very well, although with larger length-scales SKO works better.  For GPs that actually have very short length-scales, KGCB is better.
  12. Also, SKO is less general, and only works well on certain classes of functions (certain covariance functions cause trouble, for example).  Seems to be a little more fragile.
  13. I respect them for pointing out situations where SKO outperforms KGCB
  14. In an energy storage/retrieval problem, their method outperforms SKO, although with larger amounts of noise the bounds begin to overlap.  In this problem KGCB converged more quickly, while in some of the earlier results SKO converged more rapidly, so depends on the domain.
  15. “Although our policy performs very well in the numerical experiments, there is a caveat.  Kernel estimation is known to suffer from the curse of dimensionality as the MSE [is] proportional to h^d where h is the bandwidth and d is the number of dimensions.  If observations lie in high dimensional spaces, non-parametric estimation is known to have poor performance.  Because of these reasons, the efficiency of our estimation method also degenerates in higher dimensions.  Additive models might be used to handle this curse but this requires making more assumptions on the structure of the function.”
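
For reference, a hedged sketch of the knowledge-gradient selection rule in its simplest form, for independent normal beliefs over a finite set of alternatives; the paper's KGCB uses kernel-aggregated (correlated) beliefs, which is more involved, so this only conveys the flavor of "sample where one more measurement is expected to improve the best estimate the most".

```python
import numpy as np
from scipy.stats import norm

def kg_choose(mu, var, noise_var):
    """mu, var: posterior mean/variance per alternative; noise_var: measurement noise variance."""
    sigma_tilde = var / np.sqrt(var + noise_var)                  # std. of the change in the mean
    best_other = np.array([np.max(np.delete(mu, i)) for i in range(len(mu))])
    zeta = -np.abs(mu - best_other) / sigma_tilde
    kg = sigma_tilde * (zeta * norm.cdf(zeta) + norm.pdf(zeta))   # expected value of one more sample
    return int(np.argmax(kg))                                     # measure where KG is largest
```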

Qualitative Hybrid Control of Dynamic Bipedal Walking. Ramamoorthy, Kuipers. RSS 2006.

  1. Approach is concerned with dynamic walking on uneven terrain
  2. A common approach is to formulate the walking problem as a linear system which can be solved with a number of methods.  This is problematic because it reduces energy efficiency, overly constrains the type of gaits that can be found, and requires constant compensation for what the body naturally does (which is what introduces the nonlinearity), and that compensation introduces other forms of complexity
  3. Dynamic walkers utilize, instead of overcome, the inherent nonlinearities involved in legged locomotion
  4. The idea is to only introduce small corrections into the trajectories that will naturally occur in the system
  5. Discuss motivation based on studies of how animals walk, and how humans learn to walk.  Idea is that planning is done in some simpler subspace and then mapped back to the original problem.  This paper is motivated by trying to find simple methods of walking on uneven terrain that can then be mapped back to more complex versions of the problem.
  6. Talk about Poincare maps and limit cycles for walking on smooth terrain, but these approaches don’t transfer well to walking on uneven terrain.
  7. Method here proposes to use trajectory segments that can be searched over in order to walk on complex terrain
  8. They present their mathematical representation of a pendulum walker – three point masses, one at the hip and one at each foot
  9. In the swing phase, one foot is planted and there are only two moving masses, which is actually equivalent to the acrobot (only the hip is actuated).  The distinction is that it is assumed the legs can retract or extend instantaneously to clear obstacles.
  10. Here, impulses are only provided during double support (when both feet contact the ground), so the problem is very highly constrained.
  11. It is modeled as a hybrid (discrete-mode) system, with modes corresponding to whether one or two feet are planted on the ground (a rough sketch of this structure is after this list)
  12. Talk about the fact that a driven pendulum is chaotic, so getting everything to synchronize correctly (between the two separate systems of swinging and standing, and dealing with complications due to obstacles) may be difficult
    1. The transition to chaos is gradual, “As the driving amplitude is smoothly varied from a small value, only a few trajectories near critical points and the separatrix are affected.”
  13. They want to try and avoid situations that could become chaotic, do so by maintaining constraints that keep the system non-chaotic
  14. Again, forces are only applied during the double support phase.  After that the leading leg is retracted (and presumably later expanded)
  15. The method requires solving a sequential quadratic program, and is admittedly not appropriate for on-line use; the idea is to compute off-line and then use some method of storing the policy
  16. The result settles into a stable limit cycle when walking on flat terrain
  17. Have some experiments walking on terrain that is flat, inclining, and of random heights
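
A rough sketch of the hybrid structure described in items 9–11, not the paper's model: integrate the passive (acrobot-like) swing phase until foot strike, then apply a push-off impulse during double support. `swing_dynamics`, `heel_strike`, and `push_off` are hypothetical placeholders for the walker's equations of motion, impact detection, and impulse map.

```python
import numpy as np
from scipy.integrate import solve_ivp

def walk(x0, swing_dynamics, heel_strike, push_off, n_steps=10, t_max=2.0):
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        # Swing phase: continuous dynamics; impulses are reserved for double support.
        def event(t, state):
            return heel_strike(state)    # e.g. swing-foot height crossing zero
        event.terminal = True
        sol = solve_ivp(swing_dynamics, (0.0, t_max), x, events=event)
        x = sol.y[:, -1]
        # Double support: the only phase in which an impulse is applied (item 10).
        x = push_off(x)
    return x
```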

STOMP: Stochastic Trajectory Optimization for Motion Planning. Kalakrishnan, Chitta, Theodorou, Pastor, Schaal

  1. Another planner that uses optimization to minimize cost
  2. Performs local search by starting with an initial candidate trajectory and then introducing noise to sample nearby it:
    1. STOMP is an algorithm that performs local optimization, i.e. it finds a locally optimum[al] trajectory rather than a global one.  Hence, performance will vary depending on the initial trajectory used for optimization.  STOMP cannot be expected to solve typical motion planning benchmark problems like the alpha puzzle in a reasonable amount of time.

  3. Compares itself to the algorithm CHOMP, which is similar but CHOMP is a gradient-based method.  Because STOMP is not a gradient method, and has some randomized exploration, it has a better ability to escape local optima than CHOMP
  4. Because it is not a gradient method, STOMP can be used with wider classes of cost functions (such as hard constraints on angles or torques, which many methods have trouble with), and more general constraints
    1. Hard limits can also cause local optima, so while the method doesn't just break in these settings, it is possible convergence will not be to optimal results
  5. Basic motion planning algorithms generally require smoothing and other further transformations after a solution is found in order to maintain energy efficiency and minimize jerk (and motor wear); also, abrupt changes produced by classical motion planning methods can be disconcerting to humans.
    1. A result from a classical motion planner could be used as the initial proposal for this algorithm
  6. In contrast to the classical motion planning approach (which they call sampling-based), are optimization based planners which minimize cost
  7. Seems like this method also uses inverse kinematics, and runs on a domain that is *not* underactuated and does not exhibit drift: “Specifically, we consider trajectories of a fixed duration T, discretized into N waypoints, equally spaced in time.”
  8. Also does an EM-style update, and weights samples according to their goodness (a simplified sketch of the update is after this list): “The trajectory updates can be considered safer than a standard gradient descent update rule.  The new trajectory is essentially a convex combination of the noisy trajectories which [have] already been evaluated, i.e. there are no unexpected jumps to unexplored parts of the state space due to a noisy gradient evaluation.”
  9. The only open parameter to the algorithm is the exploration noise
  10. In settings where optimization is done in a multidimensional space, each dimension is dealt with independently
  11. They do a  grasping task on a PR2 in simulation and on the real robot
    1. CHOMP fails to produce an effective plan about 30% of the time, while STOMP is always successful, probably due to the bit of randomized exploration STOMP does.
  12. They mention a paper that introduces noise into CHOMP to help get around local minima, but that it was difficult to use and implement, with many parameters that needed to be tuned
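
A simplified sketch of the noisy update, assuming a trajectory of N waypoints in d joint dimensions and a user-supplied cost function; the real algorithm weights rollouts per time step and smooths the update through a finite-difference covariance matrix, which is omitted here, and the noise scale and temperature are illustrative.

```python
import numpy as np

def stomp_like_update(theta, cost_fn, n_samples=20, noise_std=0.1, temperature=10.0):
    """theta: (N, d) trajectory.  Returns a cost-weighted convex combination of noisy rollouts."""
    eps = noise_std * np.random.randn(n_samples, *theta.shape)       # exploration noise
    costs = np.array([cost_fn(theta + e) for e in eps])              # score each noisy trajectory
    w = np.exp(-temperature * (costs - costs.min()) / (costs.max() - costs.min() + 1e-9))
    w /= w.sum()                                                     # low-cost rollouts dominate
    return theta + np.tensordot(w, eps, axes=1)                      # convex combination, no gradient
```

Because the update never leaves the convex hull of trajectories that were actually evaluated, this is the "no unexpected jumps" property the quote in item 8 refers to.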

Progress in Algorithmic Motion Planning and Opportunities at the Intersection with Perceptual Science. Kostas Bekris talk.

  1. Interests:
    1. High-quality plans that
    2. Respect the physical constraints of the system
    3. Real-time constraints are important
      1. Real-time planning may require short planning horizons which can lead to situations where collision cannot be avoided
    4. Multi-agent planning
    5. Interaction with humans
  2. Methods cited for control theory: LQR, partial feedback linearization, HJB and Pontryagin's principle
  3. In late 70s, PSPACE-Hardness was established (Reif ’79, Schwartz & Sharir 82, 84)
  4. Cites Canny’s work (I just grabbed his thesis from the library), talked about roadmaps, but the algorithm presented was doubly-exponential, never even implemented.
  5. PRMs were introduced by Kavraki, Bekris' advisor, in 1996.  An efficient way of producing roadmaps
  6. In the general case, planning without respecting dynamics and then trying to smooth the path so it complies with dynamics is not possible
  7. He is interested in finding computationally efficient near-optimal results, as opposed to results that are optimal only in the limit with infinitely dense graphs
    1. Do dense planning and then prune
  8. His general goal is PAC-style planning
    1. The results in the field are almost entirely at the limit
  9. Under differential constraints, PRMs are difficult because of the boundary value problem (BVP); RRTs don't have that problem
  10. Methods of encouraging efficient exploration (otherwise the acrobot, for example, just hangs down under random actions)
  11. Can do replanning in domains that are non-static
  12. Multi-agent path planning is poly-time (non-optimal) in discrete domains
  13. Future work in the area of co-robotics (robot-human interaction), maybe from the perspective of game theory, but they want Pareto optimality, which introduces a number of other problems, of course
  14. They may get a BAXTER(!)
  15. Minimax regret: Savage ’51, Niehans ’48
    1. Good for human interaction, because minimax-regret results agree more with human intuition than game-theoretic results (a tiny worked example is after this list)
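
A tiny worked example of the minimax-regret criterion, with a made-up payoff matrix: the regret of an action under a scenario is the gap to the best action for that scenario, and you pick the action whose worst-case regret is smallest.

```python
import numpy as np

payoff = np.array([[3.0, 1.0],    # rows: actions, columns: possible scenarios
                   [2.0, 2.0],
                   [0.0, 4.0]])
regret = payoff.max(axis=0) - payoff        # per-scenario regret of each action
worst_regret = regret.max(axis=1)           # worst case over scenarios: [3, 2, 3]
best_action = int(np.argmin(worst_regret))  # -> 1, the "hedging" middle action
```

With a uniform prior all three actions have the same expected payoff here, but minimax regret uniquely picks the middle, intuitively "safe" action.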

Cross-Entropy open loop planning applied to noisy humanoid walking

In this experiment, both the domain and the generative model are noisy.  Noise is introduced by applying uniformly distributed noise to one randomly selected joint at each time step.  The noise is not “well-behaved”, as it is not Gaussian and not zero-mean.  I believe the results look better than the previous post, where the domain was noisy but the generative model was not.
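
For context, a sketch of what cross-entropy open-loop planning looks like under a noisy generative model; `simulate` is a hypothetical rollout function that returns the (noisy) return of an action sequence, and the population size, elite fraction, and iteration count are illustrative.

```python
import numpy as np

def cem_plan(simulate, horizon, action_dim, iters=20, pop=64, elite_frac=0.2):
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        plans = mu + sigma * np.random.randn(pop, horizon, action_dim)
        returns = np.array([simulate(p) for p in plans])    # noisy model, noisy scores
        elite = plans[np.argsort(returns)[-n_elite:]]       # keep the highest-return plans
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mu   # open-loop action sequence; in practice re-planned from the current state
```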

Sufficient Conditions for the Existence of Resolution Complete Planning Algorithms. Yershov, LaValle. Proceedings Workshop on Algorithmic Foundations of Robotics (WAFR), 2010

  1. Kobilarov cited this paper saying the following: “Conditions for resolution completeness of planning with such primitives [parameterizations of open-loop control with actions and corresponding time durations] have been established.”
  2. Establishes metric spaces over the control and trajectory spaces
  3. Based on these metrics, if the domain is Lipschitz, then the mapping between open loop control and trajectories is continuous
  4. Because of this, it is possible to search the space of actions directly, as opposed to building a reachability graph (a rough sketch of this idea is after this list).
  5. This algorithm has the guarantee most motion planning algorithms have: if a solution exists, it will be found eventually
  6. Not reading this carefully, because I think I already got what I need out of this paper.  Most of it is proofs, which I don't really care about – just what is proved.
  7. Requires continuity, which goes with Lipschitz smoothness
  8. Algorithm is very simple
  9. The algorithm effectively deals with noise, as there is a term that accounts for error arising from numerical integration.
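
A rough sketch of how I read the resolution-complete idea: enumerate progressively finer discretizations of piecewise-constant open-loop controls and simulate each candidate; Lipschitz continuity of the control-to-trajectory map is what makes refining the discretization eventually reach a solution if one exists. `simulate` and `near_goal` are hypothetical, and a full version would also refine the action levels and time step.

```python
import itertools
import numpy as np

def resolution_complete_search(simulate, near_goal, u_levels, max_depth=6, dt=0.5):
    for depth in range(1, max_depth + 1):                 # longer / finer plans each pass
        for seq in itertools.product(u_levels, repeat=depth):
            traj = simulate(np.array(seq), dt)            # piecewise-constant open-loop control
            if near_goal(traj):
                return np.array(seq)
    return None                                           # nothing found at this resolution yet
```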

Anytime Computation of Time-Optimal Off-Road Vehicle Maneuvers using the RRT*. Jeon, Karaman, Frazzoli. Decision and Control.

  1. An empirical application of RRT*, this paper finds trail-braking as something that naturally emerges from RRT*
  2. I was excited because I thought this application would use stochasticity, but looks like it is a deterministic simulation.
    1. They give the dynamics equations
  3. There have been methods that applied numerical optimization from optimal control theory to this domain, but it requires “careful transcription to numerical optimization formulations.  Most algorithms of this class must be started with a feasible initial solution, which is usually hard to generate in the first place.”
    1. Also some suffer from local optimality (may be same as requiring initial solution)
    2. Cites applications of optimal control to the domain, as well as a paper that compares other methods
  4. In this application, they plan in the “task space”, which is some lower dimensional representation of the original problem that doesn’t throw out information
  5. Plan in state space, and then require a method which will near-optimally nearly-steer between the two states
    1. The one they use is based on shooting; it requires another function that subdivides the input and increments or decrements it as needed (a loose sketch is after this list)
    2. That is a pretty difficult problem in itself
    3. Reminds me of binary action search
  6. Uses branch and bound/admissible heuristics
  7. Approach requires distance metric, as all RRT methods do
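
A loose sketch of the shooting/bisection flavor described in item 5 (my reading, not the paper's steering function): adjust a scalar control parameter up or down, halving the interval each time, until the simulated endpoint is close enough to the target. `rollout` is a hypothetical forward model, and a real steering function would compare full states under the domain's distance metric.

```python
import numpy as np

def steer(rollout, x0, x_target, u_lo=-1.0, u_hi=1.0, iters=30, tol=1e-3):
    for _ in range(iters):
        u_mid = 0.5 * (u_lo + u_hi)
        x_end = rollout(x0, u_mid)        # forward-simulate (shoot) with this input
        err = x_end - x_target
        if np.linalg.norm(err) < tol:
            return u_mid
        if err.sum() > 0:                 # overshoot: decrement the input
            u_hi = u_mid
        else:                             # undershoot: increment the input
            u_lo = u_mid
    return 0.5 * (u_lo + u_hi)
```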

Adaptive Path Planning in a Dynamic Environment using a Receding Horizon Probabilistic Roadmap: Experimental Demonstration

  1. If dynamics and constraints are nonlinear, there is no closed-form analytical solution
  2. Gradient and potential field methods can be useful in nonlinear domains, with restricted observability, but are strongly influenced by initial search location, as nonlinear constraints can introduce many maxima/minima
    1. Gradient methods may require special differentiation because of non-smooth constraints
  3. The main idea of the paper is to use PRMs to build a global connectivity graph, which may need to have edges removed as the agent only senses obstacles nearby, so replanning is still needed when locations that were thought to be free turn out to be blocked (a sketch of this loop is after this list)
    1. Planning is done purely in state space, then motion primitives stitch the states together.
  4. They plan with a quadratic cost function
  5. They use a helicopter domain for which they provide simple dynamics equations
  6. The motion primitives used for planning ensure the helicopter does not enter a bad state
  7. They also tested the approach on an actual helicopter
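
A sketch of the replanning loop described in item 3, using networkx as a stand-in for the roadmap; `sense_blocked_edges` is a hypothetical perception callback that reports nearby roadmap edges found to be blocked, and the motion-primitive execution between waypoints is abstracted away.

```python
import networkx as nx

def receding_horizon_prm(roadmap, start, goal, sense_blocked_edges):
    node = start
    while node != goal:
        path = nx.shortest_path(roadmap, node, goal, weight="cost")   # plan on the global PRM
        for u, v in sense_blocked_edges(node):                        # obstacles seen only locally
            if roadmap.has_edge(u, v):
                roadmap.remove_edge(u, v)                             # edge was thought to be free
        # If the current path survived the sensor update, take one step; otherwise replan.
        if all(roadmap.has_edge(a, b) for a, b in zip(path, path[1:])):
            node = path[1]
    return node
```

(If an edge removal disconnects start from goal, `nx.shortest_path` raises an exception; a real system would fall back to rebuilding the roadmap.)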