Model-Based Reinforcement Learning in Continuous Environments Using Real-Time Constrained Optimization. Andersson, Heintz, Doherty. AAAI 2015

  1. Working on high-D continuous RL
  2. Builds a model with sparse Gaussian processes, and then does local (re)planning “by solving it as a constrained optimization problem”
  3. Use MPC/control related methods that were done back in ’04 but revisited here and can be used for real-time control now
  4. Test in “extended” cart-pole <all this means here is the start state is randomized> and  quadcopter
  5. Don’t try to do MCTS, because it is expensive.  Instead use gradient optimization
  6. Instead of normal O(n^3) costs for GPs, this has O(m^2n), whre m < n
  7. “However, as only the immediately preceding time steps are coupled through the equality constraints induced by the dynamics model, the stage-wise nature of such modelpredictive control problems result in a block-diagonal structure in the Karush-Kuhn-Tucker optimality conditions that admit efficient solution. There has recently been several highly optimized convex solvers for such stage-wise problems, on both linear (Wang and Boyd 2010) and linear-timevarying (LTV) (Ferreau et al. 2013; Domahidi et al. 2012) dynamics models.”
  8. Looks like the type of control they use has to linearize the model locally
  9. “For the tasks in this paper we only use quadratic objectives, linear state-action constraints and ignore second order approximations.”
  10. Use an off-the shelf convex solver for doing the MPC optimization
  11. Use warm starts for replanning
  12. The optimization converges in a handful of steps
  13. <Say they didn’t need to do exploration at all for the tasks they considered, but it looks like they have a pure random action period at first>
  14. Although the cart-pole is a simple task, they learn it in less than 5 episodes
    1. <But why no error bars, especially when this experiment probably takes a few seconds to run.  This is crazy in a paper from 2015, although it is probably fine it makes me wonder if it sometimes fails to get a good policy>
  15. Use some domain knowledge to make learning the dynamics for the quadcopter a lower-dimensional problem
    1. 8D state, 2D action
  16. For quadcopter there is training data from a real quadcopter? fed in and then it is run in simulation
  17. “By combining sparse Gaussian process models with recent efficient stage-wise solvers from approximate optimal control we showed that it is feasible to solve challenging problems in real-time.”

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: