Model-Based Reinforcement Learning in Continuous Environments Using Real-Time Constrained Optimization. Andersson, Heintz, Doherty. AAAI 2015 | Ari Weinstein's Research

Working on high-D continuous RL

Builds a model with sparse Gaussian processes, and then does local (re)planning “by solving it as a constrained optimization problem”

Uses MPC/control-related methods developed back around '04, revisited here; with modern solvers they can now be used for real-time control

Test in “extended” cart-pole <all this means here is the start state is randomized> and quadcopter

Don’t try to do MCTS, because it is expensive. Instead use gradient optimization

Instead of the normal O(n^3) cost for GPs, the sparse approximation costs O(m^2 n), where m < n
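To make the scaling concrete, here is a minimal sketch of one standard sparse-GP variant (Subset of Regressors with m inducing points; the paper doesn't pin down which approximation it uses, so this is illustrative only — the kernel, inducing points `Z`, and noise level are my choices):

```python
import numpy as np

def sor_gp_fit(X, y, Z, sigma2=1e-2, ell=1.0):
    """Subset-of-Regressors sparse GP fit: O(m^2 n) instead of O(n^3).

    X: (n, d) training inputs, y: (n,) targets, Z: (m, d) inducing points.
    Returns weights over the inducing points for mean prediction.
    """
    def k(A, B):
        # Squared-exponential kernel with lengthscale ell
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / ell**2)

    Kmn = k(Z, X)                     # (m, n): O(mn) to build
    Kmm = k(Z, Z)                     # (m, m)
    # Solve (sigma^2 Kmm + Kmn Kmn^T) w = Kmn y: the Kmn Kmn^T product
    # is the O(m^2 n) step; the solve itself is only O(m^3).
    A = sigma2 * Kmm + Kmn @ Kmn.T
    return np.linalg.solve(A, Kmn @ y)

def sor_gp_predict(Xs, Z, w, ell=1.0):
    # Predictive mean is a kernel expansion over the m inducing points only
    d2 = ((Xs[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2) @ w
```

All O(n^3) work on the full kernel matrix is avoided; only cross-covariances against the m inducing points are ever formed.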

“However, as only the immediately preceding time steps are coupled through the equality constraints induced by the dynamics model, the stage-wise nature of such model-predictive control problems result in a block-diagonal structure in the Karush-Kuhn-Tucker optimality conditions that admit efficient solution. There has recently been several highly optimized convex solvers for such stage-wise problems, on both linear (Wang and Boyd 2010) and linear-time-varying (LTV) (Ferreau et al. 2013; Domahidi et al. 2012) dynamics models.”
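The classic way to see why that stage-wise structure helps: for the unconstrained quadratic case the block-banded KKT system can be solved by a Riccati backward pass, with cost linear in the horizon T instead of cubic in T for a monolithic factorization. A minimal sketch (plain finite-horizon LQR, not the paper's actual solver):

```python
import numpy as np

def lqr_backward_forward(A, B, Q, R, x0, T):
    """Solve the unconstrained stage-wise QP (finite-horizon LQR) via a
    Riccati backward recursion: one small solve per stage, so the cost is
    O(T) stage solves rather than factoring one (T*n)-sized KKT matrix."""
    P = Q.copy()                       # terminal cost-to-go
    Ks = []
    for _ in range(T):                 # backward pass
        S = R + B.T @ P @ B
        K = np.linalg.solve(S, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        Ks.append(K)
    Ks.reverse()                       # gains in forward-time order
    xs, us = [x0], []
    x = x0
    for K in Ks:                       # forward rollout under u = -K x
        u = -K @ x
        x = A @ x + B @ u
        us.append(u)
        xs.append(x)
    return xs, us
```

The constrained solvers cited in the quote keep this same per-stage block structure while also handling inequality constraints.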

Looks like the type of control they use has to linearize the model locally

“For the tasks in this paper we only use quadratic objectives, linear state-action constraints and ignore second order approximations.”
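As a toy illustration of that setup — quadratic objective, linear (box) constraints on the actions — here is a single-shooting MPC step on hypothetical double-integrator dynamics, using scipy's generic SLSQP routine purely as a stand-in for the specialized stage-wise solvers the paper relies on:

```python
import numpy as np
from scipy.optimize import minimize

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # toy double-integrator dynamics
B = np.array([[0.0], [0.1]])
T = 20                                    # planning horizon
x0 = np.array([1.0, 0.0])                 # start: unit offset, zero velocity

def cost(u_flat):
    """Quadratic tracking cost accumulated along a rollout of the model."""
    u = u_flat.reshape(T, 1)
    x, c = x0, 0.0
    for t in range(T):
        c += x @ x + 0.1 * float(u[t] @ u[t])   # quadratic stage cost
        x = A @ x + B @ u[t]
    return c + x @ x                            # terminal cost

# Linear state-action constraints reduce here to action bounds |u_t| <= 1
res = minimize(cost, np.zeros(T), method="SLSQP",
               bounds=[(-1.0, 1.0)] * T)
```

A generic NLP solver like this works but ignores the block-diagonal KKT structure; the stage-wise solvers get the same answer much faster, which is what makes the real-time claim plausible.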

Use an off-the-shelf convex solver for the MPC optimization

Use warm starts for replanning
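The usual warm-start trick in receding-horizon control (the notes don't say exactly which scheme the paper uses, so this is the standard one): initialize each replan from the previous plan, shifted by one step:

```python
import numpy as np

def shift_warm_start(U_prev):
    """Common warm-start for MPC replanning: drop the control that was just
    executed, shift the remaining plan forward one step, and duplicate the
    last control to fill out the horizon."""
    return np.vstack([U_prev[1:], U_prev[-1:]])
```

Since consecutive problems differ only slightly, starting near the previous optimum is what lets the optimization converge in a handful of iterations.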

The optimization converges in a handful of steps

<Say they didn’t need to do exploration at all for the tasks they considered, but it looks like they have a pure random action period at first>

Although the cart-pole is a simple task, they learn it in less than 5 episodes

<But why no error bars, especially when this experiment probably takes only a few seconds to run? That's crazy in a paper from 2015. It's probably fine, but it makes me wonder if it sometimes fails to get a good policy>

Use some domain knowledge to make learning the dynamics for the quadcopter a lower-dimensional problem

8D state, 2D action

For the quadcopter, training data (from a real quadcopter?) is fed in, and then the system is run in simulation

“By combining sparse Gaussian process models with recent efficient stage-wise solvers from approximate optimal control we showed that it is feasible to solve challenging problems in real-time.”