- This was the first time a helicopter autonomously performed a number of these maneuvers, and also the first time they could be performed back to back without a break to reset in between
- The learning process is as follows:
- Use apprenticeship learning (expert demonstrations) to collect data, as other forms of exploration are too dangerous to the helicopter
- Based on that data, build a simulator, then find a policy in simulation
- If that policy is effective on the actual helicopter, stop. Otherwise, add the data from the failed real run to the model's data and repeat the previous step
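
A minimal sketch of the loop above, to make the control flow concrete. Everything here is a stand-in chosen for illustration and not taken from the paper: the "helicopter" is a hypothetical 1-D linear system (`TRUE_A`, `TRUE_B`), the expert is a noisy stabilizing controller, the simulator is a least-squares fit, and "planning" just picks the gain that cancels the modeled dynamics.

```python
import numpy as np

TRUE_A, TRUE_B = 1.2, 1.0  # hypothetical "real" dynamics, unknown to the learner


def rollout(policy, x0=1.0, steps=30):
    """Run a policy on the real system and log (x, u, x_next) rows."""
    rows, x = [], x0
    for _ in range(steps):
        u = policy(x)
        x_next = TRUE_A * x + TRUE_B * u
        rows.append((x, u, x_next))
        x = x_next
    return rows


def fit_model(rows):
    """Build the 'simulator': least-squares fit of x_next ~ a*x + b*u."""
    data = np.array(rows)
    coeffs, *_ = np.linalg.lstsq(data[:, :2], data[:, 2], rcond=None)
    return coeffs  # (a_hat, b_hat)


def plan(a_hat, b_hat):
    """Find a policy in simulation: choose k so that a_hat - b_hat*k = 0."""
    gain = a_hat / b_hat
    return lambda x: -gain * x


def is_effective(rows, tol=1e-3):
    """Call a flight effective if the final state is close to the target 0."""
    return abs(rows[-1][2]) < tol


def learn(max_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    # Stand-in expert: stabilizing, with noise so the logged data is informative.
    expert = lambda x: -0.5 * x + 0.1 * rng.standard_normal()
    data = rollout(expert)                  # collect expert data
    for iteration in range(1, max_iters + 1):
        policy = plan(*fit_model(data))     # model + policy in simulation
        trial = rollout(policy)             # try it on the real system
        if is_effective(trial):
            return policy, iteration
        data += trial                       # fold the failed run into the data
    raise RuntimeError("no effective policy found")


policy, iterations = learn()
```

Because the toy system really is linear and noise-free, the fitted model is essentially exact and the loop should stop almost immediately; the retry branch matters when the model class cannot capture the real dynamics from expert data alone.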

- According to “Exploration and Apprenticeship Learning in Reinforcement Learning” (Abbeel, Ng), this procedure converges to expert-level performance in a polynomial number of iterations
- They come up with a linear model for the dynamics
- Control is done at the 0.1 s time scale
- They use DDP (differential dynamic programming) for planning, which works as follows:
- Compute a linear approximation of the dynamics (according to Erez’ thesis, it can be quadratic, but I guess they are starting with linear dynamics here anyway) and a quadratic approximation of the reward function around the trajectory of the current policy
- Compute the optimal policy for the resulting LQR problem and use that policy as the current one
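
The LQR solve in the second step can be sketched as a backward Riccati recursion. This is a generic finite-horizon LQR written as cost minimization (cost = negative reward), not the paper's exact formulation. The double-integrator system, the weights `Q` and `R`, and the horizon are assumptions for illustration; only the 0.1 s time step is borrowed from the notes above.

```python
import numpy as np


def lqr_backward_pass(A, B, Q, R, horizon):
    """Finite-horizon LQR for x_{t+1} = A x_t + B u_t.

    Minimizes sum_t (x_t' Q x_t + u_t' R u_t) plus terminal cost x_T' Q x_T.
    Returns gains [K_0, ..., K_{T-1}] for the policy u_t = -K_t x_t.
    """
    P = Q.copy()  # cost-to-go matrix at the final time step
    gains = []
    for _ in range(horizon):
        # Gain minimizing the one-step cost plus the cost-to-go P.
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Riccati update: cost-to-go one step earlier.
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]  # computed backward in time, so reverse


# Hypothetical system: double integrator (position, velocity) at dt = 0.1 s.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.eye(2)          # assumed state cost
R = 0.1 * np.eye(1)    # assumed input cost

gains = lqr_backward_pass(A, B, Q, R, horizon=100)

# Roll the time-varying policy forward from an initial offset.
x = np.array([1.0, 0.0])
for K in gains:
    x = A @ x - B @ (K @ x)
```

In a DDP-style scheme, `A`, `B`, `Q`, and `R` would generally vary with time and be recomputed around the trajectory produced by the latest policy before solving again; here they are held fixed to keep the sketch short.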

- They note that most LQR formulations do not plan over horizons and aim to drive the state to the origin, neither of which is the case here, but that it is straightforward to set up the problem so it works in this setting
- Here they actually have a quadratic reward function set up already. The paper says they linearized the dynamics around the target trajectory in the first iteration, but I thought the dynamics were linear anyway (maybe some portions are not?)
- They add a term that penalizes changes in inputs, because the original controller exhibited a sort of bang-bang behavior. This penalty is a problem, however, for maneuvers that require quick transitions
- To get around this, they do one phase of noiseless planning and then another with noise, and only penalize the changes needed between the noiseless and noisy versions of the problem
- They also add a term similar to the integral term in PID control to help account for wind
- They hand-adjust the weights for each term of the penalty function to bring performance to a level similar to that of the expert
- They only needed about 5 minutes of flight data to perform the control
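
One common way to express a penalty on changes in inputs inside an LQR problem is to augment the state with the previous input and treat the change in input as the new control. The sketch below shows that construction. It is an assumption about how such a term can be set up, not a description of the paper's implementation, and the matrices and `change_weight` are made up for illustration.

```python
import numpy as np


def augment_for_input_changes(A, B, Q, change_weight):
    """Rewrite x' = A x + B u so that the control is du = u - u_prev.

    Augmented state z = [x, u_prev]. A quadratic cost on du then penalizes
    changes in the input rather than its magnitude.
    """
    n, m = B.shape
    A_aug = np.block([[A, B], [np.zeros((m, n)), np.eye(m)]])
    B_aug = np.vstack([B, np.eye(m)])
    Q_aug = np.block([[Q, np.zeros((n, m))],
                      [np.zeros((m, n)), np.zeros((m, m))]])
    R_aug = change_weight * np.eye(m)  # cost on du' du
    return A_aug, B_aug, Q_aug, R_aug


# Hypothetical system: double integrator at dt = 0.1 s.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
A_aug, B_aug, Q_aug, R_aug = augment_for_input_changes(A, B, np.eye(2), 5.0)

# One step of the augmented system matches the original dynamics.
x, u_prev, du = np.array([1.0, 0.0]), np.array([0.5]), np.array([0.2])
z_next = A_aug @ np.concatenate([x, u_prev]) + B_aug @ du
```

The augmented matrices can be handed to any standard LQR solver unchanged, which is the appeal of this formulation: the smoothness penalty becomes an ordinary input cost.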