Chapter 1

- The basic idea is to use numerical optimization and methods from optimal control to develop controllers for legged locomotion
- Addresses 3 main problems:
- Finding global policies in high-dimensional spaces is very difficult, so instead the work focuses on locally optimal solutions, whose cost is only polynomial in the state dimension (instead of exponential)
- Most control-theory methods can't deal with hard obstacles, because contact makes the problem nonsmooth. Techniques are introduced that allow the methods to be used in domains with obstacles
- Legged-locomotion problems are non-convex, with many local minima. This is addressed with a shaping method called homotopy continuation, which creates an easier version of the problem whose solution is then used as an initial guess for the harder problem
- This is particularly important because you want to start the optimization at a point that will hill-climb to the global optimum
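The continuation idea above can be sketched in a toy 1-D setting. This is my own illustration, not the dissertation's code: the objectives, the blend schedule, and the plain gradient-descent inner loop are all my choices.

```python
def hard_grad(x):
    # gradient of the nonconvex target x^4 - 8x^2 + 2x
    # (global minimum near x = -2.06, local minimum near x = 1.94)
    return 4 * x**3 - 16 * x + 2

def easy_grad(x):
    # gradient of a convex surrogate, x^2
    return 2 * x

def descend(grad, x, steps=500, lr=0.01):
    # plain gradient descent; stands in for the local optimizer
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def continuation(x0, levels=21):
    # homotopy continuation: sweep from the easy problem to the hard one,
    # warm-starting each level with the previous level's solution
    x = x0
    for i in range(levels):
        s = i / (levels - 1)  # blend weight: 0 = easy, 1 = hard
        x = descend(lambda x, s=s: (1 - s) * easy_grad(x) + s * hard_grad(x), x)
    return x
```

Starting at x = 3.0, plain descent on the hard objective gets stuck in the local minimum near 1.94, while the continuation sweep tracks the blended minimum and ends near the global minimum at about -2.06.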

- Paper is half in RL and half in optimal control
- Legged walking has discontinuous dynamics, because motion when the leg is in free space is very different from when it is in contact with the ground. There are ways to try and work around this, but they introduce other difficulties. Basically, the problem should be represented as a hybrid system, but optimal control can't deal with such problems (p.3)
- Aside from rigid contacts, joint limits also present another type of discontinuity.

- The approach here uses two different algorithms for contact smoothing. If the problem is treated as a Linear Complementarity Problem (there is a paper I have to read for the details), and the dynamics are stochastic, the expected next state is a smooth function of the current state
- Another issue in walking is that acceptable behavior is "sparse" in policy space: because it is easy to fall down, there is only a small set of policies that will actually cause the agent to walk. This is not true in the "swimmer" domain, because the worst thing that can happen is that the agent simply doesn't move (or moves slowly), so that task is not sparse in terms of acceptable behavior in policy space
- This was also my intuition when looking at all the policy-search RL papers that deal with controlling a multi-link arm (additionally, I believe it is not in an environment with gravity), which makes gradient methods much easier to use
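The "expected next state is smooth" claim above can be seen in a 1-D caricature. This is my own illustration, not the thesis' LCP contact model: `next_state` is a hard clip at the ground, and the Gaussian expectation smooths out the kink.

```python
import math, random

def next_state(x):
    # hard unilateral contact: the state is clipped at the ground (x = 0),
    # a nonsmooth kink that breaks derivative-based optimizers
    return max(x, 0.0)

def expected_next_state(x, sigma=0.1, n=100_000):
    # Monte-Carlo estimate of E[next_state(x + noise)] under Gaussian noise
    rng = random.Random(0)
    return sum(next_state(x + rng.gauss(0, sigma)) for _ in range(n)) / n

def smooth_next_state(x, sigma=0.1):
    # the same expectation in closed form: x*Phi(x/sigma) + sigma*phi(x/sigma),
    # a smooth, softplus-like function of the current state
    z = x / sigma
    Phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    phi = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
    return x * Phi + sigma * phi
```

Far from the contact the smoothed map agrees with the hard one (near 0 for very negative x, near x for very positive x); only in a band of width ~sigma around the kink do they differ.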

- The sparseness of good policies here is dealt with by shaping (this is the homotopy-continuation method)
- All the results in the dissertation use shaping (p.5)

- He mentions that policy search is popular for designing motor behavior in high dimensional systems, but these methods require that the policy can be represented by some parameterized function, which usually has to be reasoned about carefully a priori. This is useful because the practitioner can engineer knowledge into the solution, but is problematic because the solutions are not general.
- The work here is more flexible and doesn’t require expectations about optimal behavior

- These methods assume the state is fully observable
- For DDP, they compute a library of optimal trajectories offline and then use that to construct a locally linear controller (p.6)
- Chs 4 + 5 use Bayesian methods, Ch 6 requires inverse dynamics (they use this for the 3-d locomotion)

Chapter 2

- Notation: the state is *x*, and the state at time *k* is *x^k*. The action is *u*, and the action at step *k* is *u^k*. Dynamics are denoted by *f*(·); the next state is *x′* (p.8)
- Noise is assumed to be normal (p.9)
- Trajectories start at an initial state *x^1*; a control sequence *U* = *u^{1:N-1}* yields the trajectory *X* = *x^{1:N}*
- The policy is the same as in RL: *π(x)*. The policy and the initial state determine a trajectory according to *x^{k+1}* = *f(x^k, π(x^k))*
- Cost is defined in terms of *l(x, u)*; there may be a terminal cost function *l_N(x)* at the last state of a trajectory
- Value is the same, *V\**, and the optimal policy is *π\**. Values of finite-horizon policies are lower-indexed
- The Q-function is also called the Hamiltonian (p.11)
- Infinite-horizon cost looks at average value, not discounting
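The trajectory recursion in the notation above is trivial to write down; here is a sketch with toy scalar dynamics and a proportional policy of my own choosing.

```python
def rollout(f, pi, x1, N):
    # Roll out the recursion x^{k+1} = f(x^k, pi(x^k)) from the initial
    # state x^1, returning X = x^{1:N} and the controls U = u^{1:N-1}
    X, U = [x1], []
    for _ in range(N - 1):
        u = pi(X[-1])
        U.append(u)
        X.append(f(X[-1], u))
    return X, U

# toy example: scalar integrator dynamics with a proportional policy
f = lambda x, u: x + u
pi = lambda x: -0.5 * x
X, U = rollout(f, pi, 4.0, 5)   # X = [4.0, 2.0, 1.0, 0.5, 0.25]
```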
- **Local algorithms of optimal control fall into two broad categories: sequential and simultaneous.** (p.13)
- The finite-horizon sequential approach results in optimizing the action sequence *U* (check the Diehl reference). **He says the problem with this is that it cannot represent (or optimize) limit cycles. What are they (I think it has to do with periodicity of the policy)? yes**

- **Sequential representation underpins the DDP algorithm** (p.14)
- I don't quite grok the simultaneous representation. I think it will be discussed more in ch 4.

- Sequential search actually requires fewer variables, and is guaranteed to be dynamically consistent (I don't know what that is). It isn't well suited to periodic planning
- For the methods in this paper, efficient computation of dynamics is important (p.16)
- The methods in the dissertation make extensive use of the derivatives of the dynamics of the bodies being controlled (done by finite-differencing; I don't know what that is), which requires re-computing the dynamics at a cloud of points around the current state. Other bits I don't really get, but I'm not sure it matters (p.17)
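The finite-differencing mentioned above is just this: perturb one state coordinate at a time by a small epsilon, re-evaluate the dynamics at that cloud of nearby points, and take difference quotients. A minimal sketch with made-up toy dynamics:

```python
def f(x, u):
    # toy dynamics used only for illustration
    return [x[0] * x[1], x[0] + u]

def fd_jacobian(f, x, u, eps=1e-6):
    # finite-difference estimate of df/dx: each column j comes from
    # re-evaluating f with the j-th state coordinate nudged by eps
    f0 = f(x, u)
    J = [[0.0] * len(x) for _ in range(len(f0))]
    for j in range(len(x)):
        xp = list(x)
        xp[j] += eps
        fj = f(xp, u)
        for i in range(len(f0)):
            J[i][j] = (fj[i] - f0[i]) / eps
    return J
```

For the toy `f` above, the exact Jacobian at x = (2, 3) is [[3, 2], [1, 0]], and the finite-difference estimate matches it to within the truncation error.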

Chapter 3: Differential Dynamic Programming

- The algorithm basically works by iteratively doing backward and forward passes. (p.21)
- In the backward pass, it integrates a quadratic approximation of the value function backward along the trajectory, modifying the policy along the way
- In the forward pass, a new trajectory is rolled out using the improved policy

- Backward Pass:
- The input to the backward pass is an action sequence, the resulting state trajectory, and return
- This pass produces a quadratic approximation of the value function, and tries to compute a locally optimal action sequence. It does so by stepping backward along the previous trajectory and performing improvements
- Each step of the pass produces a quadratic estimate of the value function at *x*, based on the quadratic estimate at the next state in the trajectory, *x′*; this is where the dynamic programming happens
- Then a second-order Taylor expansion of the Q-function is performed
- Then its derivative with respect to the action is taken (p.21), which looks like a gradient step on the action, producing a new policy, and then a new value function for the state which precedes it in the rollout

- The forward pass updates its policy based on the change in the Q function (p.23)
- This process repeats until no improvement (or degradation) occurs
- In terms of convergence, DDP can be compared to doing naive Newton optimization over the multidimensional *U*, but DDP does so faster (p.24)
- The computational requirement of one iteration of DDP is O(Nm^3), which comes from inverting the Q matrix at each step of the backward pass
- In practice though, the most expensive part is the approximation of the quadratic model of *f*(·), which is O(N|S|^5)
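The backward pass is easiest to see in the special case of linear dynamics and quadratic cost, where the quadratic value model is exact and DDP reduces to the classic LQR recursion. A sketch under those assumptions (my own code, not the thesis'); the m×m solve against `Quu` at each step is where the O(Nm^3) above comes from.

```python
import numpy as np

def backward_pass(A, B, Q, R, N):
    # DDP's backward pass specialised to linear dynamics x' = A x + B u
    # and quadratic cost x^T Q x + u^T R u: the value function stays
    # exactly quadratic, V_k(x) = x^T P_k x, and the pass returns the
    # feedback gains of the optimal policy u^k = -K_k x^k.
    P = Q.copy()                      # terminal value from the final cost
    gains = []
    for _ in range(N - 1):
        Quu = R + B.T @ P @ B         # curvature of the Q-function in u
        Qux = B.T @ P @ A             # cross term between action and state
        K = np.linalg.solve(Quu, Qux) # minimize the quadratic model over u
        P = Q + A.T @ P @ A - Qux.T @ K   # value one step earlier
        gains.append(K)
    gains.reverse()                   # gains[k] now belongs to step k
    return gains
```

For a long horizon the early gains approach the stationary LQR gain, and the resulting closed loop A - B·K is stable.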

- With a highly optimized, custom built physics engine, computing the derivatives of the dynamics takes 90% of computation time. The backward pass takes 9%, and the forward pass 1%. On a vanilla physics engine, computing the derivative takes 99.9% of the time. (p.25)
- Divergence of the value function can occur if the estimate of the Q-function is not positive definite. To fix that, a constant times the identity matrix is added to the Q matrix, which shifts the eigenvalues up to make it positive definite
- This introduces a parameter λ
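The λ-times-identity fix above can be sketched as follows. The Cholesky-as-positive-definiteness-test and the multiply-λ-by-a-constant retry schedule are common Levenberg-Marquardt-style heuristics, my choices rather than the thesis' exact procedure.

```python
import numpy as np

def regularized_solve(Quu, g, lam=1e-6, factor=10.0, lam_max=1e10):
    # Add lam * I to the quadratic model until it is positive definite
    # (checked via Cholesky, which raises on a non-PD matrix), then solve
    # the regularized system for the policy update.
    m = Quu.shape[0]
    while lam < lam_max:
        try:
            np.linalg.cholesky(Quu + lam * np.eye(m))  # PD test
            return np.linalg.solve(Quu + lam * np.eye(m), g), lam
        except np.linalg.LinAlgError:
            lam *= factor      # not PD yet: raise lam and retry
    raise RuntimeError("Quu could not be made positive definite")
```

On an indefinite matrix (one negative eigenvalue), the loop keeps raising λ until the shifted eigenvalues are all positive, then returns a well-defined update.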

- Another method is possible, which performs regularization by penalizing trajectories that move far from the previous states (p.26)
- Additionally, other mechanics are needed to make sure the forward pass doesn’t shift too quickly from the last one. This is a problem because dynamics are estimated to be locally quadratic, so moving far away leads to a model which is wrong. (p. 27)
- This can also lead to divergence; the safeguard, a search over the step size of the policy update, is what they call line search
- Looks like this introduces a new parameter that has to be controlled (α)

- Because of the parameters introduced to prevent divergence, heuristics are used to adjust them to help convergence (p.28). For example, if λ is too small, the optimization may diverge; if too large, it won't converge quickly enough
- A summary of the entire alg is on p. 29
- A couple of issues with DDP are mentioned: behaviors should run for an arbitrarily long time (not *N* steps), and some settings are time sensitive (p.30)
- The first issue is resolved by replanning. They can leverage the computation done at the previous step, and the algorithm converges in just a couple of iterations.
- “We stress that without the fast and exact convergence properties of DDP near the maximum, this algorithm would be far less effective”

- The second issue is resolved by doing nearest-neighbor control with offline optimization

- The obstacles used in the swimmer domain aren’t hard obstacles; they are gaussians that add cost at those points
- What he calls “online MPC” I call rollouts with replanning (p.38)
- Using the previous step's planning results to bootstrap the next step can cause failure if the previous planning step led to a local optimum. In that case all the following steps will too, as each starts from that point.
- This issue is particularly problematic in walking domains when the ground collision occurs. For example, in the hopper domain, at the end of the rollout, the hopper wants to avoid touching the ground because that actually slows it down, so what it does is try to stay airborne a little longer. At some point, the hopper can no longer avoid hitting the ground, but ends up in a very poor state because of delaying that event
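The warm-started replanning loop described above might be sketched like this. `optimize` and `f` are stand-ins for the trajectory optimizer and the dynamics (names mine); the shift-and-repeat warm start is both why replanning converges in a couple of iterations and why a local optimum in one plan is inherited by every later plan.

```python
def mpc_step(optimize, f, x, U_prev):
    # One step of online MPC (rollouts with replanning): warm-start the
    # optimizer with the previous plan shifted by one step, then execute
    # only the first action of the new plan.
    U_init = U_prev[1:] + [U_prev[-1]]   # shift the plan, repeat last action
    U = optimize(x, U_init)              # refine from the warm start
    return f(x, U[0]), U                 # next state, plan for next call
```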

- The next chapters introduce methods that perform infinite horizon optimization, without getting stuck in local minima

Chapter 4: Bayesian Estimation for Optimal Control

- This optimization is done as infinite horizon by doing optimization over cycles (a sequence of repeated behavior) (p.40)
- The method explicitly represents loops of states, and associates with each one a probability of success
- <This section is not up my alley so I’ll probably get through it quickly>
- With DDP, the approach seeks to find "the optimal control sequence"; with this method, the algorithm asks "Given that we observe success, what is the most likely route taken?" (p.41)
- Although the approach is different, a lot of the mechanics are the same in terms of finding local approximations of the dynamics and cost functions around a given trajectory
- The Bayesian inference problem reapproximates the system locally and defines the inference problem for the next step. When it converges, the mean of the posterior gives the locally optimal trajectory
- This is the approach they use for the walking with the planar humanoid (foot, shin, thigh, torso)
- There are ways to get rollouts to plan for cycles, but either don’t work well or are computationally wasteful (p.42)
- <skipping most of this chapter>
- Use shaping for these results

Chapter 5: Infinite-horizon Model Predictive Control

- This section is about doing MPC (rollouts with replanning) for domains where optimal behavior results in a limit cycle
- What is impressive is that they compute the policy offline and then use a method based on that for real-time decisions, yet the behavior can recover from all sorts of extreme perturbations; it isn't stable only in some small region.
- Infinite-horizon MPC is generally solved as an LQR problem, which only cares about behavior around the goal state (elsewhere, its assumptions about the dynamics are violated). In the absence of a goal state, such methods aren't really usable (p.60)
- He looks at average cost and has some references for that, but I'm skipping those papers as I don't think there is much practical difference between average reward and discounted
- The idea in the chapter is to estimate the optimal limit cycle using the Bayesian approach from chapter 4, and use the estimated value as the terminal cost when doing the rollouts. If the region MPC thinks is optimal agrees with the Bayesian estimate, the policy will remain in that cycle
- The basic process is (p.61):
- Identify optimal limit cycle using the Bayesian approach from ch 4
- Construct local quadratic approximations of *V* around every state
- Use that approximated value function as the terminal cost when doing MPC
- As long as MPC can find a solution where the last state in the rollout is close to the limit cycle, it yields a good approximation of the optimal behavior
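The terminal-cost construction in the steps above amounts to this objective (a sketch with my own stand-in names: `l` is the running cost, `V` the limit-cycle value estimate, `f` the dynamics):

```python
def mpc_objective(l, V, f, x1, U):
    # Finite-horizon MPC objective with the limit-cycle value estimate V
    # as the terminal cost: sum of running costs l(x, u) plus V at the
    # last state. If the rollout ends near the limit cycle, this is an
    # approximation of the infinite-horizon cost.
    x, J = x1, 0.0
    for u in U:
        J += l(x, u)
        x = f(x, u)       # roll the dynamics forward
    return J + V(x)       # terminal cost at the final state
```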

- Use shaping for these results (p.67)
- Unperturbed, the MPC optimization converges after 1 iteration, so it runs in real time
- Even when the model has serious problems in terms of accuracy (mass or segment lengths wildly off), the controller is still effective

Chapter 6: Inverse-dynamics Trajectory Optimization

- Approaches in previous chapters rely on the time discretization being very fine, or they become unstable (p.69): "This computation uses numerical integration, which suffers from numerical stability issues – when the dynamics are stiff or non-smooth, small time-steps are necessary to prevent divergence"
- The fine time resolution is a problem, because the approaches have polynomial costs in the depth of the rollout
- This chapter leverages inverse dynamics, which allows more coarse time resolution
- Inverse dynamics are particularly complicated for settings which have contacts (legs), because it must be able to differentiate between the forces applied by the agent and the reaction forces (compare stopping a foot exactly before hitting the ground, or letting a foot fall to the ground) (p 70)
- They use a special invertible contact model based on convex optimization, which also provides contact smoothing. It does it in a way that doesn’t require it to be modeled as a hybrid system
- They use this method to make a 3-d humanoid model run.
- Because of the part of the gait where the agent is in the air, and has no way to influence the trajectory of its center of mass, it is very difficult to get traditional gait design approaches to work with running
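The forward/inverse distinction above can be caricatured with a point mass (my own toy, nothing like the thesis' contact-aware inverse dynamics): instead of integrating forces forward through time, the desired trajectory is given and the forces are read off from its accelerations, so coarse time steps pose no stability problem.

```python
def inverse_dynamics(xs, dt, m=1.0):
    # Point-mass sketch of inverse dynamics: given a (possibly coarsely
    # sampled) desired position trajectory xs, recover the forces that
    # realize it, u_k = m * a_k, with accelerations from second
    # differences. No forward numerical integration is involved.
    return [m * (xs[k + 1] - 2 * xs[k] + xs[k - 1]) / dt**2
            for k in range(1, len(xs) - 1)]
```

For a constant-acceleration trajectory the second difference is exact, so the recovered force is constant regardless of how coarse dt is.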

- The humanoid being controlled has 31 degrees of freedom
- Again, in domains with contacts, representation is usually hybrid, denoting whether each segment is in free space or not (p.71)
- In general, this means that particular models have to be constructed for each particular circumstance, and even then they can have serious problems getting the modeling right

- <Skipping the parts related to the inverse model>
- When running, the agent naturally swings its arms in the same way we do when we walk

Chapter 7: Conclusion and Future Work

- I must be misunderstanding parts of the work, because it says "The dynamics is treated as a black box, without applying analytic manipulation to the time-stepping function *f*, nor to its inverse dynamics counterpart. The use of general-form cost functions and black-box dynamics guarantees the generality and extensibility of these methods" (p.83)
- "The use of smooth contact models allows us to apply standard optimization tools to domains with multiple contacts without having to explicitly represent or prescribe the contact times. This means that complex motion patterns with unpredictable number of simultaneous contacts can be solved just as easily."
- The interesting aspect of this work is how it stitches together control theory (concerned with provably stable policies, often only in a small region) and RL (which tries to find globally optimal behavior, but then has to deal with a much larger problem)