- Policy search tends to require significant amounts of hand-engineering for perception, state estimation, and low-level control
- They develop a system that learns to map camera images and joint angles directly to joint torques without <or with minimal?> hand engineering
- Uses a deep conv net w/92,000 parameters, 7 layers

- Use guided training to steer learning; the policy net then needs to generalize via supervised learning through the NN
- They do placing a hanger on a rack, a cube in a slot, removing a toy nail with a toy hammer, and screwing a cap on a bottle
- <the videos they have make it look like this system doesn’t do an excellent job generalizing>
- The CNN learns to identify key points on surfaces critical to control, such as corners of surfaces
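A minimal numpy sketch of how such key points can be read off a conv feature map via a spatial softmax (my own single-map toy; names and shapes are assumptions, not the paper's code):

```python
import numpy as np

def spatial_softmax(feature_map):
    """Turn one conv feature map (H x W) into an (x, y) image-space point:
    softmax over all pixels, then the expectation of pixel coordinates
    under that distribution."""
    h, w = feature_map.shape
    probs = np.exp(feature_map - feature_map.max())  # stabilized softmax
    probs /= probs.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    return (probs * xs).sum(), (probs * ys).sum()

# A map with one sharp activation peak yields a point at that peak.
fmap = np.zeros((5, 5))
fmap[1, 3] = 50.0  # strong activation at row 1, col 3
x, y = spatial_softmax(fmap)
```

The expectation makes the point differentiable in the map activations, which is what lets it sit inside an end-to-end-trained policy.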
- Backprop in this scenario is difficult because parts of the problem are nondifferentiable, and backpropagating through long time horizons leads to numerical instability
- Because they are doing robotics they need approaches that are sample efficient
- This requires only minutes of demonstration

- These problems are partially observable because information comes from a camera and can suffer from occlusion (among other things); the observations are also high-dimensional
- The form of policy search introduced here is based on “the Bregman alternating directions method of multipliers (BADMM) that makes it practical to train complex, generalizable policies under partial observation.”
- “Naive supervised learning will often fail to produce a good policy, since a small mistake on the part of the policy will put it in states that are not part of the training, causing compounding errors. To avoid this problem, the training data must come from the policy’s own state distribution […]. We use BADMM […] to adapt the trajectories to the policy, alternating between optimizing the policy to match the trajectories, and optimizing the trajectories to minimize the cost and match the policy, such that at convergence, they have the same distribution.”
- <Not reading super carefully, but the method seems to have pretty sophisticated mathematical backing, they make a number of simplifying assumptions to allow it to hold>
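A toy 1-D sketch of that alternation (my own drastic simplification, not the paper's BADMM updates): a "trajectory" scalar descends the cost while being penalized for deviating from the "policy", and the policy is regressed toward the trajectory, until both agree at the cost minimum:

```python
def alternate(cost_grad, steps=2000, rho=1.0, lr=0.1):
    """Alternating minimization between a trajectory variable t and a
    policy variable p, coupled by a quadratic penalty of weight rho."""
    t, p = 0.0, 5.0
    for _ in range(steps):
        # trajectory step: descend cost plus penalty for deviating from policy
        t -= lr * (cost_grad(t) + rho * (t - p))
        # policy step: supervised regression toward the trajectory
        p -= lr * rho * (p - t)
    return t, p

# Quadratic cost with its minimum at 2: both variables should agree near 2.
t, p = alternate(lambda x: 2.0 * (x - 2.0))
```

BADMM proper carries dual variables and Bregman divergences instead of a fixed quadratic penalty, but the alternation structure is the same.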
- If dynamics are known, a form of LQR can be used; otherwise a distribution fitted to the trajectories from previous iterations can be used
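For the known-dynamics case, a textbook finite-horizon LQR backward pass (standard form, not necessarily the paper's exact variant) recovers time-varying feedback gains from linear dynamics and quadratic costs:

```python
import numpy as np

def lqr_backward(A, B, Q, R, horizon):
    """Backward Riccati recursion for x' = A x + B u with per-step cost
    x^T Q x + u^T R u; returns gains K_t so that u_t = -K_t x_t."""
    P = Q.copy()  # terminal value-function Hessian
    gains = []
    for _ in range(horizon):
        BtP = B.T @ P
        K = np.linalg.solve(R + BtP @ B, BtP @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]  # ordered t = 0 .. horizon-1

# Double integrator: state = (position, velocity), input = force.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.eye(2)
R = np.array([[0.1]])
Ks = lqr_backward(A, B, Q, R, horizon=50)

# Roll the policy forward: the state is driven toward the origin.
x = np.array([[1.0], [0.0]])
for K in Ks:
    x = A @ x + B @ (-K @ x)
```

In guided policy search the dynamics are typically not known globally, so time-varying linear models are fitted to the sampled trajectories and this kind of recursion is run on the fitted models.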
- They optimize a Lagrangian w/SGD
- They reuse old samples but reweight them based on the difference between the state distribution they came from and the one the algorithm is at now
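The reweighting idea can be sketched as plain importance sampling (generic, not the paper's exact estimator): samples drawn under an old Gaussian state distribution are weighted by the density ratio new/old before averaging:

```python
import math

def gauss_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def reweighted_mean(samples, values, old, new):
    """Estimate E_new[value] from samples drawn under the `old` (mean, std)
    Gaussian, via self-normalized importance weights."""
    weights = [gauss_pdf(x, *new) / gauss_pdf(x, *old) for x in samples]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

# Samples of x with value f(x) = x, drawn under mean 0; shifting the target
# distribution to mean 1 should shift the estimate upward from 0.
samples = [-1.5, -0.5, 0.0, 0.5, 1.5]
est = reweighted_mean(samples, samples, old=(0.0, 1.0), new=(1.0, 1.0))
```

The further the two distributions drift apart, the more the weights degenerate, which is why old samples are only worth so much.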
- <Some of this seems related to DDP, but this seems more sophisticated>
- On the robot, this runs at 20 Hz; the arm is 7 DoF, and the state information is just the camera image and the robot's own pose information
- During training they provide additional information to the robot, such as the location of target objects; this information is removed during the phase where the robot is in control
- The training session is made up of 15 trajectories, which is very few considering the dimensionality <although the algorithm doesn’t seem to do a wonderful job generalizing, which may come from this>
- They use a dense cost function that helps the search (cost is based on the distance of the end effector to the target)
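A sketch of such a dense, distance-based cost (my own shaping and weights, not the paper's exact function): squared end-effector distance to the target, plus a log term that sharpens the minimum near zero distance so the optimizer keeps improving even when already close:

```python
import math

def dense_cost(ee_pos, target, alpha=1e-5):
    """Dense cost over 3-D positions: squared distance plus a log-shaped
    term that stays informative at small distances."""
    d2 = sum((a - b) ** 2 for a, b in zip(ee_pos, target))
    return d2 + 0.5 * math.log(d2 + alpha)

near = dense_cost((0.01, 0.0, 0.0), (0.0, 0.0, 0.0))
far = dense_cost((0.5, 0.5, 0.0), (0.0, 0.0, 0.0))
```

Being dense, this gives useful gradient signal everywhere, unlike a sparse success/failure reward.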
- The approach doesn’t work from camera data alone, since it doesn’t make an accurate enough pose estimate; it needs pose data from the robot
- “Each visuomotor policy required 3-4 hours of training time: 20-30 minutes for the pose prediction data collection on the robot, 40-60 minutes for the fully observed trajectory pre-training on the robot and offline pose pre-training (which can be done in parallel), and between 1.5 and 2.5 hours for end-to-end training with guided policy search. The coat hanger task required two iterations of guided policy search, the shape sorting cube and the hammer required three, and the bottle task required four.”