## Dynamic Programming and Suboptimal Control: A Survey from ADP to MPC. Bertsekas. European Journal of Control

1. Focuses on relationship between approximate dynamic programming (ADP) and Model Predictive Control (MPC).
1. Both methods are fundamentally related to policy iteration
2. Most common MPC methods can be viewed as rollout algorithms, a form of policy iteration
3. The paper also embeds rollout and MPC in a new unifying suboptimal control framework, based on a concept of restricted- or constrained-structure policies, which contains both schemes as special cases
4. The term rollout was coined by Tesauro in his work on backgammon, where playing out a game involved rolling dice
5. In the rollouts described here, an estimate of the value of the state reached after the last decision is appended to the accumulated cost
1. It may be desirable to “determinize” the system when building this estimate of the value function
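One way to read the "determinize" idea is certainty equivalence: replace the random disturbance with its mean and run a cheap heuristic on the resulting deterministic model to get the value estimate. A minimal Python sketch; all dynamics, the policy, and every name below are invented for illustration, not from the paper:

```python
# Hedged sketch of "determinizing" a stochastic system before estimating a
# state's value. The toy dynamics and heuristic policy are illustrative.

def step(x, u, w):
    # toy stochastic scalar dynamics: x' = x + u + w, with disturbance w
    return x + u + w

def determinized_step(x, u, mean_w=0.0):
    # certainty equivalence: plug in the disturbance's mean instead of sampling
    return step(x, u, mean_w)

def heuristic_value_estimate(x, horizon=20):
    # run a simple proportional policy on the determinized model and
    # accumulate quadratic stage costs as the value estimate
    total = 0.0
    for _ in range(horizon):
        u = -0.5 * x
        total += x * x + u * u
        x = determinized_step(x, u)
    return total

print(heuristic_value_estimate(1.0))
```

The estimate is then plugged in at the end of the rollout exactly where a terminal value is needed, without ever simulating the noise.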
6. There is discussion of the error bounds of approximate dynamic programming
1. Such as conditions under which doing the rollout and then applying the value estimate produces better results than using the value estimate alone
2. This occurs whenever the immediate reward plus the value estimate of the next state is at least the value estimate of the current state (the paper's notation is control-theoretic, minimizing cost, but I'm reversing it back to rewards here)
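Written back in the paper's cost-minimization terms, the condition is that the lookahead-chosen control u* satisfies g(x, u*) + J̃(f(x, u*)) ≤ J̃(x). A tiny self-contained check on an invented integer-line problem (the dynamics, costs, and J̃ are all illustrative):

```python
# Toy check of the cost-improvement condition, in cost-minimization terms:
# for the lookahead-chosen control u*, g(x,u*) + J~(f(x,u*)) <= J~(x).

def f(x, u):            # deterministic dynamics on the integer line
    return x + u

def g(x, u):            # unit stage cost away from the origin
    return 0 if x == 0 else 1

def J_approx(x):        # heuristic cost-to-go: distance to the origin
    return abs(x)

def lookahead_control(x):
    # one-step lookahead over the two available moves
    return min((-1, +1), key=lambda u: g(x, u) + J_approx(f(x, u)))

for x in range(1, 10):
    u = lookahead_control(x)
    assert g(x, u) + J_approx(f(x, u)) <= J_approx(x)
print("improvement condition holds for x = 1..9")
```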
7. Also, one-step lookahead is always at least as good as zero-step lookahead
8. When multiple heuristics are available, limited rollout over them is at least as good as each individual heuristic
9. ehhhh not following the math because I can’t understand why some of the proofs matter, and the notation is strange
10. Distinguish between the rollout policy built on some simple heuristic base policy, and the part of the policy that is actually being optimized through lookahead
11. The way rollouts are set up in this paper is to make the rollout one step deeper at each step of the algorithm, and then add the estimated value function at the end, potentially based on a heuristic policy. This is for deterministic systems
1. Is this how the rollouts work with UCT in the Go work?
2. Actually no, it is a little different here in that the rollouts don’t have to be restarted completely; the heuristic value function just helps guide the decision making at each step
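One way to sketch this style of lookahead on a deterministic system (the dynamics, costs, terminal value, and control grid below are all invented illustrations, not the paper's setup): enumerate control sequences of length m, score each by accumulated stage cost plus a heuristic terminal value, apply the first control, and re-plan from the new state:

```python
# m-step lookahead with a heuristic terminal value on a toy deterministic
# scalar system; applied in receding-horizon fashion.
from itertools import product

def f(x, u):                 # deterministic dynamics
    return 0.8 * x + u

def g(x, u):                 # quadratic stage cost
    return x * x + u * u

def terminal_value(x):       # heuristic estimate of the cost-to-go
    return 2.0 * x * x

def lookahead_control(x, m=3, controls=(-1.0, -0.5, 0.0, 0.5, 1.0)):
    best_u0, best_cost = None, float("inf")
    for seq in product(controls, repeat=m):
        xi, cost = x, 0.0
        for u in seq:
            cost += g(xi, u)
            xi = f(xi, u)
        cost += terminal_value(xi)      # value estimate "added on the end"
        if cost < best_cost:
            best_u0, best_cost = seq[0], cost
    return best_u0

# receding application: apply only the first control, then re-plan
x = 3.0
for _ in range(15):
    x = f(x, lookahead_control(x))
print(x)
```

Because only the first control of each plan is executed, the lookahead never has to be "restarted": the terminal value guides every re-planning step, which matches how the notes above describe it.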
12. Estimated values have to be admissible – meaning that as the rollout is deepened, the policy improves
13. Also discuss what happens if you get to a state where actions can’t be applied, or when there are limits to the total amount of action that can be applied, but it’s not really relevant to the domains I consider
14. Discuss POMDPs as well, which I’m not reading closely
15. Model Predictive Control was motivated by the desire to introduce nonlinearities and constraints into LQR, while maintaining a suboptimal but stable closed-loop system
16. Here it is described for a nonlinear deterministic system and nonquadratic cost, with a zero-cost origin
17. Cost penalties must be positive away from the origin
18. The goal is to derive a feedback controller that is stable: state and action go to the origin, i.e., the total cost from anywhere is finite (it reminds me of a regret way of looking at the problem)
19. They mention a constrained controllability assumption, which just formalizes what is said above
20. The problem in MPC is to find a policy that reaches the origin and stays there
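A hedged sketch of that receding-horizon scheme: at each step, minimize over finite-horizon control sequences that are required to end at the origin (the terminal constraint), apply the first control, and repeat. The integer dynamics, costs, and control bounds below are invented for illustration:

```python
# Toy MPC loop with a terminal "reach the origin" constraint.
from itertools import product

def f(x, u):
    return x + u            # toy integer dynamics

def mpc_control(x, horizon=4, controls=(-1, 0, 1)):
    # brute-force the finite-horizon problem; keep only trajectories
    # that satisfy the terminal constraint x_horizon == 0
    best = None
    for seq in product(controls, repeat=horizon):
        xi, cost = x, 0
        for u in seq:
            cost += abs(xi) + abs(u)
            xi = f(xi, u)
        if xi == 0 and (best is None or cost < best[0]):
            best = (cost, seq[0])
    return None if best is None else best[1]

x, path = 3, [3]
for _ in range(6):
    x = f(x, mpc_control(x))
    path.append(x)
print(path)
```

The closed-loop trajectory walks to the origin and stays there, which is exactly the stability property the notes describe; states too far away for the horizon (here, |x| > 4) would make the problem infeasible, which is what the constrained controllability assumption rules out.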
21. It’s sort of a different style of paper in that it talks about what properties methods should have in either setting (rollouts or MPC), more than about how algorithms actually achieve those properties; much more a theoretical paper. Not my style, in terms of being very mathy for the sake of being mathy. Do you really need to say in every section that the state the algorithm is in lies in the set of states? That type of stuff is just mathematical diarrhea.
22. Discusses a “tube” of states the rollout must stay within
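A small invented sketch of such a tube constraint: the controller only considers controls whose successor states stay within a fixed radius of a nominal trajectory. The nominal path, radius, control grid, and greedy tracking rule are all illustrative assumptions:

```python
# Greedy tracking restricted to a "tube" around a nominal trajectory.

def in_tube(k, x, nominal, radius=1.0):
    return abs(x - nominal[k]) <= radius

nominal = [3.0, 2.0, 1.0, 0.0, 0.0, 0.0]

def constrained_controls(k, x, controls=(-1.5, -1.0, -0.5, 0.0)):
    # keep only controls whose successor state stays inside the tube
    return [u for u in controls if in_tube(k + 1, x + u, nominal)]

x = 3.0
traj = [x]
for k in range(len(nominal) - 1):
    candidates = constrained_controls(k, x)
    # among tube-respecting controls, greedily track the nominal path
    u = min(candidates, key=lambda u: abs(x + u - nominal[k + 1]))
    x += u
    traj.append(x)
print(traj)
```

The tube turns the state constraint into a per-step restriction on the control set, so every rollout considered is feasible by construction.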