- Focuses on the relationship between approximate dynamic programming (ADP) and Model Predictive Control (MPC).
- Both of these methods are fundamentally related to policy iteration

- Most common MPC methods can be viewed as policy iteration rollout algorithms
- The paper also embeds rollout and MPC in a new unifying suboptimal control framework, based on a concept of restricted or constrained structure policies, which contains these schemes as special cases
- The term rollout was coined by Tesauro in his work on backgammon, where playing through a game involved rolling dice
- In the rollouts described here, after the last decision, an estimate of the value of the following state is also inserted
- It may be desirable to “determinize” the system to build this estimate of the value function

- There is discussion of the error bounds of approximate dynamic programming
- Such as when doing the rollout and then using the value estimate produces better results than using just the value estimate
- This occurs whenever the immediate reward + value estimate of the next state is higher than the value estimate of the current state (notation in the paper is control theoretic, so minimizing cost, but I'm flipping it back to rewards here)
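
To make the condition above concrete for myself, here it is written out for a deterministic system. The symbols are my own and not necessarily the paper's: $f$ is the transition function, $r$ the reward, $g$ the cost, $\tilde V$ the value estimate and $\tilde J$ the cost estimate. In reward form:

$$
\max_{u \in U(x)} \big[\, r(x,u) + \tilde V\big(f(x,u)\big) \big] \;\ge\; \tilde V(x) \quad \text{for all } x
$$

and in the cost-minimizing form the paper uses:

$$
\min_{u \in U(x)} \big[\, g(x,u) + \tilde J\big(f(x,u)\big) \big] \;\le\; \tilde J(x) \quad \text{for all } x
$$

As I read it, when this holds at every state, the one-step lookahead policy built on the estimate does at least as well as the estimate itself promises.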

- Also, the reward of one step lookahead is always at least as good as zero-step
- In the case that multiple heuristics are available, doing limited rollout is better than *all* of the heuristics. Ehhhh, I'm not following the math because I can’t understand why some of the proofs matter, and the notation is strange
- Distinguish between a rollout policy based on some simple heuristic policy, and the part of the policy that is actually being optimized through lookaheads
- The way rollouts are set up in this paper is to make the rollout one step deeper at each step in the algorithm, and then add the estimated value function on the end, potentially based on a heuristic policy. This is for deterministic systems
- Is this how the rollouts work with UCT in the Go work?
- Actually no, it is a little different here in that the rollouts don't have to be restarted completely; the heuristic value function just helps guide the decision making at each step
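
A minimal sketch of how I understand rollout with a base heuristic on a deterministic problem, in cost-minimizing form. The toy graph, the greedy base heuristic, and all function names are my own assumptions for illustration, not taken from the paper. It also assumes the graph is acyclic and that the greedy heuristic always reaches the goal.

```python
from typing import Dict, List, Tuple

Graph = Dict[str, Dict[str, float]]

# Hypothetical toy problem: edge costs on a small acyclic graph.
GRAPH: Graph = {
    "A": {"B": 1, "C": 2},
    "B": {"D": 10},
    "C": {"D": 1},
    "D": {},
}


def base_heuristic_cost(graph: Graph, state: str, goal: str) -> float:
    """Cost of following the greedy base heuristic (cheapest edge) to the goal."""
    total = 0.0
    while state != goal:
        nxt = min(graph[state], key=graph[state].get)
        total += graph[state][nxt]
        state = nxt
    return total


def rollout_path(graph: Graph, start: str, goal: str) -> Tuple[List[str], float]:
    """One-step lookahead, scoring each successor by running the base heuristic."""
    path, total, state = [start], 0.0, start
    while state != goal:
        # Edge cost plus the heuristic's cost-to-go from the successor.
        best = min(
            graph[state],
            key=lambda n: graph[state][n] + base_heuristic_cost(graph, n, goal),
        )
        total += graph[state][best]
        state = best
        path.append(state)
    return path, total


if __name__ == "__main__":
    print(base_heuristic_cost(GRAPH, "A", "D"))
    print(rollout_path(GRAPH, "A", "D"))
```

On this toy graph the greedy heuristic goes A, B, D for a cost of 11, while the rollout policy looks one step ahead, sees that C leads to a cheaper finish, and goes A, C, D for a cost of 3.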

- Estimated values have to be admissible – that means as the rollout is deepened the policy improves
- Also discusses what happens if you get to a state where actions can’t be applied, or when there are limits to the total amount of action that can be applied, but it's not really relevant to the domains I consider
- Discusses POMDPs as well, which I’m not reading closely
- Model Predictive Control was motivated by the desire to introduce nonlinearities and constraints to LQR, while maintaining a suboptimal but stable closed loop system
- Here it is described for a nonlinear deterministic system and nonquadratic cost, with a zero-cost origin
- Penalties must be positive
- The goal is to derive a feedback controller that is stable: state and action go to the origin, that is, the total cost from anywhere is finite (it reminds me of a regret way to look at the problem)
- They mention a constrained controllability assumption, which just says what is above
- The problem in MPC is to find a policy that reaches the origin and stays there
- It's sort of a different style of paper in that it talks about what properties methods should have in either setting (rollouts or MPC), more than talking about how algorithms actually achieve that; much more a theoretical paper. Not my style in terms of being very mathy for the sake of being mathy. Do you really need to say in every section that there is a constraint that the state the algorithm is in is in the set of states? That type of stuff is just mathematical diarrhea.
- Discusses a “tube” of states the rollout must stay within
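
A rough sketch of the MPC setup as I read it: at each step, search over control sequences that drive the state to the origin within a fixed horizon, apply only the first control, and repeat. The scalar system, the control set, the stage cost, and the brute-force search are all my own assumptions for illustration. The paper works with a general nonlinear deterministic system.

```python
from itertools import product
from typing import List

CONTROLS = (-1, 0, 1)  # hypothetical constrained control set


def step(x: int, u: int) -> int:
    """Toy deterministic dynamics (an assumption, not the paper's system)."""
    return x + u


def stage_cost(x: int, u: int) -> int:
    """Positive everywhere except at the zero-cost origin (x = 0, u = 0)."""
    return x * x + u * u


def mpc_control(x: int, horizon: int) -> int:
    """First control of the cheapest sequence that reaches the origin in `horizon` steps."""
    best_cost, best_first = None, None
    for seq in product(CONTROLS, repeat=horizon):
        state, cost = x, 0
        for u in seq:
            cost += stage_cost(state, u)
            state = step(state, u)
        if state != 0:
            continue  # terminal constraint: must end at the origin
        if best_cost is None or cost < best_cost:
            best_cost, best_first = cost, seq[0]
    if best_first is None:
        # Plays the role of the constrained controllability assumption failing.
        raise ValueError("origin not reachable within the horizon")
    return best_first


def simulate(x0: int, horizon: int, steps: int) -> List[int]:
    """Closed-loop trajectory: re-solve at every step, apply the first control."""
    traj = [x0]
    for _ in range(steps):
        traj.append(step(traj[-1], mpc_control(traj[-1], horizon)))
    return traj


if __name__ == "__main__":
    print(simulate(3, 3, 5))
```

Starting from 3 with a horizon of 3, the closed-loop state walks down to the origin and stays there, which is the stability property the notes above describe.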