<This is a tech report?>

<Seems like it is dealing with deterministic case>

- Based on the abstract, seems similar to http://www.mit.edu/~jnt/Papers/J066-97-rollout.pdf, and his other work that deals with heuristics and rollouts
- “Under suitable assumptions, we show that if the base heuristic produces a feasible solution, the rollout algorithm also produces a feasible solution, whose cost is no worse than the cost correspoding to the base heuristic.”
- <Important!> “In the base heuristic is a DP policy, i.e., it generates state/control sequences via feedback functions μ
,_{k}*k*=0,1,…,N-1, so that at any state*x*, it applies the control μ_{k}(_{k}*x*), then the algorithm can be viewed as a policy improvement step of the policy iteration method. Based on the generic results for policy iteration, it then follows that the rollout algorithm’s cost <doing cost minimization in this setting> is no worse than the one of the base heuristic. This cost improvement property has been extended in [BTW97 <Bertsekas, Tsitsiklis, Wu. Rollout algorithms for combinatorial optimiziation>] to base heuristics that are not DP policies, but rather satisfy an assumption called_{k}*sequential improvement*, which will be discussed shortly within the constrained context of this paper.” - The base heuristic is more like a heuristic policy (as opposed to a value) in that it produces a “partial trajectory”, where the idea seems to be that if you plan to some depth you continue the rollout with the heuristic to give a better estimate of cost
- <This fits well with the random policy stuff>
- <Fits with Go literature>

- There is a given set of trajectories <a priori?>; the planned trajectory followed by the trajectory resulting from the heuristic may not be a member of this set of trajectories.
- <I think the paper is set up this way so that it can be used for constraint satisfaction; only trajectories satisfying the constraint are in this set;
**in this paper they deal with constrained DP as opposed to the more common unconstrained DP**.>

- <I think the paper is set up this way so that it can be used for constraint satisfaction; only trajectories satisfying the constraint are in this set;
- “We will show under an appropriate extension of the notion of sequential improvement, that the rollout algorithm produces a feasible solution, whose cost is no worse than the cost corresponding to the base heuristic. We note, however, that
**it does not appear possible to relate our rollout algorithm to concepts of policy iteration, and in fact we are not aware of an analog of the policy iteration method for constrained DP problems**[my emphasis]…”- <So this paper is probably less useful at the moment than his other work because of the focus on constrained DP… may skim the rest of this paper>

- For constrained rollouts, when performing action selection, it only considers actions at each time step that will keep the rollout satisfying the constraints
- At each decision point, examines all possible actions, and then does heuristic rollouts from each resulting state
- Takes the action that had the lowest cost
**and**was still feasible, according to the information generated by the heuristic policy

**<Ah yes, this paper is the constrained version of that discussed in [BTW97 <Bertsekas, Tsitsiklis, Wu. Rollout algorithms for combinatorial optimiziation>] which I’m already familiar with>**- Ok so at this point I’m just going to skim and mine for references, as was recommended to me

- In some cases the heuristic will not produce policies that satisfy the constraints, in which case you are in trouble, but they introduce the idea of a heuristic being
*sequentially consistent*, which basically means the policy itself is Markovian (they then say in this case you can just call the heuristic a policy) - A sequentially improving base heuristic has a cost that keeps reducing as the rollout progresses <this actually does not seem trivial to achieve, but they say any sequentially consistent algorithm is also sequentially improving, so I guess it is>, and also that trajectories must be consistent
- <I suppose somewhat obviously> that if the base heuristic is sequentially improving, doing deeper rollouts with this heuristic will improve the solution quality
- “If the base heuristic is sequentially improving, then the rollout algorithm produces a trajectory … that is feasible and has cost that is no larger than the cost of the trajectory generated by the base heuristic starting from x
_{0}.” - Then moves on to what you can do if the trajectory is not feasible
- As well as the case where doing trees (multiple partial trajectories) instead of rollouts