Rollout Algorithms for Constrained Dynamic Programming. Bertsekas. LIDS (tech report?) 2005.


<This is a tech report?>

<Seems like it is dealing with deterministic case>

  1. Based on the abstract, seems similar to http://www.mit.edu/~jnt/Papers/J066-97-rollout.pdf, and his other work that deals with heuristics and rollouts
  2. “Under suitable assumptions, we show that if the base heuristic produces a feasible solution, the rollout algorithm also produces a feasible solution, whose cost is no worse than the cost correspoding to the base heuristic.”
  3. <Important!> “In the base heuristic is a DP policy, i.e., it generates state/control sequences via feedback functions μk, k=0,1,…,N-1, so that at any state xk, it applies the control μk(xk), then the algorithm can be viewed as a policy improvement step of the policy iteration method.  Based on the generic results for policy iteration, it then follows that the rollout algorithm’s cost <doing cost minimization in this setting> is no worse than the one of the base heuristic.  This cost improvement property has been extended in [BTW97 <Bertsekas, Tsitsiklis, Wu. Rollout algorithms for combinatorial optimiziation>] to base heuristics that are not DP policies, but rather satisfy an assumption called sequential improvement, which will be discussed shortly within the constrained context of this paper.”
  4. The base heuristic is more like a heuristic policy (as opposed to a value) in that it produces a “partial trajectory”, where the idea seems to be that if you plan to some depth you continue the rollout with the heuristic to give a better estimate of cost
    1. <This fits well with the random policy stuff>
    2. <Fits with Go literature>
  5. There is a given set of trajectories <a priori?>; the planned trajectory followed by the trajectory resulting from the heuristic may not be a member of this set of trajectories.
    1. <I think the paper is set up this way so that it can be used for constraint satisfaction; only trajectories satisfying the constraint are in this set; in this paper they deal with constrained DP as opposed to the more common unconstrained DP.>
  6. “We will show under an appropriate extension of the notion of sequential improvement, that the rollout algorithm produces a feasible solution, whose cost is no worse than the cost corresponding to the base heuristic. We note, however, that it does not appear possible to relate our rollout algorithm to concepts of policy iteration, and in fact we are not aware of an analog of the policy iteration method for constrained DP problems [my emphasis]…”
    1. <So this paper is probably less useful at the moment than his other work because of the focus on constrained DP… may skim the rest of this paper>
  7. For constrained rollouts, when performing action selection, it only considers actions at each time step that will keep the rollout satisfying the constraints
    1. At each decision point, examines all possible actions, and then does heuristic rollouts from each resulting state
    2. Takes the action that had the lowest cost and was still feasible, according to the information generated by the heuristic policy
  8. <Ah yes, this paper is the constrained version of that discussed in  [BTW97 <Bertsekas, Tsitsiklis, Wu. Rollout algorithms for combinatorial optimiziation>]  which I’m already familiar with>
    1. Ok so at this point I’m just going to skim and mine for references, as was recommended to me
  9. In some cases the heuristic will not produce policies that satisfy the constraints, in which case you are in trouble, but they introduce the idea of a heuristic being sequentially consistent, which basically means the policy itself is Markovian (they then say in this case you can just call the heuristic a policy)
  10. A sequentially improving base heuristic has a cost that keeps reducing as the rollout progresses <this actually does not seem trivial to achieve, but they say any sequentially consistent algorithm is also sequentially improving, so I guess it is>, and also that trajectories must be consistent
  11. <I suppose somewhat obviously> that if the base heuristic is sequentially improving, doing deeper rollouts with this heuristic will improve the solution quality
  12. “If the base heuristic is sequentially improving, then the rollout algorithm produces a trajectory … that is feasible and has cost that is no larger than the cost of the trajectory generated by the base heuristic starting from x0.”
  13. Then moves on to what you can do if the trajectory is not feasible
  14. As well as the case where doing trees (multiple partial trajectories) instead of rollouts
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: