This paper is great

I am reading this paper because I saw a reference to it elsewhere, where it says if 1-ply rollouts are used (choose one action, then follow policy pi for the remainder of the rollout), the result of that method is provably better than just pi under mild assumptions. This is relevant to the real-time vanilla MC work

- Considers stochastic scheduling problems which I believe means choosing a sequence of decisions to make. The size of the decision space is combinatorial.
- Paper considers the quiz problem:
- You are on a game show, and there are a bunch of questions with different payouts. Your goal is to choose the ordering of the questions that maximizes your cumulative payout until failure

- Discuss heuristics for this setting and rollout algorithms based on these heuristics
- The performance of these rollout policies ends up being near-optimal, and much better than the heuristics themselves

- In the basic quiz setting, questions should just be answered in decreasing order of p_i*v_i/(1-p_i). The greedy policy of answering in descending p_i*v_i is suboptimal
- With small changes, however, the problem becomes very difficult. Examples are:
- The total number of questions you are allowed to attempt is smaller than the entire number of available questions (N)
- Constraints on when in time each question may be selected.
- Ordering/Precedence constraints on questions
- Sequence-dependent rewards, where rewards depend on history

- In all of these variants there is an optimal open-loop policy (I suppose because you either get the question right and proceed, or make a mistake, and the episode terminates)
- There are other variants also listed of the quiz problem, the introduce stochasticity to the problem and are solvable via DP, but the problem is intractable
- In this paper, they propose suboptimal solutions that are computationally intractable
- These solutions are based on
**rollout algorithms**, they say is based on policy iteration

- These solutions are based on
**Rollout algorithms initially proposed by Bertsekas and Tistsiklis, and Wu in 96,97**(Learning to Act Using Real-Time Dynamic Programming, Rollout Algorithms for Combinatorial Optimization)**Generally, rollout algorithms are capable of magnifying the effectiveness of any given heuristic algorithm through sequential application**, by enhancing the policy improvement step of the underlying policy iteration that is occurring- The rollout method described uses a heuristic value (defined for each question), and then greedily chooses questions based on the next best heuristic value. The cost of this is O(MN), where there are N questions, but you are restricted to choosing a total of M of them
- I think the point is this is different from just computing the heuristic values (N times) and then sorting them according to that.

- The superiority of the rollout method as compared to the base heuristic is guaranteed when the heuristic is
**sequentially consistent**, which means that basically the order in which the heuristic chooses things is independent of what of what has already been chosen (aside from the fact that it won’t try to choose something that has already been selected, of course).- This is discussed more in the Bertsekas, Tsitsiklis, Wu paper

- They also discuss a property which is called
*sequentially improving*, part of which guarantees that the rewards of schedules produced are monotonically nonincreasing - Also talk about other rollout methods, such as m-step lookahead that enumerates all possible sequences of length m, and then follows the heuristic after that, and chooses the best initial action
- This is similar to how UCT is used in practice
- There is also another heuristic for how to make this selection cheaper (at the potential cost of performance)

- They run experiments on 7 different algs (2 Heuristics, with no lookahead, 1-step lookahead, and 2-step lookahead):
- Just doing 1-step rollouts leads to very significant (almost doubling) of the performance of heuristics relative to optimal, adding another step of lookahead only helped slightly (sometimes results were equivalent to just 1-step lookahead). For the more difficult problems the second step of lookahead was significantly helpful
- Rollouts help fix heuristics that are myopic
- Section 4 generalizes the setting to what is basically a MDP (undiscounted, it seems)
- Doing one step of policy iteration given a policy pi for an MDP is better than just following pi.
- This is equivalent to saying argmax_a Q^{pi}(s,a) is better than pi(s)

- Cites Tesauro’s work of Monte-Carlo rollouts in Backgammon
- There is some discussion of doing rollouts with the noise parameters clamped to certain values
- Call
*scearia*multiple trajectories where each has different noise - <I’m sort of losing this paper towards the end>
- There are then results for more complex quiz domains
- The results are comparable to the first batch of empirical results

- There is a domain where the questions have graph constraints (like an MDP), and there the benefit of doing rollouts is actually reduced as the heuristics work pretty well by themselves.
- Still doing rollouts helps, and brings the results very close to optimal, while the computational costs are drastically reduced