## Nested Rollout Policy Adaptation for Monte Carlo Tree Search. Christopher D. Rosin. IJCAI 2011 Best Paper

• Most MCTS algorithms use a fixed rollout policy; this algorithm adapts the rollout policy during search
• Based on Cazenave’s earlier Nested Monte Carlo Search (NMCS)
• But here gradient ascent is applied to the rollout policy at each level of the nested search
• Tested on crossword puzzle construction and Morpion Solitaire; generates state-of-the-art results in these domains (the previous record in solitaire was human-generated and had stood for roughly 30 years)
• Concerned here with deterministic (single-player) optimization problems
• Says that, in general, nested Monte Carlo search generates state-of-the-art solutions in deterministic problems (the nesting scheme is described next)
• Search is initiated at a nesting level n, which is the depth of the recursion for the nested search.
• At each nesting level, the algorithm searches for a solution sequence (a sequence of actions leading from the root to a leaf)
• When attempting to refine its policy, nesting level n calls level n-1, bottoming out at level 0, which is a single rollout under the current policy (see the sketches in the talk notes below)
• Says when nesting succeeds, level n is substantially better than n-1
• Doesn’t require smoothness or evaluation heuristics, but can benefit from them
• It can be DAG-ified
• “NRPA may not obtain robust policies that give viable results starting from arbitrary nodes in the tree.  Instead, NRPA’s policies emphasize actions taken by the best solutions so far”
• The parameterization did not change across application domains
• Took about a day of processing to beat a previous result from a different algorithm that took 38 days to run

————From video of talk

• Effectively “learns” a heuristic, so it is well suited to cases where there is no good heuristic
• Dyna-2 sort of has the same property?
• The policy is described by some combination of parameters (may be enormous: tens of thousands)
• Doesn’t care about values of nodes; cares about actions. Identifies actions that occur across successful rollouts.
• Use structure of domain to identify successful rollouts
• Rollouts always start from root
• Policies are stochastic, softmax-like: p(a) = e^policy(code(a)) / Σ_b e^policy(code(b)), normalized over the legal actions b; something similar has been used for exploration in Go in the past (sketch below)
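A minimal sketch of that sampling rule, assuming a `policy` dict keyed by action codes and a domain-supplied `code(a)` function (the names are mine, not the paper’s):

```python
import math
import random

def sample_action(policy, legal_actions, code):
    """Sample an action with probability proportional to exp(weight).

    `policy` maps action codes to real-valued weights; missing codes
    default to 0, so the initial policy is uniform over legal actions.
    """
    weights = [math.exp(policy.get(code(a), 0.0)) for a in legal_actions]
    r = random.uniform(0.0, sum(weights))
    for a, w in zip(legal_actions, weights):
        r -= w
        if r <= 0.0:
            return a
    return legal_actions[-1]  # guard against floating-point round-off
```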
• Level 1 search:
• Do n iterations
• Keep track of the best solution sequence found
• Update the policy by a gradient step on the log-probability of that best sequence
• The gradient moves exponentially quickly toward the best sequence, so the iteration count used in the experiments was small (100)
• Initially the policy is totally random, but it adapts by gradient toward the best sequence, performing a local optimization until it finishes (sketch below)
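A sketch of the level-0 playout and the adapt step as I reconstruct them from the description above. The problem interface (`legal_actions(state)`, `step(state, a)`) and the constants are my assumptions, though the notes do say 100 iterations and fixed parameters across domains:

```python
import math

ALPHA = 1.0   # gradient step size (assumed value; parameters were fixed across domains)
N_ITER = 100  # iterations per level, as mentioned in the notes above

def playout(policy, root, legal_actions, step, code):
    """Level 0: one rollout from the root under the current stochastic policy."""
    state, seq = root, []
    acts = legal_actions(state)
    while acts:
        a = sample_action(policy, acts, code)
        seq.append(a)
        state = step(state, a)
        acts = legal_actions(state)
    return seq

def adapt(policy, best_seq, root, legal_actions, step, code):
    """Gradient-ascent step on the log-probability of the best sequence:
    raise the weight of each chosen action, and lower the weight of every
    legal action in proportion to its current probability."""
    new = dict(policy)
    state = root
    for best in best_seq:
        acts = legal_actions(state)
        z = sum(math.exp(policy.get(code(a), 0.0)) for a in acts)
        new[code(best)] = new.get(code(best), 0.0) + ALPHA
        for a in acts:
            p = math.exp(policy.get(code(a), 0.0)) / z
            new[code(a)] = new.get(code(a), 0.0) - ALPHA * p
        state = step(state, best)
    return new
```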
• The point of nesting is to cause more global search to be conducted
• For level L>1 search:
• Do n iterations
• Call algorithm for the level below (L-1)
• Keep track of the best sequence found, and adapt the policy toward it after each iteration
• Each level takes n times as long as the level below it (n^L level-0 rollouts in total; with n = 100, level 5 already means 10^10 rollouts), so usually can’t do better than 5 levels
• I’m still not 100% on this
• The difference between the two domains tested is that one is wide and the other is deep (although both seem very large)
• Seems like level 1 is initialized totally at random when run by itself, but when run by level 2 the restarts aren’t totally random: they start from the (stochastic) policy that level 2 has adapted so far (see the sketch below)
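Putting the pieces together, a sketch of the full recursion as I reconstruct it (same assumed interface and helpers as the sketches above); note how level L-1 starts from a copy of level L’s adapted policy, which would explain the non-random restarts:

```python
def nrpa(level, policy, root, legal_actions, step, code, score):
    """Level 0 is a single playout; each higher level runs N_ITER iterations
    of the level below, tracks the best sequence seen, and adapts its own
    copy of the policy toward that sequence."""
    if level == 0:
        seq = playout(policy, root, legal_actions, step, code)
        return score(seq), seq
    policy = dict(policy)  # each level adapts a private copy
    best_score, best_seq = float("-inf"), None
    for _ in range(N_ITER):
        s, seq = nrpa(level - 1, policy, root, legal_actions, step, code, score)
        if s > best_score:
            best_score, best_seq = s, seq
        policy = adapt(policy, best_seq, root, legal_actions, step, code)
    return best_score, best_seq
```

Only the policy weights persist between rollouts; there are no per-node statistics, which matches the memory remark below.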
• Seems like the way this differs from the Nested Monte Carlo Search work it is based on is the use of gradient methods to adjust the policy
• Not directly applicable to 2-player games; he was motivated by Go but hasn’t applied it to that yet
• Doesn’t maintain statistics over the search tree, just the policy, so memory use is probably minimal
• There are performance guarantees in the paper