- Most MCTS algorithms use a fixed rollout policy; this algorithm adapts the rollout policy during search
- Based on an earlier Nested Monte Carlo Search
- But here gradient ascent is applied to the rollout policy at each level of the “nested search” (what’s that?)
- Tested on crossword puzzle construction and solitaire; generates state-of-the-art results for these domains (the previous record in solitaire was human-generated and had stood for 30 years)
- Concerned with deterministic optimization problems here.
- Says that, in general, nested Monte Carlo search generates state-of-the-art solutions to deterministic problems… what is that?
- Search is initiated at a nesting level *n*, which is the depth of the recursion for the nested search
- At each nesting level, the algorithm searches for a solution sequence (a sequence of states that leads to a leaf)
- When attempting to refine its policy, nesting level *n* calls level *n*-1, bottoming out at level 0 (a rough sketch of the recursion follows this list)
- Says that when nesting succeeds, level *n* is substantially better than level *n*-1
- Doesn’t require smoothness or evaluation heuristics, but can benefit from them
- It can be DAG-ified
- “NRPA may not obtain robust policies that give viable results starting from arbitrary nodes in the tree. Instead, NRPA’s policies emphasize actions taken by the best solutions so far”
- The parameterization did not change across application domains
- Took about a day of processing to beat a previous result from a different algorithm that took 38 days to run
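
To keep the recursion straight, here is a minimal sketch of how I read the nesting structure. The names (`nested_search`, `rollout`, `adapt`, `n_iter`) are my own placeholders rather than the paper's, and `rollout`/`adapt` are trivially stubbed just so the shape runs; the talk notes below describe what they actually do.

```python
import random

def rollout(policy):
    # Stub: in the real algorithm this plays a full solution sequence from the
    # root with the stochastic rollout policy and returns (score, sequence).
    seq = [random.random() for _ in range(5)]
    return sum(seq), seq

def adapt(policy, best_seq):
    # Stub: in the real algorithm this is a gradient step that raises the
    # probability of the actions appearing in best_seq.
    return policy

def nested_search(level, policy, n_iter):
    """Level 0 is a plain rollout under the current policy; level n runs
    n_iter iterations, each calling level n-1 with a copy of the policy,
    keeps the best solution sequence seen so far, and adapts its own
    policy toward that best sequence."""
    if level == 0:
        return rollout(policy)
    best_score, best_seq = float("-inf"), None
    for _ in range(n_iter):
        score, seq = nested_search(level - 1, dict(policy), n_iter)
        if score > best_score:
            best_score, best_seq = score, seq
        policy = adapt(policy, best_seq)
    return best_score, best_seq

# e.g. nested_search(2, {}, n_iter=100) costs on the order of 100**2 rollouts
```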

————From video of talk

- Effectively “learns” a heuristic, so it’s good for cases where there is no good heuristic
- Dyna-2 is something that sort-of has the same property?
- The policy is describable as some combination of parameters (the number of parameters may be enormous, tens of thousands)
- Doesn’t care about values of nodes; cares about actions. Identifies actions that occur across successful rollouts.
- Use structure of domain to identify successful rollouts
- Rollouts always start from root
- Policies are stochastic, softmax-like: p(a) ∝ e^policy(code(a)), normalized over the legal actions; a similar thing has been used for exploration in Go in the past (a sketch of the sampling and adaptation step appears after these bullets)
- Level 1 search:
- Do n iterations
- Keep track of the best solution sequence found
- Update the policy by a gradient step on the log-probability of that best sequence
- The gradient moves the policy toward the best sequence exponentially quickly, so the number of iterations in the experiments was small (100)
- Initially the policy is totally random; it adapts by gradient toward the best sequence found, does a local optimization, and then finishes
- The point of nesting is to cause more global search to be conducted
- For level L>1 search:
- Do n iterations
- Call algorithm for the level below (L-1)
- Keep track of best
- Adapt policy by gradient method
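
A rough sketch of how I picture the level-0 rollout and the adaptation step, under my own assumptions: one weight per action code in a dict, softmax sampling over those weights, and a single gradient step on the log-probability of the best sequence (raise the weight of each action taken, lower each legal alternative in proportion to its current probability). The toy domain (`legal_actions`, `score`) and the step size `alpha` are invented purely so the snippet runs; they are not from the paper.

```python
import math
import random
from collections import defaultdict

def legal_actions(state):
    # Invented toy domain: build a sequence of 8 digits from {0, 1, 2}.
    return [0, 1, 2] if len(state) < 8 else []

def score(state):
    # Invented toy objective: maximize the sum of the chosen digits.
    return sum(state)

def softmax_probs(policy, actions):
    # p(a) is proportional to exp(policy[code(a)]); here code(a) is just a.
    weights = [math.exp(policy.get(a, 0.0)) for a in actions]
    z = sum(weights)
    return [w / z for w in weights]

def rollout(policy):
    """Play one full solution sequence from the root, sampling each move with
    the softmax probabilities above; return (score, sequence)."""
    state, seq = [], []
    while legal_actions(state):
        actions = legal_actions(state)
        a = random.choices(actions, weights=softmax_probs(policy, actions))[0]
        seq.append(a)
        state = state + [a]
    return score(state), seq

def adapt(policy, best_seq, alpha=1.0):
    """One gradient step on the log-probability of best_seq: increase the
    weight of each action taken, and decrease every legal alternative in
    proportion to its probability under the old policy."""
    new_policy = defaultdict(float, policy)
    state = []
    for a in best_seq:
        actions = legal_actions(state)
        probs = softmax_probs(policy, actions)
        new_policy[a] += alpha
        for b, p in zip(actions, probs):
            new_policy[b] -= alpha * p
        state = state + [a]
    return new_policy
```

With these two pieces, the `nested_search` skeleton sketched earlier would call `rollout` at level 0 and `adapt` after every iteration.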

- Each level takes n times as long as the level below it, so usually can’t do better than 5 levels
- I’m still not 100% on this
- The difference between the two domains tested is that one is wide and the other is deep (although both seem very large)
- Seems like level 1 is initialized totally at random when run by itself, but when run from level 2 the restarts aren’t totally random; they are drawn from the (stochastic) policy found at level 2
- Seems like the way this differs from the Nested Monte Carlo Search work it is based on is the use of gradient methods to adjust the policy
- Not directly applicable to 2-player games; he was motivated by Go but hasn’t applied it to that yet
- Doesn’t maintain statistics over the search tree, just the policy. Memory use is probably minimal then
- There are performance guarantees in the paper
