Nested Rollout Policy Adaptation for Monte Carlo Tree Search. Christopher D. Rosin. IJCAI 2011 Best Paper

  • Most MCTS algorithms use a fixed rollout policy; this algorithm adapts the rollout policy during search
  • Based on an earlier Nested Monte Carlo Search
    • But here gradient ascent on the rollout policy is used at each level of the “nested search” (what’s that?)
    • Tested on crossword puzzle construction and solitaire, generating state-of-the-art results for these domains (the previous record in solitaire was human-generated and had stood for 30 years)
    • Concerned with deterministic optimization problems here.
    • Says that, in general, nested Monte Carlo search generates state-of-the-art solutions in deterministic problems… what is that?
    • Search is initiated at a nesting level n, which is the depth of the recursion for the nested search.
      • At each nesting level, the algorithm searches for a solution sequence (a sequence of states that leads to a leaf)
      • When attempting to refine its policy, nesting level n calls level n-1, bottoming out at 0
      • Says that when nesting succeeds, level n performs substantially better than level n-1
      • Doesn’t require smoothness or evaluation heuristics, but can benefit from them
      • It can be DAG-ified
      • “NRPA may not obtain robust policies that give viable results starting from arbitrary nodes in the tree.  Instead, NRPA’s policies emphasize actions taken by the best solutions so far”
      • The parameterization did not change across application domains
      • Took about a day of processing to beat a previous result from a different algorithm that took 38 days to run
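The cost of nesting follows directly from the recursion described above: each level makes n calls to the level below, so a level-L search costs n^L base rollouts (which is why, as noted later, more than ~5 levels is impractical). A quick sketch (the function name and values are my own illustration):

```python
def total_rollouts(n_iter: int, level: int) -> int:
    # Each nesting level runs n_iter calls to the level below it,
    # bottoming out at level 0 (a single rollout), so a level-L
    # search costs n_iter ** L base rollouts in total.
    return n_iter ** level

# With the paper's reported iteration count of 100 per level:
print(total_rollouts(100, 3))  # 1000000 rollouts for a level-3 search
```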

————From video of talk

  • Effectively “learns” a heuristic, so good for cases where there is no good heuristic
    • Dyna-2 is something that sort-of has the same property?
    • Policy is describable as some combination of parameters (the parameter set may be enormous: tens of thousands)
    • Doesn’t care about values of nodes; cares about actions. Identifies actions that occur across successful rollouts.
    • Use structure of domain to identify successful rollouts
    • Rollouts always start from root
    • Policies are stochastic, softmax-like: p(a) ∝ e^policy(code(a)); something similar has been used for exploration in Go in the past
    • Level 1 search:
      • Do n iterations
        • Keep track of best policy found
        • Update the policy by gradient ascent on the log-probability of the best sequence found
        • The gradient moves the policy exponentially quickly toward the best sequence, so the iteration count in experiments was small (100)
        • Initially the policy is totally random; it then adapts by gradient toward the best sequence, performing a local optimization, and finishes
        • The point of nesting is to cause more global search to be conducted
        • For level L>1 search:
          • Do n iterations
            • Call algorithm for the level below (L-1)
            • Keep track of best
            • Adapt policy by gradient method
  • Each level takes n times as long as the level below it, so usually can’t do better than 5 levels
  • I’m still not 100% on this
  • The difference between the two domains tested is one is wide, the other is deep (although both seem very large)
  • Seems like level 1 is initialized totally at random when run by itself, but when run by level 2 the restarts aren’t totally random; they are drawn from the (stochastic) policy found at level 2
  • Seems like the way this differs from the nested Monte Carlo search work it is based on is the use of gradient methods to adjust the policy
  • Not directly applicable to 2-player games; he was motivated by Go but hasn’t applied it to that yet
  • Doesn’t maintain statistics over the search tree, just the policy, so memory use is probably minimal
  • There are guarantees for performance in paper
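The level-1 and level-L steps described in the notes above can be sketched as follows. This is a minimal sketch, not the paper's implementation: the toy string-building domain, the `code` encoding, and the step size `alpha` are my illustrative assumptions (the paper's domains were crossword construction and solitaire).

```python
import math
import random

# Toy deterministic domain (my illustration, not from the paper):
# build a 4-character string over {'a','b'}; score = positions matching TARGET.
TARGET = "abba"

def terminal(s):    return len(s) == len(TARGET)
def legal_moves(s): return ["a", "b"]
def step(s, m):     return s + m
def score(s):       return sum(c == t for c, t in zip(s, TARGET))
def code(s, m):     return (len(s), m)  # domain-specific move encoding

def rollout(policy):
    """One episode from the root; each move is sampled with probability
    proportional to exp(policy weight) -- the softmax-like stochastic policy."""
    state, seq = "", []
    while not terminal(state):
        moves = legal_moves(state)
        w = [math.exp(policy.get(code(state, m), 0.0)) for m in moves]
        m = random.choices(moves, weights=w)[0]
        seq.append(code(state, m))
        state = step(state, m)
    return score(state), seq

def adapt(policy, seq, alpha=1.0):
    """Gradient step on the log-probability of the best sequence found:
    raise the weight of each chosen move, lower every legal move in
    proportion to its current probability under the old policy."""
    new, state = dict(policy), ""
    for chosen in seq:
        moves = legal_moves(state)
        codes = [code(state, m) for m in moves]
        z = sum(math.exp(policy.get(c, 0.0)) for c in codes)
        for c in codes:
            new[c] = new.get(c, 0.0) - alpha * math.exp(policy.get(c, 0.0)) / z
        new[chosen] = new.get(chosen, 0.0) + alpha
        state = step(state, moves[codes.index(chosen)])
    return new

def nrpa(level, policy, n_iter=20):
    """Level 0 is a plain rollout; level L runs n_iter calls to level L-1,
    adapting its own copy of the policy toward the best sequence so far."""
    if level == 0:
        return rollout(policy)
    best_score, best_seq = -1, None
    for _ in range(n_iter):
        s, seq = nrpa(level - 1, dict(policy), n_iter)
        if s >= best_score:
            best_score, best_seq = s, seq
        policy = adapt(policy, best_seq)
    return best_score, best_seq

best, seq = nrpa(2, {})
print(best)  # best score found (maximum is 4 on this toy domain)
```

Note how this matches the observations above: level 1 run alone starts from a uniformly random policy, but when called from level 2 its restarts are drawn from level 2's adapted stochastic policy, and no per-node statistics are kept anywhere, only the policy weights.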
