On-line Policy Improvement Using Monte-Carlo Search

The basic takeaway of this paper is that even limited rollouts can significantly improve the performance of a weak base policy.
  1. For stochastic domains, Monte-Carlo estimation is necessary (the implication being that the base policy itself is deterministic)
  2. Discusses root parallelization and pruning heuristics
  3. On average, backgammon rollouts need to run around 30 steps to play a game to completion
  4. There are generally about 20 legal moves to consider at each step. Differences in the values of candidate actions are often only about 0.01, while scores range from 1 to 3 (or their negatives) for a win, gammon, and backgammon, respectively
  5. Based on this, pure Monte-Carlo sampling would require hundreds of thousands of rollouts to distinguish the best action
    1. Even with pruning, roughly a million decisions must be made to reach a result; a typical tournament-level human player takes roughly 10 seconds per move
  6. Adding rollouts lifts linear policies from about -0.5 to essentially 0 (the opponent is the most basic configuration of TD-Gammon 2.1, with no lookahead)
  7. The next experiments use limited-length (truncated) rollouts of 7 or 11 steps, then apply an ANN "equity" (evaluation) function at the cutoff
  8. Points to a paper by Shannon from 1950 that discusses rollouts
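The rollout procedure described above can be sketched as follows. This is a minimal illustration on a hypothetical toy game (a stochastic race on positions 0–10), not the paper's backgammon setup: for each legal action, play the game out many times with the weak base policy, average the returns, and act greedily on those estimates. The truncated-rollout variant from item 7 is supported via `max_depth` and `value_fn` (both assumed names).

```python
import random

# Toy stochastic stand-in for backgammon: a race on positions 0..10.
# Reaching 10 wins (+1), reaching 0 loses (-1).  A chosen move goes in
# the intended direction with prob. 0.8, otherwise backwards.
GOAL, TRAP = 10, 0

def legal_actions(state):
    return (+1, -1)

def step(state, action):
    move = action if random.random() < 0.8 else -action
    return max(TRAP, min(GOAL, state + move))

def is_terminal(state):
    return state in (TRAP, GOAL)

def score(state):
    return 1.0 if state == GOAL else -1.0

def base_policy(state):
    # deliberately weak base policy: move at random
    return random.choice(legal_actions(state))

def rollout_value(state, action, num_rollouts=500, max_depth=None, value_fn=None):
    """Average return of taking `action` and then following base_policy.
    If max_depth is hit before the game ends, fall back to value_fn
    (the truncated-rollout + evaluation-function variant)."""
    total = 0.0
    for _ in range(num_rollouts):
        s, depth = step(state, action), 0
        while not is_terminal(s) and (max_depth is None or depth < max_depth):
            s = step(s, base_policy(s))
            depth += 1
        total += score(s) if is_terminal(s) else value_fn(s)
    return total / num_rollouts

def rollout_policy(state, **kw):
    """One step of on-line policy improvement: pick the legal action
    with the highest Monte-Carlo value estimate."""
    return max(legal_actions(state), key=lambda a: rollout_value(state, a, **kw))

random.seed(0)
# Near the goal, the rollout policy pushes forward even though the
# base policy it simulates with is purely random.
print(rollout_policy(8))
```

The key point, matching the paper's result, is that the decision-time policy (`rollout_policy`) is strictly better informed than the base policy it simulates with, at the cost of many extra game simulations per move.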
