On-line Policy Improvement Using Monte-Carlo Search


The basic takeaway of this paper is that even limited rollouts can significantly improve the performance of poor policies:
  1. For domains with stochasticity, Monte-Carlo estimation is necessary (the implication being that the base policy itself is deterministic, so the randomness comes from the environment, i.e., the dice)
  2. Discusses root parallelization and pruning heuristics
  3. On average, rollouts in backgammon need to be around 30 steps long to play to completion
  4. There are generally about 20 legal moves to consider at each step; the differences in value between the candidate moves are often only around 0.01, while final scores range from 1 to 3 (or their negatives) for a win, gammon, and backgammon, respectively
  5. Based on this, pure Monte-Carlo sampling would require hundreds of thousands of rollouts per decision to resolve such small differences (a rough calculation follows the list)
    1. Even with pruning, roughly a million basic decisions have to be simulated to produce a single move, whereas typical tournament-level human players take roughly 10 seconds per move
  6. Adding rollouts takes linear base policies from an equity of about -0.5 to essentially 0 (the opponent is the most basic configuration of TD-Gammon 2.1, with no lookahead)
  7. The next experiments use limited-length rollouts (7 or 11 steps) and then apply an ANN "equity" (evaluation) function at the truncation point (a sketch of the overall rollout procedure, including truncation, follows the list)
  8. Points to a paper by Shannon from 1950 that discusses rollouts
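
The claim in items 4-5 follows from a rough back-of-the-envelope estimate. Assuming (my assumption; the notes give no figure) that a single rollout outcome has a standard deviation on the order of 1 point, resolving equity differences of about 0.01 between candidate moves requires the standard error of each estimate to be a few times smaller than that gap:

$$\frac{\sigma}{\sqrt{n}} \approx 0.005 \;\Rightarrow\; n \approx \left(\frac{\sigma}{0.005}\right)^2 = \left(\frac{1}{0.005}\right)^2 = 4\times 10^4 \text{ rollouts per candidate move.}$$

With roughly 20 candidate moves per position, that is on the order of $8\times 10^5$ rollouts per decision, i.e., the "hundreds of thousands" figure above.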
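
To make the rollout procedure in items 1, 6, and 7 concrete, here is a minimal Python sketch of choosing a move by Monte-Carlo rollouts of a fixed base policy, with optional truncation plus an ANN equity estimate. All engine-facing names (`simulate_step`, `base_policy`, `equity_net`, `state.is_terminal()`, `state.score()`) are hypothetical stand-ins; the paper's actual implementation additionally prunes candidate moves and runs rollouts in parallel.

```python
from statistics import mean

def rollout_move(state, legal_moves, base_policy, simulate_step,
                 equity_net=None, n_rollouts=1000, max_depth=None):
    """Pick the candidate move with the highest Monte-Carlo rollout equity.

    The engine API used here (simulate_step, base_policy, equity_net,
    state.is_terminal(), state.score()) is a hypothetical stand-in.
    """
    best_move, best_equity = None, float("-inf")
    for move in legal_moves:
        outcomes = []
        for _ in range(n_rollouts):
            s = simulate_step(state, move)   # apply the candidate move, roll the dice
            depth = 0
            # Play the fixed base policy for both sides until the game ends,
            # or stop early at max_depth for a truncated rollout.
            while not s.is_terminal() and (max_depth is None or depth < max_depth):
                s = simulate_step(s, base_policy(s))
                depth += 1
            if s.is_terminal():
                # Final score: +/-1, +/-2, +/-3 for win, gammon, backgammon.
                outcomes.append(s.score())
            else:
                # Truncated rollout: fall back on the learned ANN equity estimate.
                outcomes.append(equity_net(s))
        equity = mean(outcomes)              # Monte-Carlo estimate of this move's equity
        if equity > best_equity:
            best_move, best_equity = move, equity
    return best_move
```

Full rollouts (items 1-6) correspond to `max_depth=None`; the truncated variant in item 7 corresponds to `max_depth=7` or `11` with `equity_net` supplied.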