Category Archives: Games

Programming a Computer for Playing Chess. Shannon. Philosophical Magazine 1950

I find myself approaching this paper more as scripture than as a scientific paper.
  1. “Although perhaps of no practical importance, the question is of theoretical interest, and it is hoped that a satisfactory solution of this problem will act as a wedge in attacking other problems of a similar  nature and of greater significance.”  He proceeds to list a number of areas that may gain from this research.
  2. “…the proper procedure involves general principles, something of the nature of judgement, and considerable trial and error, rather than a strict, unalterable computing process.”
    1. How the hell did this guy know this back then
  3. <immediately following>”Finally, the solutions of these problems are not merely right or wrong but may have a continuous range of ‘quality’ from best down to worst.  We might be satisfied with a machine that designed good filters even though they were not always the best possible”
  4. Discusses evaluation functions
  5. Discusses brute-force minimax search and why it is not feasible; likewise a table lookup over all possible policies
  6. Defines a strategy in the same way we do a policy
  7. “Although in chess there is no known simple and exact evaluating function f(P), and probably never will be because of the arbitrary and complicated nature of the rules of the game, it is still possible to perform an approximate evaluation of a position.  Any good chess player must, in fact, be able to perform such a position evaluation.”
  8. Discusses rules of thumb for chess, and how they can be combined to form an evaluation function
  9. Evaluation should only be applied at the end of an entire variation (sequence of moves), to fully account for opponent responses
  10. Describes minimax search with depth cutoffs, applying the evaluation function at the cutoff (see the sketch after this list)
  11. It's interesting to see how some of the ideas here are totally still relevant today, while other points seem archaic.  For example, he goes to great pains describing how to record a move in a computer program, or check for available moves, which is now taken for granted and considered trivial.
  12. Shannon was usually able to checkmate a randomly-playing opponent in 4 or 5 moves
  13. Discusses search methods based on what masters do, which involve doing many plies when simple, but otherwise pruning extensively and only searching a small number of plies ahead
  14. Stochasticity can be introduced so the machine randomly chooses between moves that look nearly the best, in order to make decisions more difficult for the opponent
  15. “The chief weakness is that the machine will not learn by mistakes.  The only way to improve its play is by improving the program.  Some thought has been given to designing a program which is self-improving but, although it appears to be possible, the methods thought of so far do not seem to be very practical.  One possibility is to have a higher level program which changes the terms and coefficients involved in the evaluation function depending on the results of games the machine has played.  Slam variations [Some? Transcription error? I don't think this is original] might be introduced in these terms and the values selected to give the greatest percentage of ‘wins’.”
  16. Argues that a machine that tries to play based on principles matching board configurations to analogous famous maneuvers “could not be constructed”:
    1. “Although there are various books analyzing combination play and the middle game, they are written for human consumption, not for computing machines.  It is possible to give a person one or two specific examples of a general situation and have him understand and apply the general principles involved.  With a computer an exact and completely explicit characterization of the situation must be given with all limitations, special cases, etc. taken into account.” 
  17. “It is not being suggested that we should design the strategy in our own image.  Rather it should be matched to the capacities and weaknesses of the computer.  The computer is strong in speed and accuracy and weak in analytical abilities and recognition.  Hence, it should make more use of brutal calculations than humans, but with possible variations increasing by a factor of 10^3 every move, a little selection goes a long way toward improving blind trial and error.”
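
Here's a minimal sketch of the fixed-depth search described in item 10 (Shannon's "Type A" strategy): examine every line of play to a fixed depth and apply the static evaluation f(P) at the cutoff. The game interface below (legal_moves, apply_move, evaluate, is_terminal) is a placeholder of mine, not Shannon's notation.

```python
# Minimal sketch of fixed-depth minimax with a static evaluation function
# applied at the cutoff. The game interface is a stand-in, not from the paper.

def minimax(position, depth, maximizing):
    if depth == 0 or position.is_terminal():
        return position.evaluate()          # f(P): heuristic value of the position
    values = []
    for move in position.legal_moves():
        child = position.apply_move(move)
        values.append(minimax(child, depth - 1, not maximizing))
    return max(values) if maximizing else min(values)

def best_move(position, depth):
    # Pick the move whose resulting position has the best minimax value.
    return max(position.legal_moves(),
               key=lambda m: minimax(position.apply_move(m), depth - 1, False))
```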

Score Bounded Monte-Carlo Tree Search. Cazenave, Saffidine. Computers and Games

  1. Designed for settings where there are more than 2 possible outcomes in a game (they don’t mention scores though; they mention games that can also end in a draw)
  2. Apply the algorithm to Seki and Go
  3. They discuss MCTS as done with Go in practice, but say finding specific sequences of play in that setting doesn’t work well
    1. Solvers can be used to correct this issue, as they don’t use estimates based on evaluation functions, but are instead based on real (partial) minimax solutions
  4. There are methods for propagating optimistic and pessimistic bounds (see the sketch after this list)
    1. Mention B* which I am not familiar with
    2. The optimistic and pessimistic bounds are always correct (the true value is always between them)
    3. Bounds seem to be backed up in the same way as in FS3
  5. They point out how to make the algorithm more efficient in terms of doing computation only when the bounds in the children change
    1. It is possible to do a similar thing in FS3 where we restart rollouts from a point deeper in the tree if its bound hasn’t changed (can treat the root as the last node whose bounds didn’t change). We knew about this but didn’t go in that direction just to keep the algorithm simpler
  6. This algorithm looks like FS3 without the alpha-beta style cutoffs; it only cuts nodes whose bounds are disjoint from the bounds of the parent 
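
My rough reading of the bound mechanics in items 4-6: each node keeps a pessimistic and an optimistic bound that always bracket its true minimax value, the bounds are backed up like minimax values, and a child whose interval cannot beat the parent's guaranteed value can be cut. The sketch below is just that idea at a max node, with names of my own; the authors' pseudocode also handles min nodes and the bookkeeping for when recomputation is needed.

```python
# Sketch of pessimistic/optimistic bound propagation at a max node.
# pess <= true minimax value <= opti always holds.

class Node:
    def __init__(self, pess=float('-inf'), opti=float('inf')):
        self.pess, self.opti = pess, opti
        self.children = []

def backup_max(node):
    # At a max node, both bounds are the max over the children's bounds.
    if node.children:
        node.pess = max(c.pess for c in node.children)
        node.opti = max(c.opti for c in node.children)

def prunable(parent, child):
    # A child of a max node can be discarded once its optimistic bound cannot
    # beat the value the parent is already guaranteed by some other child.
    return child.opti <= parent.pess

# Min nodes are symmetric: take the min of the bounds, and prune a child
# once child.pess >= parent.opti.
```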

Bayesian Inference in MCTS. Tesauro, Rajan, Segal. UAI 2010.

  1. Is inspired by distribution-free (frequentist) approaches like UCT but uses Bayesian Inference to perform estimates of values
  2. Bayesian approach allows alg designers to leverage evaluation functions differently
  3. When assumptions hold, it is Bayes-optimal 
  4. Stochastic trial results at leaf nodes are combined with the prior to create posterior value estimates at those nodes, which are then propagated upwards recursively
  5. For MDPs, distributions are based on a distributional max operator (there is a corresponding min for game trees)
    1. Likewise, there are distribution based upper-confidence bounds
  6. Bayesian estimates allow for faster convergence with limited data compared to UCT
  7. A paper is cited that says average value estimates are (sample) inefficient and often underestimate the true value
  8. Computation costs are within an order of magnitude of UCT
    1. Because the approach is more sample efficient, it is advantageous in cases where sampling is expensive (such as Go)
  9. Bayesian MCTS is robust to sampling policies.  Convergence of UCT depends on focusing the majority of trials on the optimal path
    1. Bayesian MCTS should better leverage parallelization / off policy sampling
  10. Max/min values are based on extremum distributions.  What's that?
    1. It is the distribution of the max (or min) of a set of random variables
    2. Min/max values are then computed from this set, either by numerical integration or by a Gaussian approximation (see the sketch after this list)
  11. The modifications to UCT proposed are very simple
  12. Gaussian approximations are nice because they keep the computations simple and cheap.  Also, some simple calculation can take into account correlation between sibling leaf nodes
  13. Looks like worst-case scenarios that are possible when using Gaussian approximation don’t come up in practice
  14. The empirical domains are very simple, but reasonable
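
To make item 10 concrete: if each child's value is modeled as an independent Gaussian, the "extremum distribution" at a max node is the distribution of the maximum of those Gaussians. Below is a tiny sketch that estimates its mean and standard deviation by plain Monte Carlo sampling; the paper's actual computation uses integration or a moment-matched Gaussian approximation and also handles sibling correlation, so treat this purely as an illustration.

```python
import numpy as np

def max_of_gaussians(means, stds, n_samples=100_000, rng=None):
    """Monte Carlo estimate of the mean/std of max_i X_i with X_i ~ N(mean_i, std_i^2).

    One way to handle the extremum distribution at a max node; a min node
    would use min() instead, and a Gaussian approximation could replace
    the sampling entirely.
    """
    rng = rng or np.random.default_rng(0)
    samples = rng.normal(loc=means, scale=stds, size=(n_samples, len(means)))
    maxima = samples.max(axis=1)
    return maxima.mean(), maxima.std()

# Example: two children with close means but very different uncertainty.
print(max_of_gaussians(means=[0.5, 0.4], stds=[0.05, 0.3]))
```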

Non-Linear Monte-Carlo Search in Civilization II. S.R.K. Branavan, David Silver, Regina Barzilay. IJCAI 2011

  1. Based on the title this must be the most awesomest paper ever.  It's the culmination of my entire existence on earth.
  2. Apply non-linear regression within MCTS to estimate Q-function from random rollouts.
  3. Because it generalizes, it doesn’t need many rollouts to make an estimate
  4. Work also pulls out information from the manual (which seems to be the focus of  a separate paper)
  5. The non-linear FA is much more effective than a linear one
  6. Result beats built-in AI 78% of the time.
    1. That is awesome, but this is AI that dates back to 1996, and was designed to work on home PCs of the era (33 MHz), so we might expect it isn't amazing.
    2. They actually used FreeCiv, so this isn't totally right (still designed for 100 MHz machines), and the AI was constrained not to cheat as it does in the real Civ (which is very fair).  I imagine the AI in FreeCiv is much better than Civ2's, though.
    3. They do, however, treat unfinished games as a loss, so draws are awarded to the AI
  7. The branching factor in the game is extremely huge, so I’m amazed anything works at all.
  8. This is similar to the Go work that used FAs to learn an evaluation function, but there linear FAs were used
  9. The VFA (value function approximator) is built locally in time (from a particular state)
    1. Global VFA was effective in Backgammon, but not Go or Civ, which are larger
  10. Elements of manual relevant to current game state are modeled as hidden variables in the non-linear VFA
  11. Use 4-layer NN for VFA, trained using rollouts.  Layers are as follows:
    1. Game state, action, and game manual
    2. Linguistic model of manual
    3. Fixed features that perform predefined transformations on first 2 layers
    4. Value function
  12. Rollouts train all of these, including the linguistic model
  13. They argue nonlinear VFAs generalize better from a small number of rollouts – that's unintuitive to me since the hypothesis space of linear VFAs is smaller and should be learnt more quickly (even if its accuracy is worse at the end)
  14. ANNs are helpful because they can process the input in stages
  15. There is already some literature on RL in Civ, but this is the first that plays the entire game, as opposed to some subset of it, or only selecting from options
  16. Use game rules and opponent actions as part of the black-box transition function
  17. Manual text that is extracted and fed into the ANN depends on the situation
  18. It's funny they don't at least carry over the ANN from the last step as an initial point to start learning; I guess it would bias the samples, and starting fresh is better.
  19. Stanford parser used on the manual
  20. Rollouts are 20 steps long and use the game score as the evaluation function.  500 rollouts are performed (see the sketch after this list)
  21. Almost half a million features fed into the ANN
  22. Exploration is epsilon-greedy
  23. Played through a 100 step game in 1.5 hours, not so bad
  24. Compare to 3 other methods:
    1. MCTS: each item in the game computes its search tree independently; this is the only algorithm that explicitly builds a search tree
    2. Linear MC – similar to the linear VFA method used for Go
    3. Non-linear MC is basically the same method as in the paper, except no manual
    4. Random-Text same as proposed method, but manual has scrambled word order (but same word content)
  25. The other comparison methods are much less effective against the AI (next best is non-linear MC with about 30%).
    1. Regular MCTS didn’t win a single game
  26. It's worth mentioning that they use the Civ2 manual to play FreeCiv.  While they are very similar, they are not exactly the same game.
  27. There is another paper “Learning to Win by Reading Manuals in a Monte-Carlo Framework” that deals more with just dealing with the manual.
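
A rough sketch of the rollout loop as I picture it from the notes above: fixed-length (20-step) epsilon-greedy rollouts from the current state, the game score at the end of each rollout as the training target, a non-linear value-function approximator fit to the rollout data, then act greedily. The game interface and the approximator object are stand-ins of mine; in the paper the approximator is the 4-layer neural net over state, action and manual features, retrained from scratch at every decision.

```python
import random

def nonlinear_mc_search(state, approximator, n_rollouts=500, horizon=20, eps=0.1):
    """Sketch of one non-linear Monte-Carlo search step.

    approximator: anything with predict(state, action) -> float and
                  update(state, action, target) -> None; in the paper this is
                  a 4-layer neural net over game, action, and manual features.
    The game interface (copy, legal_actions, step, score) is a placeholder.
    """
    for _ in range(n_rollouts):
        sim, trace = state.copy(), []
        for _ in range(horizon):
            actions = sim.legal_actions()
            if random.random() < eps:                       # epsilon-greedy exploration
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a: approximator.predict(sim, a))
            trace.append((sim.copy(), a))
            sim.step(a)
        target = sim.score()                                # game score as evaluation
        for s, a in trace:
            approximator.update(s, a, target)               # fit the non-linear VFA
    # Act greedily with respect to the locally learned value function.
    return max(state.legal_actions(),
               key=lambda a: approximator.predict(state, a))
```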

On the Behavior of UCT in Synthetic Search Spaces. Ramanujan, Sabharwal, Selman. ICAPS 2011(?)

  1. Basic point is that A-B search is good in some areas (chess), whereas UCT is good in others (go)
  2. Say failure of A-B in go is due to large branching factor and poor evaluation functions
  3. What other characteristics can make one better than the other in certain domains?  This has been done a bit on real games, but the size of real games makes careful analysis difficult.  The idea here is to use synthetic trees that are easy to evaluate to determine strengths and weaknesses
  4. Corrupt the (known) heuristic value with Gaussian noise (see the sketch after this list)
  5. One key is that random sampling deeper in the subtree should produce a value similar to the true value for UCT to work well
  6. Claim that a major characteristic of games where UCT does poorly and A-B does well is the presence of shallow search traps, where an opponent can easily win in a few moves (essentially early terminal states)
  7. Cites a number of papers that discusses various types of synthetic search trees
  8. There's a bit of math about the particular type of tree they're using (to show accuracy of estimations of node values) that isn't otherwise extremely interesting
    1. All that discussion is strange because the trees they are running their algorithms on aren't enormous (depth 14), so it isn't hard to find the real minimax value
  9. Oh, it's over?
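
A sketch of the kind of synthetic tree I take them to be using (item 4): build a random game tree, compute exact minimax values bottom-up, and give the search a heuristic equal to a node's true value plus Gaussian noise. This is my guess at a minimal generator, not the paper's exact construction.

```python
import random

def build_tree(depth, branching, rng=random.Random(0)):
    """Return (true_value, children) for a synthetic max/min tree with random
    leaf values in [0, 1]; internal values are exact minimax values."""
    def rec(d, maximizing):
        if d == 0:
            return rng.random(), []
        children = [rec(d - 1, not maximizing) for _ in range(branching)]
        vals = [v for v, _ in children]
        return (max(vals) if maximizing else min(vals)), children
    return rec(depth, True)

def noisy_heuristic(true_value, sigma, rng=random.Random(1)):
    # Heuristic = true node value corrupted by Gaussian noise; the exact noise
    # model in the paper may differ.
    return true_value + rng.gauss(0.0, sigma)
```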

Best-First Fixed-Depth Minimax Algorithms. Aske Plaat, Jonathan Schaeffer, Wim Pijls, Arie de Bruin. Journal of AI 1996.

  1. “Interesting alternatives to depth-first searching, such as breadth-first and best-first strategies, have been largely ignored in practice”
  2. SSS* is best-first, while A-B is depth-first
  3. Here they don't use the name AB-SSS*, they call it MT-SSS*?
  4. “MTD can be used to construct a variety of fixed-depth best-first search algorithms using depth-first search.”
  5. They say that the assumptions for SSS*’s dominance of A-B aren’t met in practice (due to move reordering and use of iterative deepening).  Empirical results have A-B actually beating MT-SSS* in checkers at some plies for this reason
  6. Talk about significant differences between real-world, and synthetic game trees
  7. MTD(f) is the newest as of the time of writing.  It beats a state of the art implementation of Negascout in terms of total nodes, leaves, and clock time.   I think this is AB-SSS* or very very similar.
  8. They say the connection between A-B and SSS* starts with an upper bound on the minimax value, and the max-solution tree, which is the minimal search tree that proves an upper bound
  9. Most game playing programs use iterative deepening: “It is based on the assumption that a shallow search is a good approximation of a deeper search.”  “Among the benefits of iterative deepening (ID) in game-playing programs are better move ordering…, and advantages for tournament time control information.”
  10. “Transposition tables are often used in conjunction with iterative deepening to achieve a partial move ordering.  The search value and the branch leading to the highest score (and best move) are saved for each node.  When iterative deepening searches one level deeper and revisits nodes, the move suggested by the transposition table (if available) is searched first” (see the sketch after this list)
  11. Transposition tables are also used to recycle computation when it exists in a graph structure
  12. In empirical results they use very strong programs for comparisons, and modify them as little as possible
  13. The states from games used are states that actually came up in tournament play; they mention that sometimes the states used are contrived
  14. They are concerned with node/leaf re-expansions, but in pure trees this isn’t an issue
  15. They say that in practice, memory actually isn’t an issue – and this was written a long time ago
  16. They keep referring to MT-SSS*, but I don't see where it's described explicitly…
  17. The MT (Memory-enhanced Test) framework allows depth-first procedures to implement best-first algorithms
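
A sketch of the iterative deepening / transposition table interaction quoted in items 9-10: the table remembers the best move found for a position at the previous, shallower depth, and the next iteration searches that move first. This is a negamax form with a placeholder game interface of mine (evaluate is assumed to score from the side to move); real programs also store values and bounds in the table, which is omitted here.

```python
def alphabeta(pos, depth, alpha, beta, tt):
    """Alpha-beta (negamax) that uses a transposition table only for move ordering."""
    if depth == 0 or pos.is_terminal():
        return pos.evaluate()
    moves = pos.legal_moves()
    hint = tt.get(pos.key())                     # best move from a shallower search
    if hint in moves:
        moves.remove(hint)
        moves.insert(0, hint)                    # try the remembered move first
    best_val, best_move = float('-inf'), None
    for m in moves:
        val = -alphabeta(pos.apply_move(m), depth - 1, -beta, -alpha, tt)
        if val > best_val:
            best_val, best_move = val, m
        alpha = max(alpha, val)
        if alpha >= beta:
            break                                # cutoff
    tt[pos.key()] = best_move                    # remember for the next iteration
    return best_val

def iterative_deepening(pos, max_depth):
    tt, val = {}, None
    for d in range(1, max_depth + 1):            # shallow results order deeper searches
        val = alphabeta(pos, d, float('-inf'), float('inf'), tt)
    return val, tt.get(pos.key())
```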

Memory-enhanced Test: a Framework

  1. MT(n,gamma) = A-B(n, gamma-1, gamma) when using transposition tables
  2. One method is to start with gamma = inf and move down from there
  3. Talk about “drivers” for MT, which specify:
    1. n: the root of the tree
    2. first: the starting bound for MT
    3. next: after the completion of a search, how the bounds are updated. This is a function.
  4. MTD would be an algorithm using the MT framework.  They list a number of instantiations of algorithms in the framework (see the sketch after this list).
    1. AB-SSS* = MT-SSS* = MTD(+inf): the first bound is +inf, and each subsequent bound is the value g returned by the previous pass
    2. Another (DUAL*) searches upward from -inf; yet another tries to do a binary search on the true minimax value, which they call MTD(bi)
      1. Mention an equivalent algorithm C* that is also guaranteed to dominate A-B
    3. Can also specify a specific step size to move the bounds by, called MTD(step)
    4. Say MTD(bi) and MTD(step) seem inferior
  5. Transposition tables prevent wasteful re-expansion
  6. MTD(f) is consistently the best algorithm.  They say in practice in these agents, it actually uses less memory than A-B
    1. Says SSS*/MTD(+inf) is optimistic (the paper seems to be very inconsistent with names, which is even worse across papers)
    2. DUAL*/MTD(-inf) is pessimistic
    3. and MTD(f) is realistic
  7. Difference between the algorithms is just a few percent, though (mainly because of the excellent move ordering and other tricks the game programs do aside from the searching itself)
  8. Number of nodes expanded in experiments depends on whether the number of plies is odd or even, which we didn’t see in our synthetic results
  9. If you have a guess as to what the minimax value actually will be, initializing the null window to around that area seems to help (for MTD(f))
  10. “…if MTD(f) is to be effective, the f obtained from the previous iteration must be a good indicator of the next iteration’s value”
  11. Keep in mind the empirical results are relative to very effective game programs that do excellent move ordering and other tricks
  12. Argue against reporting results on synthetic trees:
    1. Real trees have variable branching factor
    2. Value interdependence from parents to children – usually parent values are good predictors for children (a shallow search is a good approximation of a deeper one).  This is probably why UCT can perform so well at Go.
    3. Existence of graph structure, allows for use of transposition table
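
To make the framework concrete, here is my sketch of the MTD driver and the underlying MT call (null-window alpha-beta with a table of stored bounds). It follows my reading of the paper rather than reproducing its pseudocode; evaluations are assumed integer-valued, and in practice the infinities would be large finite sentinels. first = +inf behaves like MT-SSS*, first = -inf like DUAL*, and first = f (a guess at the minimax value) gives MTD(f).

```python
def mt(pos, gamma, depth, tt):
    """Memory-enhanced Test: null-window alpha-beta AB(gamma-1, gamma) with
    stored bounds.  Returns g < gamma (fail low, g is an upper bound) or
    g >= gamma (fail high, g is a lower bound).  Integer evaluations assumed."""
    lo, hi = tt.get((pos.key(), depth), (float('-inf'), float('inf')))
    if lo >= gamma:
        return lo
    if hi < gamma:
        return hi
    if depth == 0 or pos.is_terminal():
        g = pos.evaluate()
    elif pos.max_to_move():
        g = float('-inf')
        for m in pos.legal_moves():
            g = max(g, mt(pos.apply_move(m), gamma, depth - 1, tt))
            if g >= gamma:
                break                              # beta cutoff
    else:
        g = float('inf')
        for m in pos.legal_moves():
            g = min(g, mt(pos.apply_move(m), gamma, depth - 1, tt))
            if g < gamma:
                break                              # alpha cutoff
    if g < gamma:
        hi = g                                     # refined upper bound
    else:
        lo = g                                     # refined lower bound
    tt[(pos.key(), depth)] = (lo, hi)
    return g

def mtd(pos, depth, first):
    """Driver: repeated null-window MT calls until the bounds meet.
    first = +inf ~ MT-SSS*, first = -inf ~ DUAL*, first = f ~ MTD(f)."""
    tt, g = {}, first
    lo, hi = float('-inf'), float('inf')
    while lo < hi:
        gamma = g + 1 if g == lo else g            # keep the window null
        g = mt(pos, gamma, depth, tt)
        if g < gamma:
            hi = g
        else:
            lo = g
    return g
```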

Appendix on MT-SSS*

  1. Every top-level call to MT fails low and builds a max-solution tree.  The last call to MT (which stops the search) fails high and builds a min-solution tree
  2. In the first pass, the left-most solution tree with finite g-value (value of the tree rooted at specified node n) is constructed
  3. During following passes, the following conditions hold:
    1. Before each pass, the tree exposed is the left-most max solution tree with upper bound = gamma.  The children to the left of the min node (child of the root) have already been searched, and the lower bound on those subtrees is > gamma
    2. During each pass, every node visited belongs to the max solution tree with upper bound = gamma.  For min nodes, children to the left of the child of n (there is only one in the max-solution tree) have a lower bound > gamma and will never be revisited
    3. Whenever nested calls to MT(n, gamma) fail low, a max-solution tree is created where children of min-nodes have property 1

Information-Based Alpha-Beta Search and the Homicidal Chauffeur. Todd W. Neller. Hybrid Systems: Computation and Control. 2002

  1. Discusses why control techniques that may be locally optimal are often not globally optimal
  2. As title suggests, algorithm proposed is called Information-Based A-B Search
  3. The homicidal chauffeur (HC) is similar to the cat and mouse game, and is in the class of differential games
    1. The version of HC here is slightly different from the normal version (one way is that moves are not simultaneous)
  4. Major differences between this version and classic A-B search is:
    1. Because of the formalism used, which involves a delay function (that I don't really understand), the utility function is computed with an extra step; seems minor
    2. There is a continuum of actions to be selected from
    3. There is also a difference from the normal formalism for finding next states
  5. The action selection method used over the continuum of actions is in the class of max-likelihood optimization techniques, called information-based optimization
    1. The meat of the details is left to another paper
    2. The basic idea is that the algorithm starts with a target value.  At each step of optimization the algorithm evaluates the point most likely to have the target value given the history
    3. The version they use assumes parabolic curves in between sample points, so the result is probably similar in character to Gaussian process optimization (see the sketch after this list)
  6. Even though the action selection is computationally expensive, more efficiently searching the tree can easily compensate for this
    1. In empirical results they bear this out, as search size increases
    2. They see nearly exponential savings in wall time as search breadth increases against random action selection
      1. They have the opposite result for search depth, and propose using random selection deeper in the tree
    3. They see exponential savings compared to uniform action discretization
  7. It would be neat to put FS3+HOO against this algorithm.
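
My loose reading of how the action selection works (the details are in the cited optimization paper): keep the sampled (action, value) pairs, interpolate between neighbouring samples with parabolas, and evaluate next the action whose interpolated value is closest to the current target value. The sketch below is a crude stand-in for that idea with made-up parameters; it is not Neller's actual algorithm.

```python
import numpy as np

def target_value_search(f, lo, hi, target, n_evals=12):
    """Crude sketch of target-value sampling with parabolic interpolation:
    repeatedly evaluate the point whose interpolated value (quadratic through
    three neighbouring samples) comes closest to `target`."""
    xs = [lo, (lo + hi) / 2.0, hi]
    ys = [f(x) for x in xs]
    for _ in range(n_evals - 3):
        best_x, best_gap = None, float('inf')
        for i in range(len(xs) - 2):
            # Parabola through three consecutive samples, searched on a grid.
            coeffs = np.polyfit(xs[i:i + 3], ys[i:i + 3], 2)
            grid = np.linspace(xs[i], xs[i + 2], 50)
            vals = np.polyval(coeffs, grid)
            j = int(np.argmin(np.abs(vals - target)))
            gap = abs(vals[j] - target)
            if gap < best_gap and grid[j] not in xs:
                best_gap, best_x = gap, float(grid[j])
        if best_x is None:
            break
        k = int(np.searchsorted(xs, best_x))
        xs.insert(k, best_x)
        ys.insert(k, f(best_x))
    i = int(np.argmax(ys))               # return the best action actually evaluated
    return xs[i], ys[i]

# e.g. target_value_search(lambda a: -(a - 0.3) ** 2, 0.0, 1.0, target=0.0)
# concentrates evaluations where the value plausibly reaches the target.
```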

Optimistic planning of deterministic systems. Jean-Francois Hren, Remi Munos. Recent Advances in RL 2008.

  1. The idea is: if you had to do tree search with some finite budget of node expansions (which prevents you from searching the entire tree), what's the best you could do with that budget?
    1. Measured in terms of regret
  2. As you could guess, the tree is explored optimistically; this is compared against naive (uniform) exploration
  3. Different from Sparse Sampling because of requirement for determinism
  4. Maintain bounds and drive down the highest upper bound (see the sketch after this list)
  5. Say work is similar to Bandit Algorithm for Smooth Trees (BAST), because the discount factor also enforces smoothness in the trees
  6. The regret for uniform planning is Θ(n^(-log(1/γ)/log K)), for K actions
    1. For optimistic exploration the bound replaces theta with O, and K is replaced with κ which can be at most K and is the branching factor of a related subtree.
  7. The optimistic exploration is never worse than uniform, and much better when κ < K
  8. As in other works Remi is involved in, κ is related to the proportion of near optimal paths in the full look ahead tree
  9. In some non-trivial cases, κ=1 and there is no exponential cost in planning in that case
  10. If the planning algorithm returns an answer with regret ε, the overall policy will be ε/(1-γ)-optimal
  11. During optimistic exploration of the tree, the regret is bounded by γ^d_n/(1-γ).  This can be at worst as bad as uniform planning because uniform planning minimizes the depth d_n at all points, and the depth using optimistic planning is at worst the same as uniform, but can be deeper.
  12. The proportion of ε-optimal paths is bounded by cε^β, there are two extreme cases:
    1. When all paths are optimal, β=0 (equivalently, κ=K), so optimistic expansion and uniform expansion are equivalent
    2. The other extreme is when one path has rewards all =1, and all other paths have reward =0.
      1. In this case, the proportion of ε-optimal nodes at depth d is 1/K^d
      2. In this case, κ=1
  13. κ is the branching factor of the tree T_∞.  This tree represents the set of nodes i s.t. given the observed rewards from the root to i, it can’t be determined if i lies on the optimal path or not.  This represents the set of nodes which would have to be expanded for finding an optimal path
  14. Their experiments are more like a concrete illustration, but the results are nice
    1. In this experiment (moving a point mass to the origin) κ approx 1.04, so the branching factor of optimal exploration was quite low
    2. The regret of uniform exploration was about 1/n, and of optimistic exploration about 1/n^3
  15. Paper may have bits relevant to game tree search
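
A compact sketch of the optimistic planning loop as I read it: each leaf of the partially built tree gets u = discounted sum of rewards along its path and an upper bound b = u + γ^d/(1-γ), the leaf with the largest b is expanded, and after the budget is spent a root action is returned. Names, data layout, and the final action choice below are mine.

```python
import math

def optimistic_plan(root_state, actions, step, n_expansions, gamma):
    """Sketch of optimistic planning for deterministic systems.

    step(state, a) -> (next_state, reward) is the deterministic model; rewards
    are assumed to lie in [0, 1] as in the paper.  Repeatedly expand the leaf
    with the largest upper bound b = u + gamma^d / (1 - gamma), where u is the
    discounted reward accumulated along the leaf's path and d is its depth.
    Returns the root action on the path with the best u found (a simple choice
    of mine; the paper's exact recommendation may differ)."""
    leaves = [(root_state, 0, 0.0, None)]          # (state, depth, u, root action)
    best_u, best_action = -math.inf, actions[0]
    for _ in range(n_expansions):
        # Pick the leaf whose optimistic value (b-value) is largest.
        leaf = max(leaves, key=lambda l: l[2] + gamma ** l[1] / (1.0 - gamma))
        leaves.remove(leaf)
        state, depth, u, first = leaf
        for a in actions:
            nxt, r = step(state, a)
            child_u = u + (gamma ** depth) * r
            child_first = a if first is None else first
            leaves.append((nxt, depth + 1, child_u, child_first))
            if child_u > best_u:
                best_u, best_action = child_u, child_first
    return best_action
```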