Videolectures:

- Using upper confidence bounds to control exploration and exploitation (Csaba Szepesvari)
- Exploration exploitation in Go: UCT for Monte-Carlo Go (Sylvain Gelly):
    - Each game configuration is a bandit problem
    - Each move is an arm
    - Best move means maximizing reward (the reward is phrased as the expected discounted reward?)

A Sparse Sampling Algorithm for Near-Optimal Planning in Large MDPs (Kearns, Mansour, Ng):

- Assumes a generative model (so does UCT)
- The generative model provides less information than the actual <S, A, T, R> sets and functions of the MDP, but more information than trace data alone.

- Does roll-outs from a current state. The amount of time required to compute a near-optimal action from any state has no dependence on the number of states in the MDP.
- Running time is exponential in horizon time, but this is shown to be unavoidable without other assumptions.
- The planning algorithm can be viewed as a stochastic policy which happens to use the generative model as a subroutine.
- Theorem 1: there exists a randomized algorithm that, given the assumptions above, has the following qualities:
    - Efficiency: takes time (k/(ε(1-γ)))^{O((1/(1-γ)) log(1/(ε(1-γ))))}, for action count k.
    - Near-optimality: for all states, the value under the resulting policy is at most ε off of optimal.

- Designed to be useful in cases where the state space is very large, and standard table representation methods are not practical.
- From each state, each action is tried a constant number of times *C* during the rollout.
- The algorithm estimates the discounted reward out to some given horizon *H*. If *H* = log_γ(ε(1-γ)/Rmax), then the maximum discounted sum of rewards after *H* steps is ε.
- The rollouts may only represent a “vanishingly small” fraction of all trajectories of length *H* that are possible. Running time is exponential in *H* and 1/(1-γ) (the ε-horizon). *Why is it not present in the bounds in 5.1?*
- The running time can be reduced to the square root of the original running time (but is still superpolynomial in depth) by choosing *C* based on depth, s.t. *C_i* = γ^{2i} *C* for depth *i*. This means the width *C* gets thinner as the trajectory deepens, which can make sense since rewards further away are worth less. The error induced by using this rule is close to the error without the pruning; other pruning methods could be applicable as well.
- Memoization is also an option (collapsing common states at a given depth together).
- Iterative deepening can be used as well.

- Proof for the lower bound on the number of calls made to the generative model (Thm 2 in paper).

Bandit based Monte-Carlo Planning (Kocsis, Szepesvari):

- Previously, Kearns et al. showed that with sparse sampling, finding the ε-optimal action from a start state requires a tree with depth proportional to (1/(1-γ)) log(1/(ε(1-γ))), and width proportional to K/(ε(1-γ)), where K is the number of actions.
- Unfortunately, these values are generally too large in practice to be useful in large problems.

- Goals of UCT are to reach a small error probability if computation is terminated early, and to eventually converge to the optimal policy with enough time.
- Claim that in general Monte-Carlo planning seems to be a very good solution for game-tree search.
- UCT applies UCB1 to rollout-based MC planning.
- UCT does merge results from states that are re-encountered; without this, the algorithm degenerates into vanilla MC planning.
- UCB1 chooses an arm based on the mean estimate plus a bias term. The bias is *c*_{t,s} = sqrt((2 ln t)/s); with this bias term, the probability of the true mean falling outside the confidence bounds is <= t^{-4}.
- In UCT this bias term is altered slightly to *c*_{t,s} = 2 *C_p* sqrt((ln t)/s) for some constant *C_p*.
- *I just remembered that this won’t work directly in continuous action spaces because it relies on a count for each <state, action>, which in the continuous setting will not work.*
- *I don’t think I picked up on this the last time I read the paper, but there is a lot of discussion of everything being covered in the nonstationary case. At first I thought it just had to do with nonstationarity in the MDP, but I think it actually has to do with the policy changing over time, so the Q-function changes with time.*
- UCB1 never stops exploring: the number of times each arm is pulled is lower bounded by a (slowly growing) function of the number of trials.
- *I remember this paper being very hand-wavy. The only way I was able to implement it before was by leveraging the UCT/Go paper (Gelly, Silver).*
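The selection rule can be sketched in a few lines. The data layout is my own assumption, not the paper's notation: `stats` maps each arm to a `(total_reward, pull_count)` pair, and `t` is the total plays at the node.

```python
import math

def uct_select(stats, t, c_p=1.0):
    """Return the arm maximizing mean + 2*C_p*sqrt(ln t / s), where s is
    that arm's pull count and t is the total plays at this node."""
    def score(arm):
        total, s = stats[arm]
        if s == 0:
            return float("inf")  # force every arm to be tried at least once
        return total / s + 2 * c_p * math.sqrt(math.log(t) / s)
    return max(stats, key=score)
```

The bias shrinks as an arm's count `s` grows but never reaches zero, which is why exploration never stops.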

Some ideas for generalizing to the continuous state case in UCT:

* Allow restarts from lower down in the tree (otherwise, no repetition).

* Use generalization to approximate counts and values so equality isn’t needed. (That’s closer to the method we’ve been talking about.)
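As a crude sketch of the second idea, visit counts can be approximated by bucketing nearby states together. Grid discretization here is just the simplest stand-in for real generalization, and the cell width is an arbitrary illustration parameter.

```python
import math
from collections import defaultdict

def discretize(state, cell=0.5):
    """Map a continuous state (tuple of floats) to a grid cell so that
    nearby states share a single visit count, no exact equality needed."""
    return tuple(math.floor(x / cell) for x in state)

visit_counts = defaultdict(int)
for s in [(0.1, 0.2), (0.12, 0.18), (0.9, 0.2)]:
    visit_counts[discretize(s)] += 1  # first two states land in one cell
```

A learned value/count approximator would replace `discretize` in the approach we have been discussing; the point is only that the <state, action> count UCT needs can be defined over cells rather than exact states.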