Optimistic planning of deterministic systems. Jean-Francois Hren, Remi Munos. Recent Avances in RL 2008.

  1. Idea is if you had to do tree search with some finite budget of node expansions(which prevents you from searching the entire tree), whats the best you could do with that budget?
    1. Measured in terms of regret
  2. As you could guess, the tree is explored optimistically, compare to naive exploration
  3. Different from Sparse Sampling because of requirement for determinism
  4. Maintain bounds and drive down the highest upper bound
  5. Say work is similar to Bandit Algorithm for Smooth Trees (BAST), because the discount factor also enforces smoothness in the trees
  6. The regret for uniform planning is Θ(n^-(log(1/γ))/(log K)), for K actions
    1. For optimistic exploration the bound replaces theta with O, and K is replaced with κ which can be at most K and is the branching factor of a related subtree.
  7. The optimistic exploration is never worse than uniform, and much better when κ < K
  8. As in other works Remi is involved in, κ is related to the proportion of near optimal paths in the full look ahead tree
  9. In some non-trivial cases, κ=1 and there is no exponential cost in planning in that case
  10. If the regret algorithm returns an answer with regret ε, the total policy will be ε/1-γ optimal
  11. During optimistic exploration of the tree, the regret is bounded by γ^d_n/(1-γ).  This can be at worst as bad as uniform planning because uniform planning minimizes the depth d_n at all points, and the depth using optimistic planning is at worst the same as uniform, but can be deeper.
  12. The proportion of ε-optimal paths is bounded by cε^β, there are two extreme cases:
    1. When all paths are optimal, β=0 or κ=K, so optimal expansion and uniform expansion are equivalent
    2. The other extreme is when one path has rewards all =1, and all other paths have reward =0.
      1. In this case, the proportion of ε-optimal nodes at depth d is 1/K^d
      2. In this case, κ=1
  13. κ is the branching factor of the tree T_∞.  This tree represents the set of nodes i s.t. given the observed rewards from the root to i, it can’t be determined if i lies on the optimal path or not.  This represents the set of nodes which would have to be expanded for finding an optimal path
  14. Their experiments are more like a concrete illustration, but the results are nice
    1. In this experiment (moving a point mass to the origin) κ approx 1.04, so the branching factor of optimal exploration was quite low
    2. The regret of uniform exploration about 1/n, optimistic exploration about 1/n^3
  15. Paper may have bits relevant to game tree search

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: