Optimistic planning of deterministic systems. Jean-Francois Hren, Remi Munos. Recent Avances in RL 2008.

Idea is if you had to do tree search with some finite budget of node expansions(which prevents you from searching the entire tree), whats the best you could do with that budget?

Measured in terms of regret

As you could guess, the tree is explored optimistically, compare to naive exploration

Different from Sparse Sampling because of requirement for determinism

Maintain bounds and drive down the highest upper bound

Say work is similar to Bandit Algorithm for Smooth Trees (BAST), because the discount factor also enforces smoothness in the trees

The regret for uniform planning is Θ(n^-(log(1/γ))/(log K)), for K actions

For optimistic exploration the bound replaces theta with O, and K is replaced with κ which can be at most K and is the branching factor of a related subtree.

The optimistic exploration is never worse than uniform, and much better when κ < K

As in other works Remi is involved in, κ is related to the proportion of near optimal paths in the full look ahead tree

In some non-trivial cases, κ=1 and there is no exponential cost in planning in that case

If the regret algorithm returns an answer with regret ε, the total policy will be ε/1-γ optimal

During optimistic exploration of the tree, the regret is bounded by γ^d_n/(1-γ). This can be at worst as bad as uniform planning because uniform planning minimizes the depth d_n at all points, and the depth using optimistic planning is at worst the same as uniform, but can be deeper.

The proportion of ε-optimal paths is bounded by cε^β, there are two extreme cases:

When all paths are optimal, β=0 or κ=K, so optimal expansion and uniform expansion are equivalent

The other extreme is when one path has rewards all =1, and all other paths have reward =0.

In this case, the proportion of ε-optimal nodes at depth d is 1/K^d

In this case, κ=1

κ is the branching factor of the tree T_∞. This tree represents the set of nodes i s.t. given the observed rewards from the root to i, it can’t be determined if i lies on the optimal path or not. This represents the set of nodes which would have to be expanded for finding an optimal path

Their experiments are more like a concrete illustration, but the results are nice

In this experiment (moving a point mass to the origin) κ approx 1.04, so the branching factor of optimal exploration was quite low

The regret of uniform exploration about 1/n, optimistic exploration about 1/n^3

Privacy & Cookies: This site uses cookies. By continuing to use this website, you agree to their use.
To find out more, including how to control cookies, see here:
Cookie Policy