After discussing the idea with Sébastien Bubeck at ICML, I implemented the algorithm, which I've been calling big-ol-bandit (BOB). I'm pretty sure he was well aware it could be used this way (it's very similar to open-loop optimistic planning), but since I don't think anyone had actually tried it, it seemed worth doing, especially since it was pretty trivial to build from my existing code base.

Basically, it does global stochastic optimization over the return of an entire sequence of actions, whereas HOOT does global stochastic optimization over the return of a single action at a given state and depth.
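To make the objective concrete, here is a minimal sketch of the quantity being optimized: the discounted return of a fixed action sequence rolled out open-loop from a start state. The `env_step` generative model is a hypothetical interface I'm assuming for illustration, not the actual code base.

```python
def open_loop_return(env_step, s0, actions, gamma=0.9):
    """Discounted return of executing a fixed action sequence from s0.

    env_step is an assumed generative model: (state, action) -> (next_state, reward).
    BOB treats the whole sequence `actions` as a single point in a
    high-dimensional action space and optimizes this function over it.
    """
    s, ret = s0, 0.0
    for d, a in enumerate(actions):
        s, r = env_step(s, a)
        ret += (gamma ** d) * r
    return ret
```

HOOT, by contrast, would run one such optimization per depth, each conditioned on the state reached so far.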

It's really just HOO optimizing a high-dimensional bandit problem. The only implementation difference between this and vanilla HOO is that instead of splitting each action dimension sequentially, the dimension to split is chosen at random, with the probability of each action in the sequence being divided proportional to its total potential contribution to the return.

If, for example, we were only working at a horizon of 2, the probabilities of splitting the first and second depths are γ/(γ+γ²) and γ²/(γ+γ²), respectively. If the MDP itself has a multi-dimensional action space, once a depth is chosen, one of the action dimensions at that depth is selected uniformly at random to be cut. Aside from this, it's just HOO.
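The splitting rule above can be sketched in a few lines. This is an illustrative reconstruction, not the actual implementation; the function and parameter names are my own.

```python
import random

def sample_split_coordinate(horizon, action_dims, gamma=0.9, rng=random):
    """Pick which coordinate of the action sequence to cut at a HOO node split.

    Depth d (1-indexed) is chosen with probability proportional to gamma**d,
    i.e. its maximum possible contribution to the discounted return; within
    the chosen depth, one action dimension is picked uniformly at random.
    """
    weights = [gamma ** d for d in range(1, horizon + 1)]
    u = rng.random() * sum(weights)
    depth = 1
    for w in weights:
        if u < w:
            break
        u -= w
        depth += 1
    dim = rng.randrange(action_dims)  # uniform over action dimensions at that depth
    return depth, dim
```

With `horizon=2` this reproduces the γ/(γ+γ²) and γ²/(γ+γ²) split probabilities from the example.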

One of the problems with HOOT is that it's pretty difficult to come up with bounds for it due to nonstationarity. Regardless, HOOT worked better than UCT in many cases; sparse sampling, as we know, is totally useless in practice, even though it is the only one with real guarantees.

I was concerned that BOB would end up like sparse sampling, which has bounds but isn't really practical. I figured there wouldn't be enough cutting in the action space at the first depth, leading to too much randomness in the action choices. To my surprise, it actually seems to work quite nicely.

I ran BOB, HOOT, and UCT in the swimmer domain with γ = 0.9 (so this differs from the experiment in the workshop paper, which used γ = 0.95, and the difference seems to matter). In this setting UCT was just as bad as random in everything tested, HOOT's performance was better than random, and BOB did the best. I've renamed BOB to HOLOP (hierarchical open-loop optimistic planning, a mix of HOO and OLOP).

(The plot looks a little noisy because fewer data points were used than for the workshop paper, as it takes a while to run.)

So that is quite positive. I'd like to run more domains and collect more data points so the plots are nicer, but there is a pretty clear separation between BOB and HOOT.

Since this is just a planning algorithm, I was thinking of extending it into a learning algorithm. Basically, my thought is that we can do rollouts with BOB on estimates from our learned model, and use MRE to drive exploration during the rollouts.

This is nice for a few reasons:

- It should have good exploration.
- The algorithm operates entirely online (as long as the model building is done online).
- It would be a continuous state-action learner (which doesn't really exist yet).
- It doesn't do regression on the Q-function, which is provably difficult in some cases as well as practically tough.
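The learning-loop idea above could be sketched as follows. This is only a hypothetical outline under my stated assumptions: `model.predict`, `planner`, and `mre_bonus` are placeholder interfaces I'm inventing for illustration (a learned generative model, a BOB/HOLOP-style sequence optimizer, and an MRE-style exploration bonus), not anything that exists yet.

```python
def act(state, model, planner, mre_bonus, horizon=10, gamma=0.9):
    """One decision step of the hypothetical learner (illustrative only).

    planner(return_fn, horizon) would run BOB/HOLOP on the given
    sequence-return function; mre_bonus adds optimistic reward where the
    learned model is poorly known, driving exploration during rollouts.
    """
    def sequence_return(actions):
        s, ret = state, 0.0
        for d, a in enumerate(actions):
            bonus = mre_bonus(s, a)      # optimism in poorly-modeled regions
            s, r = model.predict(s, a)   # rollout through the learned model
            ret += (gamma ** d) * (r + bonus)
        return ret

    best_sequence = planner(sequence_return, horizon)
    return best_sequence[0]  # receding horizon: execute only the first action
```

After executing the action in the real environment, the observed transition would update the model, and planning would repeat from the new state.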

Since BOB is a trivial extension of HOO and doesn't break any of its assumptions, all the guarantees for HOO hold for BOB: it will converge to the optimal policy with respect to the model it is given. Combining this with the simulation lemma (which the MRE paper also discusses), we should be able to get good formal guarantees for the full algorithm.