## Stochastic Setting: Bandit geometric upper bound estimator implemented, compared to HOO

In my previous post the geometric approach was empirically compared to HOO in the deterministic setting.  The next step was to adapt the geometric approach to the stochastic setting.  The change is quite simple: whereas in the deterministic setting it is only necessary to sample any arm once, in the deterministic setting that exact arm must be sampled repeatedly.  Once that has been done a certain number of times (call that m), it is possible to estimate a mean along with confidence intervals.  Here, the upper value on a 95% confidence interval was used.  This value is simply used in place of the actual value that would have been observed in the deterministic setting.

HOO is already designed for the stochastic setting, and works out of the box.

I would show graphs like I did in previous posts of how the sampling actually looks, but it appears quite similar to the deterministic setting.

Anyway, the important bit is the comparison of the regret curves.  This domain was very similar to the one used in the last (deterministic) experiment, except this time noise is corrupted by white noise with a standard deviation of 0.5, which is quite high considering the function only ranged from about 8 to -10:

Regret of geometric approach compared to HOO in stochastic setting

Although the time scale of where the crossover appears is much larger than that in the deterministic setting, it still occurs, and the general appearance of the graph is the same.

## 2 thoughts on “Stochastic Setting: Bandit geometric upper bound estimator implemented, compared to HOO”

1. Michael Littman says:

using GPs or something similar, it should be possible to get confidence intervals without sampling each point a bunch of times. linear regression with upper confidence intervals (a la Istvan) would also be interesting to compare.

one thing missing from these reports—your takeaway message. what do the results mean?

2. hut3 says:

That (using GPs) is a good point I didn’t think about, but with GPs there is the additional issue of knowing which length-scale parameter to use, without it we will get incorrect bounds. I’m not sure if there is any literature that discusses what length-scale to use if you have a known Lipshitz constant, but that could be great.

Looking into Istvan’s (sorry if you are reading this, I don’t know how to write the accents ;) also sounds like a good idea. I was wondering if it would be possible to work this into KWIK somehow.

As far as the results, both you and Ali pointed out the shapes of the curves represent the characteristic differences between regret-bounded algorithms and PAC-bounded algorithms. I think I know which one I would prefer to use.

Although the HOO algorithm seems to be outperformed in this domain, it is more general as it doesn’t assume global smoothness and works in multiple dimensions. Getting similar guarantees for this approach may be possible, but the math looks like it is going to get horrible.

I have to admit, I was surprised the performance would be this good in the stochastic case, I was expecting the geometric approach to be superior in the deterministic setting and HOO to be superior in stochastic settings, as the algorithm by nature seems like it would be more resilient to noisy signals.

The fact that it works this well (also happens to run quickly) gives me hope that it can be generalized to MDPs in the future. Although I’m not sure how the proofs will look there, it does seem to be an effective approach.