Pure Exploration in Finitely-Armed and Continuous-Armed Bandits: Bubeck, Munos, Stoltz (Arxiv version)

Considers case of simple regret, where exploration and exploitation are separated

In the setting, the agent is allowed n pulls of arms and at the end must recommend an arm. The agent is not necessarily aware of the value of n.

Costs during exploration are not in terms of regret, but some other resource, such as computation time

May also require anytime performance

The smaller the cumulative regret, the larger the simple regret

Related to exploration/exploitation tradeoff

Simple regret can be upper bounded in terms of cumulative regret

For finite-armed bandits, best simple regret is achieved by sampling each arm a linear number of times, whereas in cumulative regret suboptimal arms should not be sampled more than a logarithmic number of times

The statement that each arm should get linear number of pulls for best simple regret is true only in case of large n. For smaller values other techniques will yield better results (UCB type is better)

Rate of decrease of simple regret is 1/sqrt(n) in distribution free case, in distribution dependent case the decrease is exponential. I’m not sure what the difference is here though.

In continuous spaces the regret bound seems to be O(n) which i dont understand

Privacy & Cookies: This site uses cookies. By continuing to use this website, you agree to their use.
To find out more, including how to control cookies, see here:
Cookie Policy