- Considers case of simple regret, where exploration and exploitation are separated
- In the setting, the agent is allowed
*n*pulls of arms and at the end must recommend an arm. The agent is not necessarily aware of the value of*n*. - Costs during exploration are not in terms of regret, but some other resource, such as computation time
- May also require anytime performance
- The smaller the cumulative regret, the larger the simple regret
- Related to exploration/exploitation tradeoff
- Simple regret can be upper bounded in terms of cumulative regret
- For finite-armed bandits, best simple regret is achieved by sampling each arm a linear number of times, whereas in cumulative regret suboptimal arms should not be sampled more than a logarithmic number of times

- The statement that each arm should get linear number of pulls for best simple regret is true only in case of large
*n*. For smaller values other techniques will yield better results (UCB type is better) - Rate of decrease of simple regret is 1/sqrt(n) in distribution free case, in distribution dependent case the decrease is exponential. I’m not sure what the difference is here though.
- In continuous spaces the regret bound seems to be O(
*n*) which i dont understand