- Deals with multi-armed bandit setting with adaptive opponents (explicitly mentions HIV as an example)
- Regret is measured w.r.t. best history-dependent strategy
- Reward functions are allowed to be arbitrary functions of the history
- The discuss operating under a class of constraints on the opponent (must be too difficult without). Constraint can be unknown?
- There are some existing algorithms, but they are intractable or nonimplementable
- They say the model here is a special case of “sleeping bandits”… whats that?
- Talk about equivalence classes of histories, the method can express anything from fully adversarial to vanilla bandits
- When only one model is considered the regret is sqrt-ish. The main contribution is a tractable algorithm that has sub-linear regret. Comparisons to optimal hist0ry-based strategy
- They call their history-based regret phi-regret. Phi is the equivalence relation on histories
- Phi-regret is always at least as big as vanilla regret. When there is only one equivalence class they are equal
- Stochastic phi-regret is minimized by using vanilla regret algorithms for each history class
- Resulting algorithms are called phi-ucb and phi-exp3
- “Thus we show that there always exist an environment such that whatever the strategy of the player is, a particular opponent will lead to visit all history-classes uniformly in expectation.”
- Players can depend on phi, opponent can depend on phi and agent. Analysis is of worst (maximal regret?) opponent against the best agent of worst phi
- Consider case where opponent can use a number of equivalence relations defined by multiple class functions, which is the phi_theta regret
- They say this case in particular is related to sleeping bandits
- Basic idea is to use multiple experts, but the performance for each expert is worse than when they are used alone because they may not get the data they want
- Point out regret term with theta is log(theta)^(1/2), and is because all models give rewards which provide information about all models, so the learning is faster than might be initially expected

## Adaptive Bandits: Towards the best history-dependent strategy. Odalric-Ambrym Maillard, Remi Munos (JMLR 2011)

**Tagged**ensemble

“The discuss operating under a class of constraints on the opponent (must be too difficult without). Constraint can be unknown?”: It might also be that unconstrained opponents are too powerful to be able to say anything useful.

“They say the model here is a special case of “sleeping bandits”… whats that?”: From a quick search, it’s the case where there are k arms, but not all arms available in each round. Some are “asleep”.