Adaptive Bandits: Towards the best history-dependent strategy. Odalric-Ambrym Maillard, Remi Munos (JMLR 2011)

Deals with multi-armed bandit setting with adaptive opponents (explicitly mentions HIV as an example)

Regret is measured w.r.t. best history-dependent strategy

Reward functions are allowed to be arbitrary functions of the history

The discuss operating under a class of constraints on the opponent (must be too difficult without). Constraint can be unknown?

There are some existing algorithms, but they are intractable or nonimplementable

They say the model here is a special case of “sleeping bandits”… whats that?

Talk about equivalence classes of histories, the method can express anything from fully adversarial to vanilla bandits

When only one model is considered the regret is sqrt-ish. The main contribution is a tractable algorithm that has sub-linear regret. Comparisons to optimal hist0ry-based strategy

They call their history-based regret phi-regret. Phi is the equivalence relation on histories

Phi-regret is always at least as big as vanilla regret. When there is only one equivalence class they are equal

Stochastic phi-regret is minimized by using vanilla regret algorithms for each history class

Resulting algorithms are called phi-ucb and phi-exp3

“Thus we show that there always exist an environment such that whatever the strategy of the player is, a particular opponent will lead to visit all history-classes uniformly in expectation.”

Players can depend on phi, opponent can depend on phi and agent. Analysis is of worst (maximal regret?) opponent against the best agent of worst phi

Consider case where opponent can use a number of equivalence relations defined by multiple class functions, which is the phi_theta regret

They say this case in particular is related to sleeping bandits

Basic idea is to use multiple experts, but the performance for each expert is worse than when they are used alone because they may not get the data they want

Point out regret term with theta is log(theta)^(1/2), and is because all models give rewards which provide information about all models, so the learning is faster than might be initially expected

“The discuss operating under a class of constraints on the opponent (must be too difficult without). Constraint can be unknown?”: It might also be that unconstrained opponents are too powerful to be able to say anything useful.

“They say the model here is a special case of “sleeping bandits”… whats that?”: From a quick search, it’s the case where there are k arms, but not all arms available in each round. Some are “asleep”.

“The discuss operating under a class of constraints on the opponent (must be too difficult without). Constraint can be unknown?”: It might also be that unconstrained opponents are too powerful to be able to say anything useful.

“They say the model here is a special case of “sleeping bandits”… whats that?”: From a quick search, it’s the case where there are k arms, but not all arms available in each round. Some are “asleep”.