Adaptive Bandits: Towards the best history-dependent strategy. Odalric-Ambrym Maillard, Remi Munos (JMLR 2011)


  1. Deals with the multi-armed bandit setting with adaptive opponents (HIV is explicitly mentioned as an example)
  2. Regret is measured w.r.t. best history-dependent strategy
  3. Reward functions are allowed to be arbitrary functions of the history
  4. They discuss operating under a class of constraints on the opponent (the problem would be too difficult without them).  Can the constraint be unknown?
  5. There are some existing algorithms, but they are intractable or not implementable
  6. They say the model here is a special case of “sleeping bandits”… what’s that?
  7. They talk about equivalence classes of histories; the method can express anything from the fully adversarial setting to vanilla bandits
  8. When only one model is considered, the regret is sqrt-ish.  The main contribution is a tractable algorithm that has sub-linear regret.  Comparisons are made to the optimal history-based strategy
  9. They call their history-based regret the phi-regret; phi is the equivalence relation on histories (a rough sketch of the definition appears after these notes)
    1. Phi-regret is always at least as big as vanilla regret.  When there is only one equivalence class they are equal
  10. Stochastic phi-regret is minimized by running a vanilla bandit algorithm within each history class (see the per-class sketch after these notes)
    1. The resulting algorithms are called phi-UCB and phi-EXP3
  11. “Thus we show that there always exist an environment such that whatever the strategy of the player is, a particular opponent will lead to visit all history-classes uniformly in expectation.”
  12. The player’s strategy can depend on phi; the opponent can depend on phi and on the agent.  The analysis is of the worst (maximal-regret?) opponent against the best agent, for the worst phi
  13. They consider the case where the opponent can use one of a number of equivalence relations defined by multiple class functions; this gives the phi_theta regret
    1. They say this case in particular is related to sleeping bandits
  14. The basic idea is to use multiple experts, but the performance of each expert is worse than when it is used alone, because it may not get the data it wants (a meta-level sketch appears after these notes)
  15. They point out that the regret term coming from the number of models is of order log(Theta)^(1/2); this is because the reward observed at each round provides information about every model, so learning across models is faster than might initially be expected
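
A rough sketch of the phi-regret definition as I read it (my notation, not the paper’s exact statement): the comparison class is the set of strategies that commit to one arm per history class.

```latex
% Hedged sketch of the phi-regret (notation mine). \Phi maps each history to an
% equivalence class in \mathcal{C}; the benchmark plays the best fixed arm per class.
\[
  R_T^{\Phi} \;=\;
  \max_{f : \mathcal{C} \to \{1,\dots,K\}}
  \mathbb{E}\Big[ \sum_{t=1}^{T} r_t\big(f(\Phi(h_t))\big) \Big]
  \;-\;
  \mathbb{E}\Big[ \sum_{t=1}^{T} r_t(a_t) \Big]
\]
% where h_t is the history before round t, a_t is the player's arm, and r_t is
% the opponent's (history-dependent) reward function. With a single class this
% reduces to ordinary regret against the best fixed arm, matching note 9.1.
```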
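A minimal sketch of the per-class idea behind phi-UCB / phi-EXP3 as I understand it (notes 10 and 10.1): keep one vanilla bandit learner per history class, and let the learner for the current class pick the arm and receive the reward. The `Exp3` class and the `phi` / `rewards` callables below are my own illustrative names, not the paper’s code.

```python
import math
import random
from collections import defaultdict


class Exp3:
    """Vanilla EXP3 for n_arms arms (standard textbook form, not the paper's code)."""

    def __init__(self, n_arms, gamma=0.1):
        self.n_arms = n_arms
        self.gamma = gamma
        self.weights = [1.0] * n_arms

    def arm_probs(self):
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.n_arms
                for w in self.weights]

    def select(self):
        probs = self.arm_probs()
        arm = random.choices(range(self.n_arms), weights=probs)[0]
        return arm, probs

    def update(self, arm, reward, probs):
        # Importance-weighted estimate: only the pulled arm's reward is observed.
        est = reward / probs[arm]
        self.weights[arm] *= math.exp(self.gamma * est / self.n_arms)


def run_phi_bandit(phi, rewards, n_arms, horizon):
    """One EXP3 instance per history class defined by phi (a hedged sketch).

    phi(history) -> hashable class label; rewards(t, history, arm) -> reward in [0, 1].
    Both are hypothetical callables supplied by the caller.
    """
    per_class = defaultdict(lambda: Exp3(n_arms))
    history, total = [], 0.0
    for t in range(horizon):
        cls = phi(history)                    # which history class are we in?
        arm, probs = per_class[cls].select()  # the class's own learner picks the arm
        r = rewards(t, history, arm)          # opponent's history-dependent reward
        per_class[cls].update(arm, r, probs)
        history.append(arm)
        total += r
    return total
```

With a single class this is just plain EXP3 (the vanilla bandit extreme); with one class per full history every class is visited at most once, which corresponds to the fully adversarial extreme in note 7.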
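For the multi-model case (notes 13–15), here is a generic exponential-weights-over-class-functions sketch, again my own illustration rather than the paper’s exact algorithm; it reuses `Exp3` and the imports from the previous sketch. Each candidate phi_theta keeps its own bank of per-class learners, a meta level mixes their arm distributions, and the single observed reward is importance-weighted back into every model, which is the intuition behind note 15.

```python
def run_multi_phi_bandit(phis, rewards, n_arms, horizon, eta=0.1):
    """Hedged sketch: exponential weights over candidate class functions.

    `phis` is a dict {theta: phi_theta}; `rewards(t, history, arm)` is a
    hypothetical callable standing in for the opponent. Not the paper's algorithm.
    """
    banks = {th: defaultdict(lambda: Exp3(n_arms)) for th in phis}
    meta = {th: 1.0 for th in phis}          # one weight per candidate model
    history = []
    for t in range(horizon):
        # Each model proposes an arm distribution for its current history class.
        classes = {th: phi(history) for th, phi in phis.items()}
        dists = {th: banks[th][classes[th]].arm_probs() for th in phis}
        # Mix the proposals with the (normalized) meta weights and sample an arm.
        z = sum(meta.values())
        mixed = [sum(meta[th] * dists[th][a] for th in phis) / z
                 for a in range(n_arms)]
        arm = random.choices(range(n_arms), weights=mixed)[0]
        r = rewards(t, history, arm)
        # The one observed reward updates every model (importance-weighted by the
        # actual sampling probability) and the meta weights, EXP4-style.
        for th in phis:
            banks[th][classes[th]].update(arm, r, mixed)
            meta[th] *= math.exp(eta * dists[th][arm] * r / mixed[arm])
        history.append(arm)
    return meta
```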

One thought on “Adaptive Bandits: Towards the best history-dependent strategy. Odalric-Ambrym Maillard, Remi Munos (JMLR 2011)”

  1. Michael Littman says:

    “They discuss operating under a class of constraints on the opponent (the problem would be too difficult without them). Can the constraint be unknown?”: It might also be that unconstrained opponents are too powerful to be able to say anything useful.

    “They say the model here is a special case of “sleeping bandits”… what’s that?”: From a quick search, it’s the case where there are k arms, but not all arms are available in each round. Some are “asleep”.
