Online Optimization in XArmed Bandits; Sébastien Bubeck, Rémi Munos, Gilles Stoltz, Csaba Szepesvári (NIPS 2008):
 Only uses local properties of the mean payoff around the maxima; doesn’t require a global property. Mean payoff is locally Holder with known exponent in the neighborhood of any maxima (for a finite number of maxima).
 Regret scales as O~(sqrt(n)), when the space is the unit hypercube, claim rate of growth of regret is independent of dimensionality of input space.
 Claim pretty tight upper and lower bounds (same up to logarithmic factors) in terms of minimax, so claim algorithm cannot be improved significantly.
 Assume support on all arm are uniformly bounded by some known value (assume 1).
 A pulling strategy consists of a mapping of sequences of pulls to the space of probability measures over arms, can be thought of as a mapping of histories (sequence of pulls and their rewards x_i, y_i) to the next action/pull.
 Regret of a history of length n is R*n – ∑_i=0 to n y_i, can also talk about the pseudo regret, where the true value of the function (or its expectation) is used as opposed to the value returned by a pull.
 Basic algorithm works by a tree decomposition. The algorithm arbitrarily picks the child of some node which was previously picked that node gets added to the list of pulled nodes, its children get added to the queue of unpulled nodes and so on.
 They specify the choosing of a node to pull by keeping track of what are called Bvalues. The node to be selected to be pulled is done by starting at the root and following the branch to the child that maximizes the B score recursively.
 Each node is pulled at most once, but results are aggregated up the tree. This allows for mean, count, and confidence bounds to be computed for each node (although it seems these bounds will be extremely loose for the leaves, tighter further up).
 There are two parameters that must be specified in order to compute the bounds, these relate to the scale of the similarity metric.
 Nodes that haven’t been pulled at all have a confidence bound of infinity.
 B(node) = min{upper confidence bound on self, max{B values of children}}
 If axisaligned hyper rectangles are used, proposal is to do split along longest axis of the rectangle, but their math is very general and doesn’t assume this.

Assume mean payoff function is weakly Lipshitz, which I think means that constraints from a point only exist in its ball (seems like this would have global implications anyway though in reasonable spaces).
 The suboptimality of a node yields a bound on the suboptimality of its children (if I’m reading that right)
 Not going into proof on regret here.
 Hey, we could do a random decision tree version of this?
 Something about a near optimality dimension that I’m not getting right now, need to finish again more thoroughly.
Nearly Tight Bounds for the ContinuumArmed Bandit Problem; Kleinberg (NIPS 2004):
 “For a ddimensional strategy space, any multiarmed bandit algorithm must suffer regret depending exponentially on d unless the cost functions are further constrained. (This is demonstrated by a simple counterexample in which the cost function is identically zero in all but one orthant of Rd, takes a negative value somewhere in that orthant, and does not vary over time.)”
 This statement conflicts with the claims of the Xarmed paper
 Achieves regret depending polynomially on d and sublinearly on n.
 Considers uniformly Lipshitz functions.
 Also uses the adversarial approach, which seems the way to go (but also mentions randomized model of cost).
 Describe a randomized bandit algorithm as distribution over deterministic bandit algorithms
 Transforms the continuous space into a discrete space.
 I don’t understand how this proof is possible because the guarantees don’t depend on the resolution of the discretization?
 A proposal for higher dimensional cases is a modification of Zinkevich’s greedy projection algorithm (never heard of this), which provides a an online convex optimization algorithm.
MultiArmed Bandits in Metric Spaces; Robert Kleinberg, Aleksandrs Slivkins, Eli Upfal (STOC 2008):
 Also makes claim about cost being exponential in dimension.
 Says that in some cases (restricted number of samples) you can get the best result simply by picking a discretization resolution and going with that, but further states himself that this is slightly suspicious.
 Introduces the zooming algorithm, which combines the upper confidence bound technique (used in UCB1) with a adaptive refinement step that uses history to focus in on regions near the possible maxima of reward and to explore a denser mesh of strategies in these regions.
 Discuss the covering dimension here (I think its discussed in the Xbandit paper but is less clear). A metric space S has a covering dimension of d if it can be covered by O(δ^d) sets of diameter δ for all δ> 0.
 A point s in S has local covering dimension d if it has an open neighborhood of covering dimension d .
 A space S has maxmin covering of dimension d’ if it has no subspace whose local covering dimension is uniformly bounded below by a number greater than d’.
 The distinction between the two dimension measurements may only be very relevant in spaces they call “highly inhomogenous metric spaces,” If I follow.
 A bandit algorithm exists that satisfies Regret(t) = O(t^γ), for γ > (d’+1)/(d’+2), and does not exist for γ less than that.
 I think this is how the Xbandit paper gets away with a claim that it isn’t dependent on the dimension of the original problem; its dependent on the covering dimension.
 Define the zooming dimension d” as the smallest d” s.t. for all δ>0, only O(δ^d) sets of diameter δ/8 cover the set of strategies whose payoff falls short of the maximum between δ and 2δ.
 Make essentially the same claim with this dimension as with the minmax covering dimension, except γ = (d’‘+1)/(d”+1), whereas in minmax covering dimension it was γ > (d’+1)/(d’+2), so this is tight.
 There are cases where the differences between these two are claimed to be significant.
 Algorithm requires a covering oracle, which given a space and a set of points returns whether the space is covered, or reports where cover is required.
 The actual algorithm/proofs are too complex to outline in notes.
 Talk about the regret dimension of a metric space, in relation to the worstcase problem in that metric space, prooves it.
 Written (with citation) that the “needle in a haystack” problem is the most difficult instance of the bandit problem.
 The rest of the paper is pretty complex and I should give it another good read, may be missing stuff.
“This statement conflicts with the claims of the Xarmed paper”: Does it? In the notes, you gave one bound in terms of steps (n) and one in terms of dimension (d). Oh, then you say “its dependent on the covering dimension” (which is exponential in d in general), so maybe that resolves the issue.
Zinkevich looked at online convex optimization (which turns out to be very related to game theory and regret).
The conflict is that the proofs seem to say regret is bounded by (d’+1)/(d’+2) (in some dimension which may not be the original feature vector represenation). This bound is claimed tight, and is mentioned in the HOO paper, but then somehow disappears and the regret becomes just in the number of samples.
Chris and I spent time looking at that paper (and I think a longer version of it) and were unable to find the justification for its removal. I guess that term is always less than one, so its some root value, and just setting it to 1 makes an upper bound?