## Ideas from: Action Elimination and Stopping Conditions for the Multi-Armed Bandit and RL Problems

Action Elimination and Stopping Conditions for the Multi-Armed Bandit and RL Problems

By Eyal Even-Dar, Shie Mannor, Yishay Mansour

Journal of ML 2006

1. Is interested at looking at PAC (instead of regret) style bounds for bandit problems (I think sequential bandit as well)
2. A lower bound was found in Mannor, Tsitsiklis (2004) and matches the upper bound in this paper (so a tight bound was found)
3. Since the goal is PAC bounds, algorithm will probably have particular exlplore, exploit phases
4. Builds on action elimination (AE) algorithms (MacQueen, 2006), which (hopefully quickly) discards (state,actions) from consideration in a policy if they can be determined to be suboptimal
5. Suggested stopping condition is based on difference between upper and lower bounds on the estimated value of an arm – if the difference is small, the greedy policy w.r.t. the lower estimate is near-optimal
6. The naive PAC bandit algorithm is O((n/ε²)log(n/δ))
7. Successive Elimination:
1. Assumes at least the optimal reward is known, but doesn’t really seem to use this information, just seems to do it for notation
2. For Δ_i = p_1 – p_i > 0 , where p_1 is the highest reward of all arms
3. Goal is to sample an arm (1/ (Δ_i)² ln (n/δ)) times and then eliminate it
4. All arms are initially sample the above amount of times, and one is eliminated
5. On the i-th successive step, each of the remaining arms are sampled O((1/(Δ_{n-i})²) – 1/(Δ_{n-i+1})²) ln(n/δ))) times and the worst is eliminated
1. This algorithm is PAC and has a bound of:
1. O(ln(n/δ) ∑_{i=2 to n} 1/ (Δ_i)²)
8. Next an algorithm for successive elimination for unknown biases is given and has 2 bounds, one is PAC, and the other is a stopping bound, with only trivial differences
9. The next algorithm uses Median Elimination, and removes the worst half of the arms at each iteration.  The arm chosen doesnt have to be the best, just ε-optimal
1. It is PAC and has a bound of O(n/ε² log1/δ))
10. Bounds are given for MDPs in finite-horizon undiscounted case, but I’m skipping that now because I generally dont care about that case
11. In infinite horizon discounted case:
1. There is an estimation of upper and lower Q/V values, which is shown to be correct
12. A model-based AE is given when is shown to be PAC (has a stopping criteria)
13. A model-free algorithm based on Q-learning is presented, but only seems to have a proof that holds at the limit.
14. A Q-learning algorithm that has a bandit at each cell is presented (phased Q-L)
1. A quite complex sample complexity is given for it, but it also looks PAC
15. Empirical results of phased Q-L vs regular Q-L, seems to be about twice as efficient in terms of samples/performance