Action Elimination and Stopping Conditions for the Multi-Armed Bandit and RL Problems

By Eyal Even-Dar, Shie Mannor, Yishay Mansour

Journal of Machine Learning Research, 2006

- Looks at PAC-style (instead of regret-style) bounds for bandit problems (and, I think, the sequential/RL setting as well)
- A matching lower bound was given in Mannor and Tsitsiklis (2004), so the upper bound in this paper is tight
- Since the goal is PAC bounds, the algorithms will probably have distinct explore and exploit phases
- Builds on action elimination (AE) algorithms (MacQueen, 1966), which (hopefully quickly) discard (state, action) pairs from consideration in a policy once they can be determined to be suboptimal
- Suggested stopping condition is based on the difference between upper and lower bounds on the estimated value of an arm: if the difference is small, the greedy policy w.r.t. the *lower* estimate is near-optimal
- The naive PAC bandit algorithm uses O((n/ε²) log(n/δ)) samples
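As a concrete reference point, the naive algorithm just pulls every arm the same number of times and returns the empirical best. A minimal sketch of that idea (the Bernoulli arms and the exact constants here are illustrative, not from the paper):

```python
import math
import numpy as np

def naive_pac(arm_means, eps, delta, rng):
    """Pull every arm the same number of times; the empirical best
    is eps-optimal with probability at least 1 - delta."""
    n = len(arm_means)
    m = math.ceil((4 / eps**2) * math.log(2 * n / delta))
    # Simulate m Bernoulli pulls per arm and take empirical means.
    means = np.array([rng.binomial(m, p) / m for p in arm_means])
    return int(np.argmax(means)), n * m  # chosen arm, total pulls

rng = np.random.default_rng(0)
arm, total = naive_pac([0.5, 0.9, 0.6], eps=0.1, delta=0.05, rng=rng)
```

The point of the rest of the paper is that the n·m total here is wasteful: it pays the full (1/ε²) ln(n/δ) price even for arms that are obviously bad early on.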
- Successive Elimination:
- Assumes (at least) the optimal expected reward is known, but doesn't really seem to use this information; it mostly serves the notation
- Defines Δ_i = p_1 − p_i > 0, where p_1 is the highest expected reward over all arms
- Goal is to sample arm i O((1/Δ_i²) ln(n/δ)) times and then eliminate it
- All arms are initially sampled the above number of times, and one is eliminated
- On the i-th elimination step, each remaining arm is sampled O((1/Δ_{n−i}² − 1/Δ_{n−i+1}²) ln(n/δ)) additional times and the empirically worst is eliminated

- This algorithm is PAC and has a bound of:
- O(ln(n/δ) ∑_{i=2}^{n} 1/Δ_i²)

- Next, a successive elimination algorithm for unknown biases is given; it has two bounds, one PAC and one a stopping-time bound, with only trivial differences
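A minimal sketch of the unknown-biases variant (the constants and the Hoeffding-style confidence radius are the usual textbook form, not necessarily the paper's exact ones): sample every surviving arm once per round, and drop an arm as soon as its confidence interval falls clearly below the empirical leader's.

```python
import math
import numpy as np

def successive_elimination(arm_means, delta, rng, max_rounds=20000):
    """Sample every surviving arm once per round; drop an arm when its
    empirical mean falls 2*radius below the empirical leader."""
    n = len(arm_means)
    alive = list(range(n))
    sums = np.zeros(n)
    t = 0
    while len(alive) > 1 and t < max_rounds:
        t += 1
        for i in alive:
            sums[i] += rng.binomial(1, arm_means[i])
        means = sums / t
        # Hoeffding-style radius; the t^2 factor union-bounds over rounds.
        radius = math.sqrt(math.log(4 * n * t * t / delta) / (2 * t))
        leader = max(means[i] for i in alive)
        alive = [i for i in alive if means[i] + 2 * radius > leader]
    return max(alive, key=lambda i: sums[i])

rng = np.random.default_rng(1)
winner = successive_elimination([0.9, 0.5, 0.4], delta=0.1, rng=rng)
```

Note how each arm i effectively stops being sampled after roughly (1/Δ_i²) ln(·) rounds, which is where the ∑ 1/Δ_i² bound above comes from.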
- The next algorithm uses Median Elimination, which removes the worst half of the arms at each iteration. The arm chosen doesn't have to be the best, just ε-optimal
- It is PAC and has a bound of O((n/ε²) log(1/δ))
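A minimal sketch of the median-elimination idea (the ε/δ schedule follows the standard presentation; the Bernoulli arms are illustrative):

```python
import math
import numpy as np

def median_elimination(arm_means, eps, delta, rng):
    """Halve the arm set each round; the survivor is eps-optimal
    with probability at least 1 - delta."""
    alive = list(range(len(arm_means)))
    eps_l, delta_l = eps / 4, delta / 2
    while len(alive) > 1:
        # Fresh batch of pulls for every surviving arm this round.
        m = math.ceil(math.log(3 / delta_l) / (eps_l / 2) ** 2)
        means = {i: rng.binomial(m, arm_means[i]) / m for i in alive}
        alive.sort(key=lambda i: means[i], reverse=True)
        alive = alive[: (len(alive) + 1) // 2]  # keep the better half
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2
    return alive[0]

rng = np.random.default_rng(2)
best = median_elimination([0.9, 0.6, 0.55, 0.5], eps=0.2, delta=0.1, rng=rng)
```

Because the per-round ε_l shrinks geometrically while the arm count halves, the total work telescopes to O((n/ε²) log(1/δ)), with no ln n factor.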

- Bounds are given for MDPs in the finite-horizon undiscounted case, but I'm skipping that for now because I generally don't care about that case
- In the infinite-horizon discounted case:
- Upper and lower estimates of the Q/V values are maintained, and these are shown to be valid bounds

- A model-based AE algorithm is given and is shown to be PAC (it has a stopping criterion)
- A model-free algorithm based on Q-learning is presented, but it only seems to have a proof that holds in the limit
- A Q-learning algorithm that runs a bandit at each (state, action) cell is presented (phased Q-learning)
- A fairly complex sample complexity is given for it, but it also looks PAC
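A minimal sketch of the phased Q-learning idea on a hypothetical two-state, two-action MDP with a generative model (the MDP, sample sizes, and phase count below are illustrative, not from the paper): each phase re-estimates every (s, a) backup from a fresh batch of next-state samples, instead of a single-sample stochastic update.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative, not from the paper).
R = np.array([[1.0, 0.0],   # r(s=0, a=0), r(s=0, a=1)
              [0.0, 3.0]])  # r(s=1, a=0), r(s=1, a=1)
P = np.array([[[0.9, 0.1], [0.1, 0.9]],   # P(s' | s=0, a)
              [[0.8, 0.2], [0.2, 0.8]]])  # P(s' | s=1, a)
GAMMA = 0.9

def phased_q(n_phases=100, m=5000, rng=np.random.default_rng(3)):
    """Phased Q-learning: each phase backs up every (s, a) using the
    average of m fresh next-state samples from the generative model."""
    Q = np.zeros((2, 2))
    for _ in range(n_phases):
        V = Q.max(axis=1)  # greedy value per state under current Q
        new_Q = np.empty_like(Q)
        for s in range(2):
            for a in range(2):
                nexts = rng.choice(2, size=m, p=P[s, a])
                new_Q[s, a] = R[s, a] + GAMMA * V[nexts].mean()
        Q = new_Q
    return Q

Q = phased_q()
policy = Q.argmax(axis=1)  # greedy policy from the learned Q
```

With m large enough, each phase is close to one exact value-iteration sweep, which is what makes the PAC-style analysis tractable compared to plain Q-learning.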

- Empirical results compare phased Q-learning vs. regular Q-learning; phased seems to be about twice as efficient in terms of samples for the same performance