Parallelizing Exploration-Exploitation Tradeoffs with Gaussian Process Bandit Optimization. Desautels, Krause, Burdick. ICML 2012.
- Another paper that I didn’t read very closely, but here are the basic ideas
- Basic idea is dealing with bandits where you have to select a whole batch of arms and only find out the rewards after the entire batch has been chosen (batch instead of single-arm selection)
- An algorithm called GP-BUCB is developed, whose regret increases only by a constant factor independent of the batch size B
- Like vanilla GP-UCB, this algorithm allows the arm space to be continuous, which is then discretized for the max operation
- The trick for the algorithm leverages a property of GPs: the reduction in posterior variance depends only on where samples were taken, and not on the actual observed values of those samples.
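A small numpy sketch of that GP property (my own toy example, not code from the paper; the RBF kernel, lengthscale, and noise level are all assumptions): the posterior variance formula never touches the observations y, only the input locations X.

```python
import numpy as np

def rbf(a, b, ls=1.0):
    # Squared-exponential kernel matrix between 1-D input arrays a and b.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def posterior_var(X, x_test, noise=0.1):
    # sigma^2(x) = k(x,x) - k(x,X) (K + noise*I)^-1 k(X,x)
    # Note: y never appears anywhere in this computation.
    K = rbf(X, X) + noise * np.eye(len(X))
    k_star = rbf(x_test, X)
    return 1.0 - np.einsum('ij,jk,ik->i', k_star, np.linalg.inv(K), k_star)

X = np.array([0.0, 0.5, 1.0])   # sample locations are all we need
x_test = np.array([0.25, 0.75])
v = posterior_var(X, x_test)    # variance shrinks near sampled points
print(v)
```

This is what lets the algorithm know exactly how much uncertainty each pending query will remove, before the query's result ever arrives.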
- Based on this, the algorithm can compute what the confidence bounds will be once the pending samples come back, and the next selection in the batch is made from those updated bounds.
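A minimal sketch of how that batch selection can work (my own illustrative code under assumed settings; the grid, kernel, and beta value are not from the paper): each selected arm gets a "hallucinated" observation equal to the current posterior mean, which leaves the mean unchanged but shrinks the variance, so the UCB rule can pick the next batch member before any real feedback arrives.

```python
import numpy as np

def rbf(a, b, ls=0.2):
    # Squared-exponential kernel matrix between 1-D input arrays a and b.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(X, y, grid, noise=1e-3):
    # Standard GP posterior mean and variance over a grid of candidate arms.
    K = rbf(X, X) + noise * np.eye(len(X))
    Kinv = np.linalg.inv(K)
    k_star = rbf(grid, X)
    mu = k_star @ Kinv @ y
    var = 1.0 - np.einsum('ij,jk,ik->i', k_star, Kinv, k_star)
    return mu, var

def select_batch(X, y, grid, B, beta=2.0):
    # Pick B arms before observing any new rewards: after each pick, pretend
    # we observed the current posterior mean there (hallucination), which
    # updates the variance (it depends only on inputs) but not the mean.
    Xb, yb = list(X), list(y)
    batch = []
    for _ in range(B):
        mu, var = gp_posterior(np.array(Xb), np.array(yb), grid)
        ucb = mu + beta * np.sqrt(np.maximum(var, 0.0))
        idx = int(np.argmax(ucb))
        batch.append(grid[idx])
        Xb.append(grid[idx])
        yb.append(mu[idx])  # hallucinated observation = posterior mean
    return batch

X0 = np.array([0.0, 1.0])
y0 = np.array([0.0, 1.0])
grid = np.linspace(0.0, 1.0, 21)
batch = select_batch(X0, y0, grid, B=3)
print(batch)  # three arms chosen with no new rewards observed
```

Because the hallucinated value equals the posterior mean, conditioning on it leaves the mean surface untouched; only the uncertainty contracts, which is exactly why later batch members avoid piling onto the same arm.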
- Computing the posterior variance of the function dominates the running time
- They have some math to make sure the confidence bounds remain valid during batch selection, so that the algorithm does not become “overconfident”
- Applications: some synthetic problems, automated vaccine design, and spinal cord therapy (neat)