- Basically about contextual bandits
- Introduce a replay methodology for contextual bandit algorithm evaluation.
- Totally data driven (another method is to construct a simulator)
- Method is provably unbiased
- Motivation is similar to what we did in Pitfall. It can be costly to run each potential method in real life, so the goal is to generate enough data that we can effectively simulate each method from the logged data
- “…data in bandit-style applications only contain user feedback for recommendations that were actually displayed to the user, but not all candidates. This ‘partial-label’ nature raises a difficulty that is the key difference between evaluation of bandit algorithms and supervised learning ones.”
- They treat RL from batch data as an “off-policy evaluation problem”
- If I read this again skip everything before section 3.
- I don’t get why, but the claim is that each history has an identical probability in the real world as in the policy evaluator. Therefore, the evaluator’s reward estimate is unbiased.
- Basically the process is:
- Go through data corpus sample by sample
- For a logged sample <s_i,a_i,r_i>, give the algorithm the context/state, s_i
- The algorithm returns desired action a
- If a_i = a, give the sample to the algorithm to update its history; else just ignore this sample and move on
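The steps above can be sketched as a small replay loop. This is a minimal sketch, not the paper's code; the `choose`/`update` policy interface and all names are my own assumptions:

```python
import random

def replay_evaluate(policy, logged_data):
    """Replay evaluator sketch for logged (context, action, reward) triples.

    Assumes the logged actions were chosen uniformly at random, which is
    what makes the matched subsample unbiased.
    """
    total_reward = 0.0
    n_matched = 0
    for s_i, a_i, r_i in logged_data:
        a = policy.choose(s_i)            # algorithm picks its desired action
        if a == a_i:                      # only keep samples where actions match
            policy.update(s_i, a, r_i)    # feed the sample into the history
            total_reward += r_i
            n_matched += 1
        # else: discard this sample and move on
    return total_reward / max(n_matched, 1)  # average reward over matches

class RandomPolicy:
    """Trivial baseline policy for illustration."""
    def __init__(self, n_arms):
        self.n_arms = n_arms
    def choose(self, context):
        return random.randrange(self.n_arms)
    def update(self, context, arm, reward):
        pass  # a real bandit algorithm would update its model here
```

Note that with K arms and uniform logging, roughly 1/K of the samples match and get used, which is why the error bound below depends on K.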
- Estimation error approaches 0 with increasing data.
- Error is O(sqrt(K/|D|)), where K is the number of arms and |D| is the size of the data set; drops off quickly
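To get a feel for the rate, here is the bound evaluated numerically (the constant factor `c` is a placeholder of my own, not from the paper):

```python
import math

def replay_error_bound(K, D_size, c=1.0):
    """O(sqrt(K/|D|)) error bound, up to an unspecified constant c."""
    return c * math.sqrt(K / D_size)

# With K = 10 arms, a 100x bigger log shrinks the bound 10x:
# replay_error_bound(10, 100)    ~ 0.316
# replay_error_bound(10, 10000)  ~ 0.0316
```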
- Empirical results on a corpus of Yahoo! click-through logs; actual results fit very nicely with the bounds