- Deals with applying Q-learning to energy storage and retrieval from renewable energy sources.
- The Q-learning is over a discrete state space with 2 actions, but the value estimates have extremely high variance
- This variance means QL can fail to converge even after millions of iterations

- The max operator introduces bias when operating over noisy data. Aside from the citations I am familiar with on this topic, there are results from other fields as well. It looks like there is some research on how to correct the issue that I am not familiar with
- The introduced bias is particularly problematic when gamma ~= 1, which occurs in the cases they consider due to very short time steps (the need for high gamma/long look-ahead seems to be a recurring theme in work from this lab)
- They introduce a method to correct the bias introduced by the max operator
- They:
- illustrate the minimal conditions under which max-bias occurs in Q-learning
- produce an algorithm that prevents this from occurring
- show empirical results
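The max-induced bias is easy to reproduce in isolation. Below is a minimal sketch (my own toy setup, not from the paper): two actions have identical true value 0, each Q estimate is a sample mean of noisy rewards, and the max over the two estimates is positively biased. For two iid Gaussian estimates with standard deviation sigma, the expected max is sigma/sqrt(pi).

```python
import random

random.seed(0)

# Two actions with identical true value 0; rewards observed with Gaussian noise.
# The max over the noisy sample-mean estimates is positively biased:
# E[max(Qhat_1, Qhat_2)] > max(E[Qhat_1], E[Qhat_2]) = 0.
n_trials = 20000
n_samples = 10   # reward samples per action per trial
noise = 1.0      # reward noise std

total = 0.0
for _ in range(n_trials):
    q1 = sum(random.gauss(0.0, noise) for _ in range(n_samples)) / n_samples
    q2 = sum(random.gauss(0.0, noise) for _ in range(n_samples)) / n_samples
    total += max(q1, q2)

bias = total / n_trials
# Theory: roughly noise / sqrt(pi * n_samples) ~= 0.18 for these settings
print(f"estimated bias of max: {bias:.3f}")
```

Note that the bias shrinks as `n_samples` grows, which is why a lower effective step size (more averaging) reduces bias, as the notes below mention.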

- The Q-learning step size has a strong impact on bias: a large step size gives faster convergence when there is no noise, but a smaller step size reduces bias.
- To correct the bias, the normal QL update gains one additional term: the expectation of the bias
- (although what goes into computing that term is a bit complex)
- There is an assumption that the reward distribution is the same for all actions
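A rough sketch of what such an update could look like, based only on the description above; the paper's actual correction term is more involved, and `bias_corrected_update`, `sigma_r`, and `n_visits` are my own hypothetical names. The idea sketched here: estimate the std of each Q estimate from the (shared) reward std and the visit count, multiply by the expected max of that many standard normals, and subtract that from the next-state max.

```python
import math
import random

def expected_max_std_normal(m: int, n_draws: int = 10000) -> float:
    """Monte Carlo estimate of E[max of m iid standard normals]."""
    rng = random.Random(1)
    return sum(max(rng.gauss(0.0, 1.0) for _ in range(m))
               for _ in range(n_draws)) / n_draws

def bias_corrected_update(Q, s, a, r, s_next, actions, alpha, gamma, sigma_r, n_visits):
    """One bias-corrected Q-learning step (sketch, not the paper's exact formula).

    sigma_r: sample std of rewards (assumed identical across actions,
             matching the paper's assumption noted above);
    n_visits: visit count of (s, a), so sigma_r / sqrt(n_visits) approximates
              the std of each Q estimate entering the max.
    """
    m = len(actions)
    # Estimated expected bias of a max over m equally noisy Q estimates
    b = (sigma_r / math.sqrt(max(n_visits, 1))) * expected_max_std_normal(m)
    target = r + gamma * (max(Q[(s_next, ap)] for ap in actions) - b)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q

# Hypothetical usage on a toy 2-state, 2-action problem:
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
bias_corrected_update(Q, 0, 0, 1.0, 1, [0, 1],
                      alpha=0.5, gamma=0.99, sigma_r=1.0, n_visits=4)
print(Q[(0, 0)])  # lower than the uncorrected update would give (0.5)
```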

- “In most cases, the bias correction term overestimates the actual max-induced bias and puts negative bias on the stochastic sample. However, this is a much milder form of bias. In a maximization problem, positive bias takes many more iterations to be removed than negative bias because QL propagates positive bias due to the max operator over Q values”
- In empirical results, they run for 3 million samples, although most of the change occurs within the first 500,000. Bias-corrected QL still overestimates the value compared to truth (even though the estimate starts below truth), but the error is much smaller than vanilla QL's
- They compute ground truth by value iteration, with a transition matrix estimated by Monte Carlo sampling
- They also run a second, more complex experiment where computing the true value function is more difficult; they determine when QL converges, and then report the error of the converged states (I would also like to see the rate, or % of converged states, for each method). Although here too they only have an estimate of the true value function
- In the last experiment there is no comparison to the true value
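The ground-truth procedure described above (value iteration on a transition matrix estimated by Monte Carlo sampling) can be sketched as follows; the toy dynamics, reward table, and function names are my own, not from the paper.

```python
import random

def estimate_transitions(step, n_states, n_actions, n_samples=1000, seed=0):
    """Estimate P[s' | s, a] by Monte Carlo sampling of the simulator `step`."""
    rng = random.Random(seed)
    P = [[[0.0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples):
                P[s][a][step(s, a, rng)] += 1.0 / n_samples
    return P

def value_iteration(P, R, gamma, tol=1e-8):
    """Ground-truth V* via value iteration on the estimated model."""
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states
    while True:
        V_new = [max(R[s][a] + gamma * sum(P[s][a][sp] * V[sp]
                                           for sp in range(n_states))
                     for a in range(n_actions))
                 for s in range(n_states)]
        if max(abs(V_new[s] - V[s]) for s in range(n_states)) < tol:
            return V_new
        V = V_new

# Toy stochastic dynamics: action 1 moves to state 1 w.p. 0.8, action 0 w.p. 0.2
def toy_step(s, a, rng):
    return 1 if rng.random() < (0.8 if a == 1 else 0.2) else 0

R = [[0.0, 0.0], [1.0, 1.0]]  # reward 1 for being in state 1
P = estimate_transitions(toy_step, n_states=2, n_actions=2)
V = value_iteration(P, R, gamma=0.9)
print(V)
```

Since the transition matrix is itself estimated, the resulting "ground truth" carries sampling error, which is consistent with the note above that the true value function is only estimated.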