Ideas from Adaptive-Resolution Reinforcement Learning with Efficient Exploration in Deterministic Domains: Andrey Bernstein, Nahum Shimkin

1. The action-mistake count [AMC(epsilon)] of a policy is the number of times, along an infinite-length trajectory, that it chooses an action a from state s with Q(s, a) < V(s) - epsilon (a small counting sketch follows item 4)
2. Action Mistake Bound: let B(epsilon) be polynomial in (|S|, |A|, 1/(1-gamma), 1/epsilon).  A learning algorithm is PAC if, for every epsilon > 0, AMC(epsilon) <= B(epsilon)
3. Let J(s) be the actual return obtained by the algorithm's policy starting from s.  The policy mistake count [PMC(epsilon)] is the number of times over a history that J(s) < V(s) - epsilon
4. AMC(epsilon) <= PMC(epsilon)
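A minimal sketch (not from the paper) of how AMC(epsilon) could be tallied over a finite prefix of a trajectory; the oracles q_value(s, a) and v_value(s) for the optimal Q and V are hypothetical and for illustration only:

```python
# Sketch only: q_value and v_value stand in for the (unknown) optimal Q and V,
# so this is an evaluation/illustration utility, not part of the learning algorithm.

def action_mistake_count(trajectory, q_value, v_value, epsilon):
    """trajectory: iterable of (state, action) pairs chosen by the agent."""
    mistakes = 0
    for s, a in trajectory:
        # An epsilon-action-mistake: the chosen action is more than epsilon
        # suboptimal at s.
        if q_value(s, a) < v_value(s) - epsilon:
            mistakes += 1
    return mistakes
```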
5. Use a continuity (Lipschitz) assumption: there exist known constants alpha, beta > 0 such that (an illustrative check follows this item):
1. |R(s1, a) - R(s2, a)| <= alpha * d(s1, s2)
2. d(T(s1, a), T(s2, a)) <= beta * d(s1, s2)
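As an illustration (not from the paper), the two conditions can be checked numerically for a known model over sampled state pairs; T, R, and the metric d are assumed inputs:

```python
# Illustrative check of the Lipschitz conditions on sampled state pairs.
# T(s, a) -> next state, R(s, a) -> reward, d(s1, s2) -> metric; all assumed given.

def satisfies_lipschitz(T, R, d, state_pairs, actions, alpha, beta):
    for s1, s2 in state_pairs:
        for a in actions:
            if abs(R(s1, a) - R(s2, a)) > alpha * d(s1, s2):
                return False  # reward condition violated
            if d(T(s1, a), T(s2, a)) > beta * d(s1, s2):
                return False  # transition condition violated
    return True
```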
6. These Lipschitz conditions imply continuity of the optimal value function
7. Define the modulus of continuity of the optimal value function V as:
1. w(z) := max over s1, s2 with d(s1, s2) <= z of |V(s1) - V(s2)|, for z > 0
8. Define a modulus of continuity bound (MCB) as a known function ~w(z) such that (a worked example follows this item):
1. w(z) <= ~w(z), for all z>0
2. ~w(z) -> 0 as z->0
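A worked example of an MCB, under the extra assumption gamma * beta < 1 (the constant below is derived from the Lipschitz conditions above, not quoted from the paper):

```latex
% Sketch: one Bellman-unrolling step under the Lipschitz assumptions.
\[
  |V(s_1) - V(s_2)|
    \le \max_a \big( |R(s_1,a) - R(s_2,a)|
        + \gamma\, |V(T(s_1,a)) - V(T(s_2,a))| \big)
    \le \alpha\, d(s_1,s_2) + \gamma\, w\big(\beta\, d(s_1,s_2)\big).
\]
% Hence w(z) <= alpha*z + gamma*w(beta*z); unrolling the recursion gives
\[
  w(z) \;\le\; \alpha z \sum_{k \ge 0} (\gamma\beta)^k
       \;=\; \frac{\alpha z}{1 - \gamma\beta},
  \qquad\text{so } \tilde{w}(z) := \frac{\alpha z}{1 - \gamma\beta}
  \text{ is a valid MCB when } \gamma\beta < 1.
\]
```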
9. The constants alpha and beta can be used to bound the accuracy of the estimates of T and R, given the training data and the diameters of the kernels
10. These bounds, together with the quantities above, are used to compute an upper value function (UVF): an upper bound on the optimal value function that converges to V* as the kernel supports shrink, though it is difficult to compute exactly in practice (a per-kernel backup sketch follows this item)
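A rough sketch of one optimistic backup over a finite set of kernels/cells. The data layout (r_hat and next_cell keyed by (cell, action), diam keyed by cell) and the way the slack terms use alpha, beta, and the MCB ~w are illustrative assumptions, not the paper's exact construction:

```python
# Sketch: one optimistic Bellman backup over a kernel/cell discretization.
# U: dict cell -> current upper-value estimate
# r_hat[(c, a)], next_cell[(c, a)]: sample-based estimates (missing key = no data)
# diam[c]: diameter of cell c; mcb(z): a modulus-of-continuity bound ~w(z)
# All names and the data layout are illustrative assumptions.

def optimistic_backup(U, cells, actions, r_hat, next_cell, diam,
                      mcb, alpha, beta, gamma, v_max):
    U_new = {}
    for c in cells:
        candidates = []
        for a in actions:
            if (c, a) not in r_hat:
                candidates.append(v_max)  # no data yet: fully optimistic value
            else:
                r_upper = r_hat[c, a] + alpha * diam[c]             # reward slack
                v_next = U[next_cell[c, a]] + mcb(beta * diam[c])   # value slack
                candidates.append(r_upper + gamma * v_next)
        U_new[c] = min(max(candidates), v_max)  # cap at v_max, e.g. Rmax / (1 - gamma)
    return U_new
```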
11. Propose splitting a kernel once its visit count exceeds a threshold (an illustrative splitting rule follows this item)
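An illustrative splitting rule (the data structures and the diameter halving are assumptions for the sketch, not the paper's exact scheme):

```python
# Sketch: refine resolution by splitting a cell once its visit count is large enough.
# split_cell(cell) is assumed to return child cells that cover `cell`.

def maybe_split(cell, cells, visit_count, diam, split_threshold, split_cell):
    if visit_count[cell] < split_threshold:
        return cells
    children = split_cell(cell)
    cells.remove(cell)
    cells.extend(children)
    for child in children:
        diam[child] = diam[cell] / 2.0   # assumed: each split halves the diameter
        visit_count[child] = 0
    return cells
```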
12. There is a bound on the number of splits performed, which in turn bounds the number of possible stationary policies
13. A bound is given on the policy mistake count (PMC)
14. The operator used to compute the upper value function is a contraction, so iterating it converges to a unique fixed point (a minimal fixed-point loop follows this item)
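Since the backup operator is a contraction (with modulus gamma in the sup norm for the sketch above), iterating it from any initialization converges; a minimal fixed-point loop, reusing the optimistic_backup sketch, might look like this:

```python
# Sketch: solve for the (approximate) UVF by iterating a contraction to a tolerance.

def solve_uvf(backup, U_init, tol=1e-6, max_iters=10_000):
    """backup(U) applies one optimistic backup and returns a new dict of values."""
    U = dict(U_init)
    for _ in range(max_iters):
        U_new = backup(U)
        if max(abs(U_new[c] - U[c]) for c in U) < tol:
            return U_new
        U = U_new
    return U
```

Here backup could be, e.g., lambda U: optimistic_backup(U, cells, actions, r_hat, next_cell, diam, mcb, alpha, beta, gamma, v_max).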