- The action-mistake count [AMC(epsilon)] of a policy is the number of times, along an infinite-length trajectory, that it chooses an action *a* from state *s* with Q(*s*, *a*) < V(*s*) - epsilon
- Action Mistake Bound: consider a bound *B* that is polynomial in (|S|, |A|, 1/(1-gamma), 1/epsilon). A learning algorithm is PAC if for all epsilon > 0, AMC(epsilon) < *B*
- Call *J*(*s*) the return of a policy starting from *s*. The policy mistake count [PMC(epsilon)] is the number of times over a history that J(*s*) < *V*(*s*) - epsilon
- AMC(epsilon) <= PMC(epsilon)
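The AMC definition above is simple to state in code. A hedged sketch, assuming oracle access to the optimal value functions (the names `q_fn`, `v_fn`, and `trajectory` are illustrative, not from the paper):

```python
def action_mistake_count(trajectory, q_fn, v_fn, epsilon):
    """Count steps where the chosen action is more than epsilon suboptimal.

    trajectory: iterable of (state, action) pairs visited by the policy.
    q_fn(s, a): optimal action value; v_fn(s): optimal state value.
    """
    mistakes = 0
    for s, a in trajectory:
        # An epsilon-mistake: the chosen action's value falls more than
        # epsilon below the optimal value of the current state.
        if q_fn(s, a) < v_fn(s) - epsilon:
            mistakes += 1
    return mistakes
```

In practice Q and V are unknown, so this is only the quantity the PAC analysis bounds, not something the learner can evaluate online.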
- Use a continuity assumption: there exist (known) alpha, beta > 0 s.t.:
    - |R(*s1*, *a*) - R(*s2*, *a*)| <= alpha * d(*s1*, *s2*)
    - d(T(*s1*, *a*), T(*s2*, *a*)) <= beta * d(*s1*, *s2*)

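On a toy domain, the smallest alpha and beta satisfying these Lipschitz conditions can be estimated by brute force over sampled state pairs. A minimal sketch, with made-up reward and transition functions standing in for a real domain:

```python
def lipschitz_constants(states, action, r, f, d=lambda x, y: abs(x - y)):
    """Empirically estimate the smallest alpha, beta over the given states s.t.
    |r(s1, a) - r(s2, a)| <= alpha * d(s1, s2) and
    d(f(s1, a), f(s2, a)) <= beta * d(s1, s2)."""
    alpha = beta = 0.0
    for i, s1 in enumerate(states):
        for s2 in states[i + 1:]:
            dist = d(s1, s2)
            if dist > 0:
                # Tighten alpha/beta to the largest observed ratio.
                alpha = max(alpha, abs(r(s1, action) - r(s2, action)) / dist)
                beta = max(beta, d(f(s1, action), f(s2, action)) / dist)
    return alpha, beta
```

This only lower-bounds the true constants (it checks finitely many pairs); the paper assumes alpha and beta are known a priori.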
- Continuity here implies continuity in the value function
- Define the modulus of continuity of the optimal value function V as:
    - w(*z*) := max over (*s1*, *s2* : d(*s1*, *s2*) <= *z*) of [V(*s1*) - V(*s2*)], for *z* > 0

- Define the modulus of continuity bound (*MCB*) as a function ~w(*z*) s.t.:
    - w(*z*) <= ~w(*z*), for all *z* > 0
    - ~w(*z*) -> 0 as *z* -> 0
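The modulus of continuity can be computed exactly over a finite state sample, which is a useful sanity check on a candidate MCB. A sketch, where `v_fn` and the metric `d` are illustrative placeholders:

```python
def modulus_of_continuity(states, v_fn, z, d=lambda x, y: abs(x - y)):
    """Max of |V(s1) - V(s2)| over sampled state pairs within distance z."""
    w = 0.0
    for i, s1 in enumerate(states):
        for s2 in states[i + 1:]:
            if d(s1, s2) <= z:
                w = max(w, abs(v_fn(s1) - v_fn(s2)))
    return w
```

Taking the maximum over unordered pairs with the absolute difference matches the definition above, since the max over ordered pairs of V(*s1*) - V(*s2*) is attained with the larger value first.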

- The values of alpha and beta, together with the training data and the chosen kernel diameters, can be used to bound the accuracy of the estimates of T and R
- These bounds and the previously described quantities are used to compute an upper value function (UVF): an upper bound on the optimal value function that is difficult to compute in practice but converges to V* as the kernel supports shrink
- Propose splitting a kernel once its visit count exceeds some threshold
- There is a bound on the number of splits taken, which in turn bounds the number of possible stationary policies
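The visit-count splitting rule can be sketched on a 1-D state space. The binary interval split, the `Cell` class, and the threshold name are assumptions for illustration, not the paper's exact construction:

```python
class Cell:
    """A kernel support (here, a half-open interval) with a visit counter."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.visits = 0

def visit(cells, s, split_threshold):
    """Record a visit to the cell containing s; split it once over threshold."""
    for i, c in enumerate(cells):
        if c.lo <= s < c.hi:
            c.visits += 1
            if c.visits > split_threshold:
                # Replace the saturated cell with two half-width children,
                # each starting with a fresh visit count.
                mid = (c.lo + c.hi) / 2.0
                cells[i:i + 1] = [Cell(c.lo, mid), Cell(mid, c.hi)]
            return
```

Because each split halves a cell's diameter, the number of splits needed to reach a given resolution is bounded, which is what bounds the number of stationary policies in the analysis.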
- A bound is given on the policy mistake count (PMC)
- The operator used to compute the upper value function is a contraction
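Since the operator is a contraction (for gamma < 1), the upper value function can be found by fixed-point iteration, value-iteration style. A minimal sketch on a finite deterministic toy model; the `bonus` term standing in for the optimism/uncertainty component is an assumption, not the paper's exact operator:

```python
def upper_value_iteration(n_states, actions, next_state, reward, bonus,
                          gamma, tol=1e-8):
    """Iterate U(s) <- max_a [reward(s,a) + bonus(s,a) + gamma * U(next_state(s,a))]
    to its fixed point. With gamma < 1 this update is a gamma-contraction
    in the sup norm, so the iteration converges from any initialization."""
    U = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            new = max(reward(s, a) + bonus(s, a) + gamma * U[next_state(s, a)]
                      for a in actions)
            delta = max(delta, abs(new - U[s]))
            U[s] = new
        if delta < tol:
            return U
```

The optimistic `bonus` is what makes U an *upper* bound on V*; as kernel supports shrink the bonus vanishes and U converges to V*, per the note above.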

## Ideas from Adaptive-Resolution Reinforcement Learning with Efficient Exploration in Deterministic Domains: Andrey Bernstein, Nahum Shimkin

**Tagged** Adaptive-Resolution Reinforcement Learning with Efficient Exploration in Deterministic Domains, Andrey Bernstein, Nahum Shimkin

Are these notes from a paper? Also, it sounds like it has some overlap with Ormoneit and Sen. Are you familiar with that work?

Yes, the authors and title are in the tags, but I just noticed they don't show up when reading the post with this skin, so I'll add that to the title.

I'm not aware of the Ormoneit and Sen paper (although I think I've heard it mentioned when I wasn't aware of the context), so I'll check it out.