Ideas from Adaptive-Resolution Reinforcement Learning with Efficient Exploration in Deterministic Domains: Andrey Bernstein, Nahum Shimkin


  1. The action-mistake count [AMC(epsilon)] of a policy is the number of times, along an infinite-length trajectory, that it chooses an action a in state s with Q(s, a) < V(s) - epsilon (a minimal counting sketch appears after this list)
  2. Action-mistake bound: consider a bound B that is polynomial in (|S|, |A|, 1/(1-gamma), 1/epsilon).  A learning algorithm is PAC if, for all epsilon > 0, AMC(epsilon) < B
  3. Let J(s) be the return of the policy starting from s.  The policy mistake count [PMC(epsilon)] is the number of times over a history that J(s) < V(s) - epsilon
  4. AMC(epsilon) <= PMC(epsilon)
  5. Use a continuity assumption: there exist (known) alpha, beta > 0 s.t.:
    1. |R(s1, a) - R(s2, a)| <= alpha * d(s1, s2)
    2. d(T(s1, a), T(s2, a)) <= beta * d(s1, s2)
  6. Continuity here implies continuity of the optimal value function (a short derivation appears after the list)
  7. Define the modulus of continuity of the optimal value function V as:
    1. w(z) := max_{s1, s2 : d(s1, s2) <= z} |V(s1) - V(s2)|, for z > 0
  8. Define the modulus of continuity bound (MCB) as ~w(z) s.t.:
    1. w(z) <= ~w(z), for all z>0
    2. ~w(z) -> 0 as z->0
  9. The values of alpha and beta, together with the way the kernel diameters are set, give bounds on the accuracy of the estimates of T and R obtained from the training data
  10. These bounds and the quantities above are used to compute an upper value function (UVF): an upper bound on the optimal value function that is difficult to compute exactly in practice, but converges to V* as the kernel supports shrink
  11. Propose splitting a kernel once its visit count exceeds some threshold (see the splitting sketch after the list)
  12. There is a bound on the number of splits taken, which in turn bounds the number of possible stationary policies
  13. A bound is given on the policy mistake count (PMC)
  14. The operator used to compute the upper value function is a contraction (a generic contraction argument is sketched below)
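
To make the mistake-count definitions in items 1-4 concrete, here is a minimal counting sketch (mine, not the paper's): it assumes hypothetical oracles q_star and v_star for the optimal action-value and value functions, a policy callable, and a deterministic step function, and simply tallies epsilon-suboptimal action choices along a finite trajectory prefix.

```python
# Minimal sketch of tallying the action-mistake count AMC(epsilon) along a
# finite prefix of a trajectory. The oracles q_star(s, a) and v_star(s), the
# policy(s) callable, and the deterministic transition step(s, a) are all
# hypothetical placeholders, not the paper's API.

def action_mistake_count(policy, step, q_star, v_star, s0, epsilon, horizon=10_000):
    """Count visits where the chosen action is more than epsilon suboptimal."""
    mistakes = 0
    s = s0
    for _ in range(horizon):
        a = policy(s)
        # An "action mistake": Q*(s, a) falls short of V*(s) by more than epsilon.
        if q_star(s, a) < v_star(s) - epsilon:
            mistakes += 1
        s = step(s, a)  # deterministic domain: next state is a function of (s, a)
    return mistakes
```

In the PAC sense of item 2, an efficient learner keeps this tally below a polynomial bound B no matter how long the trajectory grows.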
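
For items 5-8, one standard way to turn the Lipschitz constants alpha and beta into an explicit modulus-of-continuity bound is to replay one state's optimal action sequence from the other, nearby state. The closed form below is an illustrative special case that assumes gamma * beta < 1; the paper's own treatment is more general.

```latex
% Sketch: if |R(s_1,a) - R(s_2,a)| \le \alpha\, d(s_1,s_2) and
% d(T(s_1,a), T(s_2,a)) \le \beta\, d(s_1,s_2), then following the same action
% sequence from both states keeps them within \beta^t d(s_1,s_2) after t steps,
% so for d(s_1,s_2) \le z,
\[
  |V(s_1) - V(s_2)|
    \;\le\; \sum_{t=0}^{\infty} \gamma^{t}\, \alpha\, \beta^{t}\, d(s_1, s_2)
    \;=\; \frac{\alpha\, d(s_1, s_2)}{1 - \gamma\beta}
    \;\le\; \frac{\alpha z}{1 - \gamma\beta},
    \qquad \text{assuming } \gamma\beta < 1.
\]
% Hence \tilde{w}(z) = \alpha z / (1 - \gamma\beta) is a valid MCB in this case:
% it dominates w(z) and tends to 0 as z \to 0.
```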
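
A toy picture of the splitting rule in item 11 (illustrative names and a 1-D interval representation, not the paper's actual kernel data structures): each leaf cell counts visits and splits in half once the count passes a threshold, which is what shrinks the kernel supports and tightens the estimates from items 9-10.

```python
# Toy 1-D sketch of adaptive-resolution splitting: a cell covering an interval
# of state space records how many visits it has absorbed, and splits in half
# once the count exceeds a threshold. Names and structure are illustrative only.

class Cell:
    def __init__(self, lo, hi, split_threshold=20):
        self.lo, self.hi = lo, hi
        self.split_threshold = split_threshold
        self.visits = 0
        self.children = None  # becomes (left, right) after a split

    def locate(self, s):
        """Return the leaf cell whose interval contains state s."""
        if self.children is None:
            return self
        left, right = self.children
        return left.locate(s) if s < left.hi else right.locate(s)

    def visit(self, s):
        """Record a visit; split the leaf when its count exceeds the threshold."""
        leaf = self.locate(s)
        leaf.visits += 1
        if leaf.visits > leaf.split_threshold:
            mid = 0.5 * (leaf.lo + leaf.hi)
            leaf.children = (Cell(leaf.lo, mid, leaf.split_threshold),
                             Cell(mid, leaf.hi, leaf.split_threshold))

# Example: repeated visits near 0.3 progressively refine that region only.
root = Cell(0.0, 1.0)
for s in [0.29, 0.30, 0.31, 0.32] * 10:
    root.visit(s)
```

The bound in item 12 on the total number of splits then caps how many distinct discretizations, and hence how many stationary policies over them, can ever arise.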
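
On item 14, the contraction claim follows the same argument as for the standard Bellman optimality operator; the sketch below uses a generic optimistic backup with a value-independent uncertainty term c(s, a), which is an assumed form rather than the paper's exact operator.

```latex
% Generic sketch: any deterministic optimistic backup of the form
\[
  (\mathcal{T} U)(s) \;=\; \max_{a} \Big[\, \hat{R}(s,a) + c(s,a) + \gamma\, U\big(\hat{T}(s,a)\big) \,\Big],
\]
% with a bonus c(s,a) that does not depend on U, satisfies
\[
  \| \mathcal{T} U_1 - \mathcal{T} U_2 \|_\infty \;\le\; \gamma\, \| U_1 - U_2 \|_\infty ,
\]
% so repeated application converges to a unique fixed point (here, the UVF),
% exactly as for ordinary value iteration.
```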

2 thoughts on “Ideas from Adaptive-Resolution Reinforcement Learning with Efficient Exploration in Deterministic Domains: Andrey Bernstein, Nahum Shimkin”

  1. Michael Littman says:

    Are these notes from a paper? Also, it sounds like it has some overlap with Ormoneit and Sen. Are you familiar with that work?

  2. hut3 says:

    Yes, the authors and title are in the tags, but I just noticed they don’t show up when reading the post with this skin – I’ll add that to the title.

    I’m not aware of the Ormoneit and Sen paper (although I think I’ve heard it mentioned when I wasn’t aware of the context), so I’ll check it out.
