Inferring bounds on the performance of a control policy from a sample of trajectories. Fonteneau, Murphy, Wehenkel, Ernst

  1. Interested in inferring bounds on the finite-horizon return of a policy from an off-policy sample of trajectories
    1. Is this relevant to qPi stuff?
  2. Assumes dynamics, policy, and reward are deterministic and Lipschitz
  3. Does not require trajectories; only requires SARS (state, action, reward, next-state) tuples
  4. Discusses an algorithm very similar to Viterbi for identifying the sequence of SARS tuples that yields the best bound; the tightness of the bound is also covered
    1. So is qPi solved then? No, qPi is the other way around.
    2. Bound is related to alpha*C, where C is some constant and alpha is the maximum distance between any element of the state-action space and its closest SARS sample
  5. Policies considered can be time-dependent but the domain itself is time-invariant
  6. Everything in the domain is Lipschitz (with known constants), but the dynamics are unknown
  7. Goal is to find a lower bound on the return over T steps in an MDP, for any policy and any start state
  8. This paper finally proves that the value function is Lipschitz if the MDP is as well.  I’ve seen many, many papers make this claim, but this is the first I’ve seen that proves it.
  9. Flipping the inequalities gives upper bounds instead of lower bounds.
  10. “…the lower (and upper) bound converges at least linearly towards the true value of the return with the density of the sample”
  11. “The proposed approach could also be used in combination with batch-mode reinforcement learning algorithms for identifying the pieces of trajectories that influence the most the lower bounds of the RL policy and, from there, for selecting a concise set of four-tuples from which it is possible to extract a good policy.”
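The Viterbi-like search in item 4 can be sketched as a dynamic program over the sampled tuples: each step picks the sampled transition whose reward, minus a Lipschitz penalty on the distance from where the trajectory "should" be, contributes most to the bound. This is only a rough illustration under assumptions of my own — an open-loop action sequence, 1-D states and actions with a Manhattan metric, a penalty of the form L_Q(N) = Lr * sum_k Lf^k, and hypothetical names throughout; it is not the paper's exact formulation.

```python
def lower_bound_viterbi(tuples, x0, actions, Lf, Lr):
    """Dynamic program over SARS tuples (s_i, a_i, r_i, s'_i): choose the
    sequence of sampled transitions maximizing a Lipschitz lower bound on
    the T-step return of the open-loop action sequence `actions`.
    Lf, Lr: assumed known Lipschitz constants of dynamics and reward."""
    T, n = len(actions), len(tuples)

    # Lipschitz constant of the N-step value: L_Q(N) = Lr * sum_k Lf^k
    def LQ(N):
        return Lr * sum(Lf ** k for k in range(N))

    # Toy 1-D metric on (state, action) pairs.
    def dist(s, a, tup):
        si, ai, _, _ = tup
        return abs(s - si) + abs(a - ai)

    # best[i]: best partial bound for a length-(t+1) sequence ending in tuple i.
    best = [tuples[i][2] - LQ(T) * dist(x0, actions[0], tuples[i])
            for i in range(n)]
    for t in range(1, T):
        # Transition penalty: distance between tuple i's successor state
        # (paired with the next action) and tuple j's (state, action).
        best = [max(best[i]
                    - LQ(T - t) * dist(tuples[i][3], actions[t], tuples[j])
                    for i in range(n)) + tuples[j][2]
                for j in range(n)]
    return max(best)
```

Because the per-step penalty only couples consecutive tuples, the maximization over all n^T sequences decomposes exactly like Viterbi decoding, costing O(T * n^2) instead of exhaustive enumeration.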
