Inferring bounds on the performance of a control policy from a sample of trajectories. Fonteneau, Murphy, Wehenkel, Ernst

Interested in inferring bounds on the finite-horizon return of a policy from an off-policy sample of trajectories

Is this relevant to qPi stuff?

Assumes dynamics, policy, and reward are deterministic and Lipschitz

Does not require full trajectories; only requires SARSes (one-step (s, a, r, s') tuples)

Discusses an algorithm very similar to Viterbi for identifying the sequence of SARSes leading to the best bound; tightness of the bound is also covered
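To make the Viterbi analogy concrete, here is a simplified sketch of that dynamic program, reconstructed from these notes rather than taken from the paper: chain SARS tuples together, paying each tuple's reward minus a Lipschitz penalty for the gap between the tuple's (x, u) and the state-action pair the chain actually reaches. The function names, the Euclidean metric on concatenated (x, u), and the exact penalty schedule are my assumptions.

```python
import math

def lower_bound(samples, policy, x0, T, L_f, L_rho):
    """Viterbi-style DP lower bound on the T-step return (sketch).

    samples: list of one-step tuples (x, u, r, y), with x, u, y tuples of floats
    policy:  h(t, x) -> action tuple (time-dependent policies allowed)
    L_f, L_rho: assumed-known Lipschitz constants of dynamics and reward
    """
    # Lipschitz constant of the (T - t)-step return: L_rho * sum_i L_f^i
    L_Q = [L_rho * sum(L_f ** i for i in range(T - t)) for t in range(T)]

    # Step 0: penalty measured from the true start state-action pair.
    best = [r - L_Q[0] * math.dist(x + u, x0 + policy(0, x0))
            for (x, u, r, y) in samples]

    # Step t: extend the best chain ending in tuple j by tuple i,
    # penalizing the mismatch between j's successor and i's (x, u).
    for t in range(1, T):
        best = [r + max(best[j] - L_Q[t] * math.dist(x + u, yj + policy(t, yj))
                        for j, (_, _, _, yj) in enumerate(samples))
                for (x, u, r, _) in samples]
    return max(best)
```

With n tuples this is O(T * n^2), which matches the Viterbi flavor of the paper's algorithm; when the sample happens to contain the exact on-policy transitions, every penalty is zero and the bound equals the true return.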

So is qPi solved, then? No: qPi is the other way around.

Tightness of the bound scales with alpha*C, where C is some constant and alpha is the sample dispersion: the maximum distance between any element of the state-action space and its closest SARS sample
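The dispersion alpha is easy to state in code. A minimal sketch, assuming a finite `grid` of candidate state-action pairs standing in for the (continuous) sup over the state-action space; `grid` and `dispersion` are my names, not the paper's:

```python
import math

def dispersion(grid, samples):
    """alpha: worst-case distance from a state-action pair to the sample.

    grid:    hypothetical finite discretization of the state-action space
    samples: (x, u) pairs of the available SARS tuples, as flat tuples
    """
    # For each candidate pair, find its nearest sample, then take the worst case.
    return max(min(math.dist(p, s) for s in samples) for p in grid)
```

As the sample densifies, alpha shrinks and the alpha*C gap closes, which is the intuition behind the linear convergence claim quoted below.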

Policies considered can be time-dependent but the domain itself is time-invariant

Everything in the domain is Lipschitz (with known constants), but the dynamics are unknown

Goal is to find a lower bound on the return over T steps in an MDP for any policy from any start state

This paper finally proves that the value function is Lipschitz if the MDP is. I've seen many papers make this claim, but this is the first I've seen that proves it.
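A sketch of the induction behind that proof, reconstructed by me (notation only loosely follows the paper): with L_f and L_rho the Lipschitz constants of the dynamics and reward, one step of the Bellman recursion gives

```latex
|Q_N(x,u) - Q_N(x',u)|
  \le |\rho(x,u) - \rho(x',u)| + |Q_{N-1}(f(x,u), u') - Q_{N-1}(f(x',u), u')|
  \le L_\rho \|x - x'\| + L_{Q_{N-1}} L_f \|x - x'\|
```

so by induction L_{Q_N} <= L_rho * (1 + L_f + ... + L_f^{N-1}), which is finite for any fixed horizon N even when L_f > 1.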

Flipping the equations gives upper bounds instead of lower bounds.

“…the lower (and upper) bound converges at least linearly towards the true value of the return with the density of the sample”

“The proposed approach could also be used in combination with batch-mode reinforcement learning algorithms for identifying the pieces of trajectories that influence the most the lower bounds of the RL policy and, from there, for selecting a concise set of four-tuples from which it is possible to extract a good policy.”