- Interested in inferring bounds on finite horizon return of a policy from an off-policy sample of trajectories
- Is this relevant to qPi stuff?
- Assumes dynamics, policy, and reward are deterministic and Lipschitz
- Does not require full trajectories; only requires one-step SARS tuples (state, action, reward, next state)
- Discusses an algorithm, very similar to Viterbi, to identify the sequence of SARS tuples leading to the best bound; tightness of the bound is also covered
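A rough sketch of how I picture that Viterbi-like DP, in Python with scalar states and actions. The penalty constants (`L_Q`) and the chaining rule follow my reading of the paper, so treat the names and details as assumptions, not a faithful implementation:

```python
import numpy as np

def lower_bound(samples, policy, x0, T, L_f, L_rho, L_h):
    """Viterbi-style DP over one-step samples (x, u, r, y).

    Chains T samples so that the Lipschitz-penalised sum of their rewards
    is maximised, giving a lower bound on the T-step return of `policy`
    from start state x0.  Sketch only: constants reflect my reading of
    the paper, not a verified implementation.
    """
    # Assumed Lipschitz constant of the N-step value function, with
    # closed-loop dynamics Lipschitz with constant L_f * (1 + L_h).
    def L_Q(N):
        k = L_f * (1 + L_h)
        return L_rho * sum(k ** t for t in range(N))

    xs, us, rs, ys = (np.array([s[i] for s in samples]) for i in range(4))

    # Stage 0: distance penalty measured from the start state x0.
    d = np.abs(xs - x0) + np.abs(us - policy(0, x0))
    best = rs - L_Q(T) * d  # best[l] = best bound for length-1 chains ending at sample l

    # Stages 1 .. T-1: extend the best chain ending at each sample.
    for t in range(1, T):
        new_best = np.empty_like(best)
        for l in range(len(samples)):
            d = np.abs(xs[l] - ys) + np.abs(us[l] - np.array([policy(t, y) for y in ys]))
            new_best[l] = np.max(best - L_Q(T - t) * d) + rs[l]
        best = new_best
    return float(np.max(best))
```

On a toy problem where the sample happens to contain the policy's exact trajectory, the distance penalties vanish and the bound should equal the true return.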
- So is qPi solved, then? No; qPi is the other way around.
- Bound is related to alpha * C, where C is some constant and alpha is the maximum distance between any element of the state-action space and its closest SARS sample
- Policies considered can be time-dependent, but the domain itself is time-invariant
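As I read it, the guarantee has roughly this shape (my notation; $C$ collects the Lipschitz constants and the horizon $T$):

$$
V^h_T(x_0) \;\ge\; B(x_0) \;\ge\; V^h_T(x_0) - C\,\alpha,
\qquad
\alpha \;=\; \max_{(x,u)} \min_{l} \big\| (x,u) - (x^l, u^l) \big\|,
$$

so the bound tightens linearly as the sample covers the state-action space more densely.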
- Everything in the domain is Lipschitz (with known constants), but the dynamics themselves are unknown
- Goal is to find a lower bound on the return over T steps in an MDP for any policy from any start state
- **This paper finally proves that the value function is Lipschitz if the MDP is as well.** I've seen many, many papers make this claim, but this is the first I've seen that proves it.
- Flipping the equations can give upper bounds instead of lower bounds.
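A sketch of how I think the induction goes (my notation, using the sum metric on state-action pairs): with the closed-loop step $x' = f(x, h(x))$,

$$
\begin{aligned}
|V_N(x) - V_N(x')|
&\le |\rho(x, h(x)) - \rho(x', h(x'))| + |V_{N-1}(f(x, h(x))) - V_{N-1}(f(x', h(x')))| \\
&\le L_\rho (1 + L_h)\,\|x - x'\| + L_{V_{N-1}}\, L_f (1 + L_h)\,\|x - x'\|,
\end{aligned}
$$

so unrolling the recursion gives $L_{V_N} = L_\rho (1 + L_h) \sum_{t=0}^{N-1} \big[ L_f (1 + L_h) \big]^t$, which is finite for any finite horizon $N$.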
- “…the lower (and upper) bound converges at least linearly towards the true value of the return with the density of the sample”
- “The proposed approach could also be used in combination with batch-mode reinforcement learning algorithms for identifying the pieces of trajectories that influence the most the lower bounds of the RL policy and, from there, for selecting a concise set of four-tuples from which it is possible to extract a good policy.”