- Interested in inferring bounds on finite horizon return of a policy from an off-policy sample of trajectories
- Is this relevant to qPi stuff?
- Assumes dynamics, policy, and reward are deterministic and Lipshitz
- Does not require trajectories; only requires SARSes
- Discuss an algorithm very similar to Viterbi to identify the the sequence of SARSes leading to the best bound, tightness of bound is also covered
- So is qPi solved then, no qPi is the other way around.
- Bound is related to alpha*
*C*, where*C*is some constant and alpha is the maximum distance between any element of the state-action space and its closest SARS sample - Policies considered can be time-dependent but the domain itself is time-invariant
- Everything in domain is Lipshitz (with known constant), but dynamics are unknown
- Goal is to find a lower bound on the return over T steps in an MDP for any policy from any start state
**This paper finally proves that the value function is Lipshitz if the MDP is as well.**I’ve seen many many papers make this claim, but this is the first I’ve seen that proves it.- Flipping equations can give upper bounds instead of lower bounds.
- “…the lower (and upper) bound converges at least linearly towards the true value of the return with the density of the sample”
- “The proposed approach could also be used in combination with batch-mode reinforcement learning algorithms for identifying the pieces of trajectories that influence the most the lower bounds of the RL policy and, from there, for selecting a concise set of four-tuples from which it is possible to extract a good policy.”

Advertisements
(function(g,$){if("undefined"!=typeof g.__ATA){
g.__ATA.initAd({collapseEmpty:'after', sectionId:26942, width:300, height:250});
g.__ATA.initAd({collapseEmpty:'after', sectionId:114160, width:300, height:250});
}})(window,jQuery);
var o = document.getElementById('crt-370575268');
if ("undefined"!=typeof Criteo) {
var p = o.parentNode;
p.style.setProperty('display', 'inline-block', 'important');
o.style.setProperty('display', 'block', 'important');
Criteo.DisplayAcceptableAdIfAdblocked({zoneid:388248,containerid:"crt-370575268",collapseContainerIfNotAdblocked:true,"callifnotadblocked": function () {var o = document.getElementById('crt-370575268'); o.style.setProperty('display','none','important');o.style.setProperty('visbility','hidden','important'); } });
} else {
o.style.setProperty('display', 'none', 'important');
o.style.setProperty('visibility', 'hidden', 'important');
}
var o = document.getElementById('crt-889830702');
if ("undefined"!=typeof Criteo) {
var p = o.parentNode;
p.style.setProperty('display', 'inline-block', 'important');
o.style.setProperty('display', 'block', 'important');
Criteo.DisplayAcceptableAdIfAdblocked({zoneid:837497,containerid:"crt-889830702",collapseContainerIfNotAdblocked:true,"callifnotadblocked": function () {var o = document.getElementById('crt-889830702'); o.style.setProperty('display','none','important');o.style.setProperty('visbility','hidden','important'); } });
} else {
o.style.setProperty('display', 'none', 'important');
o.style.setProperty('visibility', 'hidden', 'important');
}