- Its basically about a safer version of policy search
- Whats wrong with current methods of policy search? They present a couple of very simple/easy MDPs where policy gradient methods take an extremely long (exponential) amount of time to solve.
- Requires what they call a restart distribution, which is weaker than a generative model, gives samples of next states from a chosen distribution, which allows the algorithm to get data from states that normally would not be visited
- Requires a black box greedy policy chooser which picks what it thinks is the best action based on the current policy. The trick here is that the greedy policy chooser isn’t always followed
- Algorithm is called conservative policy iteration which has the following properties:
- Policy is improved in a more uniform manner over the state space, by incorporating exploration allowed by restart distribution
- A more conservative policy update is done which keeps some of the current policy and improves the policy of the other states, which is supposed to alleviate problems of policy degredation that arise from approximate methods
- Results are for an algorithm that is near-optimal that requires a small number of steps independent of the state space size
- Policy generated is stochastic
- Seems to be for discrete spaces?
- They want answers to the following questions:
- Is there a performance measure that is guaranteed to improve at each step?
- How difficult is it to verify if a particular update improves this measure? (This seems most related to Qpi)
- After a reasonable # of policy updates, what is the performance level?
- Argue that greedy DP and policy gradient methods dont have good answers to these questions
- Their arguments against greedy DP aren’t very convincing, but they do say there isn’t enough theoretical results on the approach to say that it is
*good*w.r.t. to the three issues above - At least policy gradient is always guaranteed to improve (I’m not sure this is so imporant, thats why we care about explore-exploit). Somewhat related to this, they say the problem with gradient methods is that because of lack of exploration, a very large number of samples is required to estimate gradient direction.
- Give argument against good performance of any on-policy method (gradient is one of these) in simple hallworld-domain. Issue is that the gradient is almost always zero unless the goal is found, which leads to random (undirected) wandering, which makes it hard to find the goal, so the gradient seems to be zero everywhere unless a really huge amount of wandering is done (say importance sampling exists, but isnt very helpful)
- Have an even worse example in a trivial 2-state MDP
- Basic argument is that policy gradient methods may require data from hard to reach states based on current policy to improve policy
- Trick is to optimize the policy wrt a more global distribution over start states as opposed to the real distribution over start states, it makes sure you optimize all states
- Discuss policy improvement w.r.t. gradient, give a bound in expected benefit of doing policy improvement, the greater the policy advantage the greater the performance increase
- Use an epsilon greedy policy chooser, policy improvement no longer occurs when the greedy chooser no longer grants an advantage
- Conservative Policy Iteration Algorithm:
- Get epsilon-greedy policy
- Estimate the advantage (if this is a required step, I don’t think this paper is relevant to Qpi. Nevermind, they talk about how to estimate this)
- If advantage is small stop
- Else update policy and loop
- Says optimal policy advantage of pi is less than 2 epsilon (but this variable is overloaded, so I’m not sure which they mean)
- Algorithm terminates in polynomial time
- In order to make an epsilon-optimal accurate estimate of the advantage, a certian number of restarts are done and I guess the result is the mean of the trajectories, use Hoeffding bound
- I think the meat of where this is relevant to Qpi is on page 6
- Show with proof that using a uniform starting distribution is more helpful than the actual starting distribution to compute the advantage
- Quick discussion of algorithm used with function approximators

Advertisements
(function(){var c=function(){var a=document.getElementById("crt-924828737");window.Criteo?(a.parentNode.style.setProperty("display","inline-block","important"),a.style.setProperty("display","block","important"),window.Criteo.DisplayAcceptableAdIfAdblocked({zoneid:388248,containerid:"crt-924828737",collapseContainerIfNotAdblocked:!0,callifnotadblocked:function(){a.style.setProperty("display","none","important");a.style.setProperty("visbility","hidden","important")}})):(a.style.setProperty("display","none","important"),a.style.setProperty("visibility","hidden","important"))};if(window.Criteo)c();else{if(!__ATA.criteo.script){var b=document.createElement("script");b.src="//static.criteo.net/js/ld/publishertag.js";b.onload=function(){for(var a=0;a<__ATA.criteo.cmd.length;a++){var b=__ATA.criteo.cmd[a];"function"===typeof b&&b()}};(document.head||document.getElementsByTagName("head")[0]).appendChild(b);__ATA.criteo.script=b}__ATA.criteo.cmd.push(c)}})();
(function(){var c=function(){var a=document.getElementById("crt-567673971");window.Criteo?(a.parentNode.style.setProperty("display","inline-block","important"),a.style.setProperty("display","block","important"),window.Criteo.DisplayAcceptableAdIfAdblocked({zoneid:837497,containerid:"crt-567673971",collapseContainerIfNotAdblocked:!0,callifnotadblocked:function(){a.style.setProperty("display","none","important");a.style.setProperty("visbility","hidden","important")}})):(a.style.setProperty("display","none","important"),a.style.setProperty("visibility","hidden","important"))};if(window.Criteo)c();else{if(!__ATA.criteo.script){var b=document.createElement("script");b.src="//static.criteo.net/js/ld/publishertag.js";b.onload=function(){for(var a=0;a<__ATA.criteo.cmd.length;a++){var b=__ATA.criteo.cmd[a];"function"===typeof b&&b()}};(document.head||document.getElementsByTagName("head")[0]).appendChild(b);__ATA.criteo.script=b}__ATA.criteo.cmd.push(c)}})();