- According to Path Integral Policy Improvement with Covariance Matrix Adaptation, this is the first paper that uses cross-entropy methods for policy search.
- Survey of policy gradient methods in:
- Experiments with Infinite-Horizon, Policy-Gradient Estimation. JAIR 2001

- Connection between stochastic optimization and Q-Learning made in:
- Asynchronous Stochastic Approximation and Q-learning. Tsitsiklis. Machine Learning 1994.

- Standard stochastic optimization is slow because of constraints in the way the update can be performed
- The approach here (cross-entropy), on the other hand, is designed to be fast and works differently from standard stochastic optimization
- For background on cross-entropy optimization, see:
- A Tutorial on the Cross-Entropy Method. de Boer, Kroese, Mannor, Rubinstein. Annals of Operations Research
- The Cross-Entropy Method for Combinatorial and Continuous Optimization. Rubinstein. Methodology and Computing in Applied Probability 1999.
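
As a rough illustration of the general cross-entropy loop for continuous optimization, here is a minimal sketch. It is not taken from the paper or the tutorial: the diagonal Gaussian sampling distribution, the function name `cross_entropy_minimize`, the test objective `sphere`, and all parameter values are assumptions chosen for illustration. Each iteration samples a generation, keeps the best-scoring samples, and refits the sampling distribution to them, with a smoothing factor `alpha` to slow convergence.

```python
import random
import statistics


def cross_entropy_minimize(f, dim, n_samples=100, n_elite=10,
                           n_iters=50, alpha=0.7, seed=0):
    """Minimize f over R^dim with a diagonal-Gaussian cross-entropy search.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [5.0] * dim  # start wide so early samples are spread out
    for _ in range(n_iters):
        # 1. Sample a generation from the current distribution.
        samples = [[rng.gauss(mean[d], std[d]) for d in range(dim)]
                   for _ in range(n_samples)]
        # 2. Keep the lowest-cost samples.
        best_samples = sorted(samples, key=f)[:n_elite]
        # 3. Refit mean and std to the kept samples, smoothed by alpha
        #    so the distribution does not collapse in a single step.
        for d in range(dim):
            column = [x[d] for x in best_samples]
            mean[d] = alpha * statistics.fmean(column) + (1 - alpha) * mean[d]
            std[d] = alpha * statistics.pstdev(column) + (1 - alpha) * std[d]
    return mean


def sphere(x):
    """Hypothetical test objective with its minimum at (3, 3, ...)."""
    return sum((xi - 3.0) ** 2 for xi in x)


best = cross_entropy_minimize(sphere, dim=2)
```

Note that the objective is only ever evaluated, never differentiated, so the same loop applies to objectives for which no gradient is available.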

- “Good” samples in cross ent are usually referred to as elites
- They claim the method is generally robust as long as the sample size is large and the initial samples are spread out; some additional smoothing of the updates may be used (overly fast convergence seems to be an issue that comes up in some of the literature)
- A variant is proposed that is more adaptive in choice of generation size and how many samples are elites
- In this paper, cross ent is used in a finite MDP to generate a stochastic policy
- Here they care about finding a global policy, and they update the policy at every state based on the history after that state was visited during a trajectory
- In section 4 they discuss parameterized policies (as opposed to ones that just encode p(a|s) in a matrix)
- When cross entropy is used in this (the usual) way, i.e. with a parameterized policy, mu(·|s, theta) doesn’t have to be differentiable w.r.t. theta, which is not true of policy gradient algorithms
- Alg is tested empirically in a gridworld as well as inventory control (in this problem, the policy is the stock level at which each item is reordered)
- They say that other optimization methods can be used, but many are sensitive to sampling error (by that I think they mean noise). They say that gradient methods, as well as simulated annealing and guided local search, all have this problem
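
To make the tabular case concrete, here is a minimal sketch of cross-entropy search over a stochastic policy stored as a p(a|s) matrix. It is a simplification under stated assumptions, not the paper's algorithm: the six-state corridor MDP, the function names, and all parameter values are hypothetical, and whole trajectories are scored by total reward, whereas the paper updates each state using the history after that state was visited. Each row of the matrix is re-estimated from the action frequencies observed in the elite trajectories, with smoothing.

```python
import random

# Hypothetical corridor MDP: states 0..5, start at 0, goal at 5.
N_STATES, ACTIONS, HORIZON = 6, (-1, +1), 30


def rollout(policy, rng):
    """Run one episode; return (total reward, list of (state, action index))."""
    state, total, visited = 0, 0.0, []
    for _ in range(HORIZON):
        if state == N_STATES - 1:
            break  # reached the goal
        a = 0 if rng.random() < policy[state][0] else 1
        visited.append((state, a))
        state = min(max(state + ACTIONS[a], 0), N_STATES - 1)
        total -= 1.0  # each step costs 1, so shorter paths score higher
    return total, visited


def cross_entropy_policy_search(n_samples=50, n_elite=10, n_iters=20,
                                alpha=0.7, seed=0):
    """Illustrative defaults only; none of these values come from the paper."""
    rng = random.Random(seed)
    policy = [[0.5, 0.5] for _ in range(N_STATES)]  # uniform p(a|s) to start
    for _ in range(n_iters):
        episodes = [rollout(policy, rng) for _ in range(n_samples)]
        episodes.sort(key=lambda e: e[0], reverse=True)
        # Count how often each action was taken in each state by the elites.
        counts = [[0, 0] for _ in range(N_STATES)]
        for _, visited in episodes[:n_elite]:
            for s, a in visited:
                counts[s][a] += 1
        for s in range(N_STATES):
            n = sum(counts[s])
            if n == 0:
                continue  # no elite visited this state: leave its row alone
            for a in range(2):
                # Smoothed update toward the elite action frequencies.
                policy[s][a] = (alpha * counts[s][a] / n
                                + (1 - alpha) * policy[s][a])
    return policy


policy = cross_entropy_policy_search()
```

In this sketch the row for the goal state is never updated, since episodes end on arrival, and the smoothing keeps every probability strictly between 0 and 1 in exact arithmetic, which preserves some exploration.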
