Experiments with Infinite-Horizon, Policy-Gradient Estimation. JAIR 2001

Connection between stochastic optimization and Q-Learning made in:

Asynchronous Stochastic Approximation and Q-learning. Tsitsiklis. Machine Learning 1994.

Standard stochastic optimization is slow because of constraints on how the update can be performed

The approach here (cross-entropy), on the other hand, is designed to be fast and works differently from standard stochastic optimization

For background on Cross-Ent optimization, see:

A Tutorial on the Cross-Entropy Method. de Boer, Kroese, Mannor, Rubinstein. Annals of Operations Research

The Cross-Entropy Method for Combinatorial and Continuous Optimization. Rubinstein. Methodology and Computing in Applied Probability 1999.

“Good” samples in cross ent are usually referred to as elites

They claim the method is generally robust as long as the sample size is large and the initial samples are spread out. Some additional smoothing of the updates may be used (overly fast convergence seems to be an issue that comes up in some of the literature)
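To make the elite selection and smoothing concrete for myself, here is a minimal sketch of a generic cross-entropy search, not the paper's algorithm. The diagonal-Gaussian sampling distribution, the function name cross_entropy_minimize, the smoothing weight alpha, and the toy objective are all my own assumptions for illustration.

```python
import random
import statistics

def cross_entropy_minimize(f, mean, std, n_samples=200, elite_frac=0.1,
                           alpha=0.7, n_iters=30, seed=0):
    """Minimize f over R^d with a diagonal-Gaussian cross-entropy search (sketch)."""
    rng = random.Random(seed)
    mean, std = list(mean), list(std)
    n_elite = max(2, int(elite_frac * n_samples))
    for _ in range(n_iters):
        # Draw one generation from the current sampling distribution.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(n_samples)]
        # Keep the best-scoring samples: the elites.
        elites = sorted(samples, key=f)[:n_elite]
        for d in range(len(mean)):
            column = [e[d] for e in elites]
            # Smoothed update: move only part of the way toward the elite fit,
            # which slows down the collapse of the sampling distribution.
            mean[d] = alpha * statistics.mean(column) + (1 - alpha) * mean[d]
            std[d] = alpha * statistics.pstdev(column) + (1 - alpha) * std[d]
    return mean, std

def sphere(x):
    # Hypothetical test objective with its minimum at (3, -2).
    return (x[0] - 3.0) ** 2 + (x[1] + 2.0) ** 2

# A wide initial std gives the large, spread-out first generation the notes mention.
best_mean, best_std = cross_entropy_minimize(sphere, mean=[0.0, 0.0], std=[5.0, 5.0])
```

Setting alpha=1.0 removes the smoothing, so each generation is refit to the elites alone; smaller values keep more of the previous distribution.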

A variant is proposed that is more adaptive in its choice of generation size and of how many samples count as elites

In this paper, cross ent is used in a finite MDP to generate a stochastic policy

Here they care about finding a global policy, and they update the policy at every state based on the part of the trajectory that follows a visit to that state
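A rough sketch of how I read the per-state update, under my own assumptions: a toy 5-state chain (not from the paper), a cost of one per step, scoring each state by the length of the trajectory tail after its first visit, and taking the elite action frequencies at that first visit. The paper's exact scoring and elite rule may differ.

```python
import random

N_STATES, GOAL, HORIZON = 5, 4, 30   # hypothetical toy chain, not from the paper
ACTIONS = (-1, +1)                   # action 0 moves left, action 1 moves right

def rollout(policy, rng):
    """Return the (state, action index) pairs of one episode started in state 0."""
    s, steps = 0, []
    while s != GOAL and len(steps) < HORIZON:
        a = 0 if rng.random() < policy[s][0] else 1
        steps.append((s, a))
        s = min(max(s + ACTIONS[a], 0), N_STATES - 1)
    return steps

def ce_policy_search(n_iters=20, n_traj=100, elite_frac=0.2, alpha=0.7, seed=0):
    rng = random.Random(seed)
    policy = [[0.5, 0.5] for _ in range(N_STATES)]   # p(a|s) stored as a matrix
    for _ in range(n_iters):
        per_state = [[] for _ in range(N_STATES)]
        for _ in range(n_traj):
            steps = rollout(policy, rng)
            seen = set()
            for t, (s, a) in enumerate(steps):
                if s not in seen:
                    seen.add(s)
                    # Score for s: cost of the tail that follows the first visit to s.
                    per_state[s].append((len(steps) - t, a))
        for s, records in enumerate(per_state):
            if not records:
                continue   # state never visited in this generation
            records.sort(key=lambda r: r[0])
            elites = records[:max(1, int(elite_frac * len(records)))]
            for a in range(len(ACTIONS)):
                freq = sum(1 for _, act in elites if act == a) / len(elites)
                # Smoothed move of p(a|s) toward the elite action frequencies.
                policy[s][a] = alpha * freq + (1 - alpha) * policy[s][a]
    return policy

policy = ce_policy_search()
```

Each state gets its own elite set, so a trajectory can count as good for one state and bad for another.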

In section 4 they discuss parameterized policies (as opposed to ones that just encode p(a|s) in a matrix)

When cross-entropy is used this way (the usual approach of parameterizing a policy), mu(·|s, theta) doesn’t have to be differentiable w.r.t. theta, unlike in policy gradient algorithms, which require it

The algorithm is tested empirically in a gridworld as well as in inventory control (in this problem, the policy specifies the stock level at which each item is re-ordered)
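A reorder-level rule is a handy example of a policy that is not differentiable in its parameter: the action is a step function of the threshold. The sketch below runs a cross-entropy search over a single reorder level in a made-up one-item inventory model. The demand distribution, the cost constants, the instant refill to CAPACITY, and every function name are my assumptions, not the paper's setup.

```python
import random
import statistics

CAPACITY = 20                             # hypothetical: each order refills to this level
HOLD, PENALTY, ORDER = 1.0, 100.0, 10.0   # hypothetical per-unit and per-order costs

def average_cost(level, days, rng):
    """Simulated average daily cost of the rule 'reorder when stock <= level'."""
    stock, total = CAPACITY, 0.0
    for _ in range(days):
        demand = rng.randint(0, 4)
        sold = min(stock, demand)
        total += PENALTY * (demand - sold)   # lost sales
        stock -= sold
        total += HOLD * stock                # holding cost on leftover stock
        if stock <= level:                   # threshold rule: a step function of level
            total += ORDER
            stock = CAPACITY
    return total / days

def to_level(theta):
    # Map the continuous search parameter to a valid integer reorder level.
    return min(max(int(round(theta)), 0), CAPACITY - 1)

def ce_reorder_level(n_iters=15, n_samples=40, n_elite=8, alpha=0.7, seed=0):
    rng = random.Random(seed)
    mean, std = 10.0, 5.0
    for _ in range(n_iters):
        thetas = [rng.gauss(mean, std) for _ in range(n_samples)]
        # Each candidate is scored by one noisy simulation; no gradient is needed.
        scored = sorted(thetas, key=lambda th: average_cost(to_level(th), 300, rng))
        elites = scored[:n_elite]
        mean = alpha * statistics.mean(elites) + (1 - alpha) * mean
        std = alpha * statistics.pstdev(elites) + (1 - alpha) * std
    return to_level(mean)

level = ce_reorder_level()
```

The search only ever compares simulated costs, so the rounding inside to_level causes no trouble, whereas a gradient with respect to theta would be zero almost everywhere.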

They say that other optimization methods can be used, but many are sensitive to sampling error (by which I think they mean noise). They say that gradient methods, simulated annealing, and guided local search all have this problem