- Referenced in PATH INTEGRAL POLICY IMPROVEMENT WITH COVARIANCE MATRIX ADAPTATION as an early application of cross-entropy to continuous spaces
**Although this paper works in discrete action spaces.**
- The algorithm looks for the best closed-loop policy that can be represented using a given number of basis functions, where a discrete action is assigned to each basis function (the type and number of BFs are specified a priori)
- They compare against the optimization algorithm DIRECT (interesting)
- Has citations saying that value-function approximation generally requires expert design, or leverages basis functions because they have nice theoretical properties (but poor actual performance).
**Attempts have been made to adaptively form basis functions for VF approximation, but this can lead to convergence issues.**
**Also has citations that say for policy search methods, the functions that define the policy are generally either ad hoc or require expert design.**
- Funny that the algorithm can decide where to put the BFs and what action to select from them, but it only selects from a discrete set. It seems trivial to move to continuous actions from here (we may see why that is tough in a minute)
- Their test domains are the double integrator, bicycle, and HIV
- It is compared against LSPI and fuzzy Q, as well as DIRECT optimization of the basis functions
**Actor-critic methods perform both gradient-based policy optimization and value function approximation.**
- Gradient methods also assume that reaching a local optimum is good enough, but in some cases there are many local optima which are not good
- This is particularly problematic when the policy representation is rich (many RBFs) as opposed to frugal (few linear functions)
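The policy representation described in these notes (a set of basis functions, each with an assigned discrete action) can be sketched roughly as follows. This is a hypothetical minimal sketch, not the paper's exact algorithm: the function name `rbf_policy`, the Gaussian BF form, and the highest-activation tie-breaking are all assumptions.

```python
import numpy as np

def rbf_policy(state, centers, widths, actions):
    """Closed-loop policy: each Gaussian basis function has an assigned
    discrete action; return the action of the most activated BF.
    (Hypothetical sketch; BF shape and tie-breaking are assumptions.)
    """
    activations = np.exp(-np.sum(((state - centers) / widths) ** 2, axis=1))
    return actions[np.argmax(activations)]

# Two BFs over a double-integrator-like state [position, velocity].
# A search method like CE would optimize `centers`, `widths`, `actions`.
centers = np.array([[-1.0, 0.0], [1.0, 0.0]])
widths = np.array([[1.0, 1.0], [1.0, 1.0]])
actions = np.array([+1.0, -1.0])  # discrete action set {-1, +1}
u = rbf_policy(np.array([0.8, 0.1]), centers, widths, actions)  # → -1.0
```

This makes concrete why the move to continuous actions is not automatic: the search space changes from a finite assignment of actions to BFs into a joint continuous optimization over BF parameters and action values.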

- There are other cited methods for finding basis functions for VF approximation that they use as inspiration for doing the same for policies
**Convergence of cross-entropy is not guaranteed, although in practice it is generally convergent.**
- For discrete optimization, though, the probability of reaching the optimum can be made arbitrarily close to 1 by using an arbitrarily small smoothing parameter
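To make the role of the smoothing parameter concrete, here is a minimal cross-entropy sketch on a toy continuous problem. This is not the paper's algorithm; the function `cross_entropy_maximize` and all its parameter names are assumptions for illustration.

```python
import numpy as np

def cross_entropy_maximize(score, dim, n_samples=100, n_elite=10,
                           alpha=0.7, n_iters=50, seed=0):
    """Maximize `score` over R^dim with a Gaussian cross-entropy loop.

    `alpha` is the smoothing parameter: the new distribution parameters
    are blended with the old ones rather than fully replaced, which in
    practice stabilizes convergence. (Hypothetical minimal sketch.)
    """
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.full(dim, 2.0)
    for _ in range(n_iters):
        samples = rng.normal(mean, std, size=(n_samples, dim))
        scores = np.array([score(x) for x in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]  # best n_elite
        # Smoothed update: alpha = 1 would be full replacement.
        mean = alpha * elite.mean(axis=0) + (1 - alpha) * mean
        std = alpha * elite.std(axis=0) + (1 - alpha) * std
    return mean

# Usage: maximize -||x - 3||^2; the optimum is x = [3, 3].
best = cross_entropy_maximize(lambda x: -np.sum((x - 3.0) ** 2), dim=2)
```

A smaller `alpha` slows the collapse of the sampling distribution, which is the mechanism behind the near-guaranteed convergence claim for the discrete case.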

- Argue here that the policy is easier to represent than the value function in many cases
- Compared to value function approximators with equally spaced basis functions, CE required fewer BFs, but this is natural. Did they compare to VFAs that use adaptive basis functions (which they cited)?
- Adaptive basis functions allow the method to work better in high-dimensional spaces

- They give a shout-out to extending the algorithm to continuous actions