Ideas From: Reinforcement Learning in Continuous Action Spaces. Van Hasselt, Wiering

Don’t love this paper but I have seen it cited a bit. Perhaps because of the tiny amount of literature on the subject, or maybe I’m missing something.

Describes CACLA, the Continuous Actor-Critic Learning Automaton

Algo is online, model-free, and TD-based
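For reference, the online TD(0) critic update that the algorithm is built on, as a minimal sketch (tabular value function for simplicity; the paper uses a neural net, and all names here are mine):

```python
import numpy as np

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One online TD(0) step: compute the TD error and nudge V[s] toward
    the bootstrapped target r + gamma * V[s_next]."""
    delta = r + gamma * V[s_next] - V[s]  # TD error
    V[s] += alpha * delta                 # move the value estimate
    return delta

V = np.zeros(5)
delta = td_update(V, s=0, r=1.0, s_next=1)  # delta = 1.0, V[0] becomes 0.1
```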

Proposes using Gaussian exploration as opposed to epsilon-greedy

In ACLA (the discrete-action version of the algorithm), the probability of selecting an action increases by a fixed amount if the sign of the TD error is positive; otherwise it is unchanged
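The continuous version keeps this sign-only rule, as I understand it: the actor is nudged toward the explored action only when the TD error is positive, and left alone otherwise. A sketch with a linear actor (my simplification, not the paper's setup):

```python
import numpy as np

def cacla_actor_update(theta, features, action_taken, delta, alpha=0.1):
    """Sign-only actor update: if the TD error is positive, move the
    actor's predicted action toward the action actually taken; a
    non-positive error leaves the actor unchanged."""
    if delta > 0:  # only the sign of the TD error matters
        prediction = theta @ features
        theta = theta + alpha * (action_taken - prediction) * features
    return theta

theta = np.zeros(3)
phi = np.array([1.0, 0.0, 0.0])
theta = cacla_actor_update(theta, phi, action_taken=0.4, delta=0.7)
```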

They claim this is good because it is invariant to scaling of the reward function

They also propose a variant that updates by more when the TD error is large relative to the variance of past errors

Uses neural nets as function approximators (FAs) for the Q function, and then uses wire fitting to interpolate between their outputs. Uses 9 wires.

I have no idea what wire-fitting does

They claim that, due to the nature of the wire-fitting FA, the estimated optimal action is always represented by a wire, so no interpolation is needed

As is typical of model-free/TD systems, learning takes many data points