Ideas From: RL in Continuous Action Spaces. Van Hasselt, Wiering

  • I don’t love this paper, but I have seen it cited a fair bit. Perhaps that is because there is so little literature on the subject, or maybe I’m missing something.
  • Describes CACLA, the Continuous Actor-Critic Learning Automaton
  • The algorithm is online, model-free, and TD-based
  • Proposes using Gaussian exploration as opposed to epsilon-greedy
  • In ACLA (the non-continuous version of the algorithm), the probability of selecting an action increases by a fixed amount if the TD error is positive; otherwise it is unchanged
    • They claim this is good because the update is invariant to the scaling of the reward function
    • They propose potentially updating by more when the error is large relative to the variance of past errors
  • Uses neural nets as function approximators (FAs) for the Q function, and then uses wire fitting to interpolate between the network outputs. Uses 9 wires.
    • I have no idea what wire fitting does
  • They claim that, due to the nature of the wire fitting/FA, the estimated optimal action is always represented by a wire, so no interpolation is needed to find it
  • As is typical of model-free TD methods, learning takes many data points
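
To make the sign-based update and the Gaussian exploration concrete, here is a minimal sketch of how I read the continuous version: the critic does an ordinary TD(0) update, and the actor is moved toward the action that was actually taken only when the TD error is positive. This is not the paper's code. It assumes linear function approximators instead of neural nets, and the names (`cacla_step`, `gaussian_explore`, `alpha`, `beta`, `sigma`) are my own.

```python
import random


def dot(weights, features):
    """Inner product of two equal-length lists."""
    return sum(w * f for w, f in zip(weights, features))


def gaussian_explore(greedy_action, sigma, rng=random):
    """Gaussian exploration: sample around the actor's current output."""
    return rng.gauss(greedy_action, sigma)


def cacla_step(actor_w, critic_w, features, next_features, action, reward,
               gamma=0.95, alpha=0.1, beta=0.1, terminal=False):
    """One online update (hypothetical linear sketch); returns the TD error.

    actor_w and critic_w are lists of weights, modified in place.
    """
    v = dot(critic_w, features)
    v_next = 0.0 if terminal else dot(critic_w, next_features)
    td_error = reward + gamma * v_next - v

    # Critic: standard TD(0) update, proportional to the TD error.
    for i, f in enumerate(features):
        critic_w[i] += alpha * td_error * f

    # Actor: only the sign of the TD error matters. If the action turned
    # out better than expected, move the actor output toward that action.
    if td_error > 0:
        actor_out = dot(actor_w, features)
        for i, f in enumerate(features):
            actor_w[i] += beta * (action - actor_out) * f

    return td_error
```

Because the actor step depends on `action - actor_out` and not on the size of the TD error, multiplying every reward by a positive constant leaves the actor update unchanged whenever the sign of the error is the same, which is the scale-invariance claim from the notes above.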
