Batch Reinforcement Learning in a Complex Domain. Kalyanakrishnan, Stone

Just some quick notes on this one.

  • Compares batch-mode and online learning methods on a RoboCup keepaway task (simulated, it seems)
  • Uses high-level actions such as “pass to player x”, as opposed to low-level ones like “turn y degrees” or “pass to field location z”
  • Continuous state setting; they use a couple of different function approximators (FAs) to represent Q(): ANNs and CMAC
  • In their experimental setting, experience replay (ER, a hack on an online method) with ANNs performed the best, but fitted Q-iteration (FQI) with ANNs and experience replay with CMAC were close
  • Even when the pure online methods were given 10x as much training data, they still performed worse than the above algorithms
  • Mention that some people favor using experience replay on a limited window of history, and that some algorithms may only converge when data is on-policy, which makes the data distribution nonstationary
  • Also, in ER there has been a noted phenomenon of “over-training”, where performance actually degrades when experience is replayed too many times.  It was identified here when the #replays was > 10
  • They note ER is more sensitive to parameter settings than FQI in the setup used
  • ANNs seem to be very common as a FA for Q() even though we have math that tells us this is not a good idea.  That is one reason why I prefer tree-based FAs
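
A rough sketch of what experience replay with a limited history window might look like, to make the “replayed too many times” point above concrete. This is my own minimal version, not the paper’s implementation; names like `ReplayBuffer` and `replay_updates` are made up for illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    """Bounded window of past transitions; oldest experience falls off the end."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # the "limited window of history"

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sample without replacement from whatever history we still hold
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def replay_updates(buffer, q_update, n_replays, batch_size=32):
    """Apply a Q-update callback over several replay passes.

    The paper's "over-training" observation suggests keeping n_replays
    modest (degradation was seen there when #replays > 10).
    """
    for _ in range(n_replays):
        for transition in buffer.sample(batch_size):
            q_update(transition)
```

Nothing here is specific to ANNs or CMAC; `q_update` would wrap whichever FA’s online update rule is in use.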
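
For contrast, a bare-bones sketch of fitted Q-iteration: unlike ER’s incremental updates, FQI repeatedly refits Q from scratch on bootstrapped targets over the whole fixed batch. The structure below is generic (the regressor is a placeholder passed in by the caller, where the paper used ANNs); all names here are my own.

```python
def fitted_q_iteration(transitions, fit_regressor, n_iterations,
                       gamma=0.99, actions=(0, 1)):
    """transitions: list of (s, a, r, s2, done) tuples collected offline.

    fit_regressor(X, y) must return a callable q(s, a); each iteration
    performs one supervised regression on freshly computed targets.
    """
    q = lambda s, a: 0.0  # start from the zero Q-function
    for _ in range(n_iterations):
        X, y = [], []
        for (s, a, r, s2, done) in transitions:
            # Bellman backup target using the previous iteration's Q
            target = r if done else r + gamma * max(q(s2, a2) for a2 in actions)
            X.append((s, a))
            y.append(target)
        q = fit_regressor(X, y)  # full refit, not an incremental update
    return q
```

The full refit each iteration is plausibly why FQI was less sensitive to parameter settings than ER in their setup: there is no learning-rate schedule interacting with replay order.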
