Sample-Based Learning and Search with Permanent and Transient Memories. Silver, Sutton, Müller

  • Paper describes Dyna-2, which builds on Dyna and has a similar architecture.
  • Uses both learning and search.  Learning updates a permanent memory; search updates a transient one.
    • In Go, the most effective learning methods are based on TD (Sarsa) with linear function approximation (FA)
    • The most effective search methods are UCT and its variants
  • Permanent memory is used as an initial guide.  Transient memory lets the agent work out the special cases, such as when a configuration that is usually bad is actually good in the current position
  • Learning occurs over all real experience from games.  Planning starts from the current state, and the simulated information is discarded once planning is done
  • Q() in both memories is a linear FA whose weights are adjusted by Sarsa
    • FA is based on 1 million binary features
    • Uses this because the true state space is so big that it's hopeless to represent exactly
  • Using transient memory alone gave performance comparable to UCT
    • Unclear whether it was actually better than UCT; against other agents it seemed to do worse
  • Data in permanent memory can come from other sources, such as simulations
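
The two-memory idea in the notes above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the class and method names are made up, the update is a plain one-step Sarsa without eligibility traces, and both memories are assumed to share the same binary feature indices. Real experience moves the permanent weights, simulated experience moves only the transient weights, and the transient weights are wiped when planning is over.

```python
import numpy as np


class TwoMemoryValue:
    """Linear action values over sparse binary features, split into a
    permanent and a transient weight vector (all names are hypothetical)."""

    def __init__(self, n_features, alpha=0.1, gamma=1.0):
        self.theta = np.zeros(n_features)      # permanent memory (kept across games)
        self.theta_bar = np.zeros(n_features)  # transient memory (local to the search)
        self.alpha = alpha
        self.gamma = gamma

    def q_permanent(self, active):
        # 'active' lists the distinct indices of features that are on for (s, a).
        return float(self.theta[active].sum())

    def q_combined(self, active):
        # Permanent value plus the transient correction.
        return float(self.theta[active].sum() + self.theta_bar[active].sum())

    def learn(self, active, reward, next_active=None):
        """One-step Sarsa on the permanent weights, from a real transition."""
        target = reward
        if next_active is not None:
            target += self.gamma * self.q_permanent(next_active)
        delta = target - self.q_permanent(active)
        self.theta[active] += self.alpha * delta

    def search(self, active, reward, next_active=None):
        """One-step Sarsa on the transient weights, from a simulated transition."""
        target = reward
        if next_active is not None:
            target += self.gamma * self.q_combined(next_active)
        delta = target - self.q_combined(active)
        self.theta_bar[active] += self.alpha * delta

    def clear_transient(self):
        # Discard everything that was learned during planning.
        self.theta_bar[:] = 0.0


# Toy usage: a pattern that real games say is bad, and a special case
# (the same pattern plus one extra feature) that simulation says is good.
v = TwoMemoryValue(n_features=8, alpha=0.5)
common = [0, 1]
special = [0, 1, 5]
v.learn(common, reward=-1.0)          # terminal real transition
for _ in range(20):
    v.search(special, reward=1.0)     # terminal simulated transitions
print(v.q_permanent(special), v.q_combined(special))
v.clear_transient()
print(v.q_combined(special))
```

In this toy run the permanent memory still rates the pattern as bad, while the combined value for the special case moves toward the simulated outcome until the transient weights are cleared.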
