Reinforcement Learning with Factored States and Actions. Sallans & Hinton. JMLR 2004

  1. Covers value function approximation (VFA) for high-dimensional state and action spaces
  2. Does the approximation via "… negative free energies in an undirected graphical model called a product of experts"
  3. Action selection is done by MCMC
  4. In one experimental result action spaces are 2^40
  5. VFA and action selection are two separate issues discussed here
  6. "Our approach is to borrow techniques from the graphical modeling literature and apply them to the problems of value estimation and action selection."
  7. Looks like it considers POMDPs.
    1. Oh, looks like they use POMDP inference methods to infer state value
  8. In their system, computing the "free energy" (which I think is equivalent to the value) is tractable, while action selection isn't, so MCMC is used
  9. MDPs considered here are very large but finite
  10. Looks like they do VFA according to TD updates
  11. For value approximation use a “particular kind of product of experts, called a restricted Boltzmann machine (…).”
  12. "Boltzmann machines are undirected models. That means that the model satisfies joint probabilities, rather than conditional probabilities."
    1. DBNs, on the other hand, are directed
  13. The reason for using undirected models is that value estimation is tractable there, which isn't the case in directed models (inference is hard)
    1. A directed graph also restricts use to an actor-critic model
  14. In Boltzmann machines there are visible and hidden nodes with symmetric weights pairwise between all vertices
    1. Hidden nodes don’t have a fixed value so the approach is to consider all possible settings of hidden variables
    2. The probability of settings of hidden nodes is according to Boltzmann distribution
    3. Not taking extensive notes on this though
    4. Finding the equilibrium of the Boltzmann machine is done by MCMC.
  15. Boltzmann machines are called products of experts because each hidden node acts as an expert, and the joint distribution is a product of the experts' contributions
  16. Exploitation with the Boltzmann rule, I think, works out simply from the Boltzmann machine itself
  17. The large-action task they mention in the beginning is very smooth, and I'm not sure it has a sequential component
  18. There is also a multi-agent task they test on
