- Covers VFA for high-dimensional state and action spaces
- Does the approximation as “… Negative free energies in an undirected graphical model called a product of experts”
- Action selection is done by MCMC
- In one experimental result the action space has 2^40 actions
- VFA and action selection are two separate issues discussed here
- “Our approach is to borrow techniques from the graphical modeling literature and apply them to the problems of value estimation and action selection.”
- Looks like it considers POMDPs.
- Oh, looks like they use inference methods for POMDPs to infer state value

- In their system, computing the “free energy” (which I think is equivalent to value) is tractable, while action selection isn’t, so MCMC is used
- MDPs considered here are very large but finite
- Looks like they do VFA according to TD updates
- For value approximation use a “particular kind of product of experts, called a restricted Boltzmann machine (…).”
- “Boltzmann machines are undirected models. That means that the model satisfies joint probabilities, rather than conditional probabilities.”
- DBNs, on the other hand, are directed
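
A minimal sketch of the "value = negative free energy" idea with a TD update, to make the bullets above concrete. It uses the standard free energy of an RBM with binary hidden units; the bias terms, the function names, the SARSA-style target, and the choice to update only the weights are my assumptions for illustration, not necessarily the paper's exact parameterization.

```python
import numpy as np

def free_energy(W, b_vis, c_hid, v):
    """Standard RBM free energy F(v) with binary hidden units summed out."""
    pre = c_hid + v @ W  # total input to each hidden unit
    # log(1 + exp(pre)) computed stably via logaddexp
    return -(b_vis @ v) - np.sum(np.logaddexp(0.0, pre))

def q_value(W, b_vis, c_hid, state, action):
    """Approximate Q(s, a) as the negative free energy of the visible vector."""
    v = np.concatenate([state, action])
    return -free_energy(W, b_vis, c_hid, v)

def sarsa_update(W, b_vis, c_hid, s, a, r, s_next, a_next,
                 alpha=0.01, gamma=0.9):
    """One SARSA-style TD step on the weights only (updates W in place)."""
    q_now = q_value(W, b_vis, c_hid, s, a)
    q_next = q_value(W, b_vis, c_hid, s_next, a_next)
    td_error = r + gamma * q_next - q_now
    v = np.concatenate([s, a])
    # Gradient of Q w.r.t. W[i, j] is v[i] * P(h_j = 1 | v)
    h_prob = 1.0 / (1.0 + np.exp(-(c_hid + v @ W)))
    W += alpha * td_error * np.outer(v, h_prob)
    return td_error
```

Computing `q_value` costs one matrix-vector product, which is why value estimation is the tractable part; nothing here enumerates actions.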

- The reason for using undirected models is that value estimation is then tractable, which isn’t the case for directed models (where inference is hard)
- A directed graph also restricts use to an actor-critic model

- In Boltzmann machines there are visible and hidden nodes with symmetric weights pairwise between all vertices
- Hidden nodes don’t have a fixed value so the approach is to consider all possible settings of hidden variables
- The probability of settings of hidden nodes is according to Boltzmann distribution
- Not taking extensive notes on this though
- Finding the equilibrium of the Boltzmann machine is done by MCMC.
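
A rough sketch of how MCMC action selection could look for an RBM with the state clamped: block Gibbs sampling alternates between the hidden units and the action units. The split of the weights into `W_s` and `W_a`, the absence of biases, and the fixed number of sweeps are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_action(W_s, W_a, state, n_sweeps=50, temperature=1.0, rng=None):
    """Sample binary action bits by block Gibbs sampling with the state clamped.

    W_s: (n_state_bits, n_hidden) weights, W_a: (n_action_bits, n_hidden) weights.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_action_bits = W_a.shape[0]
    action = rng.integers(0, 2, size=n_action_bits).astype(float)
    state_input = state @ W_s  # contribution of the clamped state, computed once
    for _ in range(n_sweeps):
        # Hidden units are conditionally independent given state and action
        h_prob = sigmoid((state_input + action @ W_a) / temperature)
        hidden = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Action bits are conditionally independent given the hidden units
        a_prob = sigmoid((W_a @ hidden) / temperature)
        action = (rng.random(a_prob.shape) < a_prob).astype(float)
    return action
```

At `temperature=1.0` the chain's stationary distribution over actions is proportional to exp(-F(s, a)), so low free energy (high value) actions are sampled more often, without ever enumerating the full action space.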

- Boltzmann machines are called a product of experts because each hidden node acts as an expert, and the model combines the experts by multiplying their contributions
- Exploration with the Boltzmann rule, I think, works out simply from the Boltzmann machine itself
- The large action task they mentioned in the beginning is very smooth, and I’m not sure it has a sequential component
- There is another multiagent task they test on
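
The product-of-experts remark above can be checked numerically on a tiny RBM: summing exp(-energy) over every setting of the hidden units gives the same number as a product with one factor per hidden unit (per "expert"). The energy function below is the standard RBM energy without visible biases; the function names are mine, and the sizes are kept small so brute-force enumeration is feasible.

```python
import itertools
import numpy as np

def energy(W, c_hid, v, h):
    """Standard RBM energy for visible vector v and hidden vector h."""
    return -(v @ W @ h) - c_hid @ h

def unnormalized_prob_by_enumeration(W, c_hid, v):
    """Sum exp(-E(v, h)) over all 2^n_hidden hidden configurations."""
    n_hid = W.shape[1]
    return sum(np.exp(-energy(W, c_hid, v, np.array(h, dtype=float)))
               for h in itertools.product([0, 1], repeat=n_hid))

def unnormalized_prob_as_product(W, c_hid, v):
    """Same quantity as a product with one factor per hidden unit."""
    return np.prod(1.0 + np.exp(c_hid + v @ W))

# Small random example: 4 visible units, 3 hidden units
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))
c_hid = rng.normal(size=3)
v = np.array([1.0, 0.0, 1.0, 1.0])
print(np.isclose(unnormalized_prob_by_enumeration(W, c_hid, v),
                 unnormalized_prob_as_product(W, c_hid, v)))
```

The negative log of either quantity is the free energy, which is why "considering all possible settings of hidden variables" stays cheap in the restricted case.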