RL algorithm for POMDP Problems. Jaakkola, Singh, Jordan

  • Clearly, for POMDPs
  • Converges to local optima
  • Allows for stochastic policies (can be necessary for POMDPs)
  • Uses notation that isn’t defined anywhere, so the paper isn’t really readable.
  • Uses a discount factor that changes during the calculation of the value function and converges to 1?  I’m not getting it.
  • Seems to function much like policy iteration, except that the stochastic policy only moves ε towards the estimated optimal action at each state, instead of locking exactly onto that action
  • I don’t really see how this approach removes the problem of non-stochastic policies breaking in POMDPs: if you let it run for enough iterations of policy improvement, it seems like it will give you something very close to the pure policy that regular policy iteration would give you anyway.
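To make the ε-step improvement concrete, here is a minimal sketch of what that update might look like. This is my own illustration, not the paper's notation or code: `pi`, `q`, and `improve_policy` are hypothetical names, and I'm assuming the policy is a per-observation distribution over actions that gets shifted a fraction ε toward the greedy action.

```python
# Hypothetical sketch (not the paper's code): move a stochastic policy a
# fraction epsilon toward the estimated best action at each observation,
# instead of locking onto it as plain policy iteration would.

def improve_policy(pi, q, epsilon=0.1):
    """pi: {obs: {action: prob}}; q: {obs: {action: estimated value}}."""
    new_pi = {}
    for obs, action_probs in pi.items():
        greedy = max(q[obs], key=q[obs].get)  # estimated optimal action at obs
        # Shrink all probabilities by (1 - epsilon), then give the freed
        # epsilon mass to the greedy action; the result still sums to 1.
        new_pi[obs] = {
            a: (1 - epsilon) * p + (epsilon if a == greedy else 0.0)
            for a, p in action_probs.items()
        }
    return new_pi

pi = {"o1": {"left": 0.5, "right": 0.5}}
q = {"o1": {"left": 0.0, "right": 1.0}}
pi = improve_policy(pi, q, epsilon=0.2)
```

Note that this sketch also illustrates my complaint above: iterating this update keeps shifting mass toward the greedy action, so in the limit it approaches the same deterministic policy that ordinary policy iteration would produce.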
