RL algorithm for POMDP Problems. Jaakkola, Singh, Jordan

  • Clearly, for POMDPs
  • Converges to local optima
  • Allows for stochastic policies (can be necessary for POMDPs)
  • Uses notation that isn’t defined anywhere, so the paper isn’t really readable.
  • Uses a discount factor that changes during the calculation of the value function and converges to 1?  I’m not getting it.
  • Seems to function much like policy iteration, except that the stochastic policy only moves ε towards the estimated optimal action at each state, instead of locking exactly onto that action
  • I don’t really see how this approach removes the problem of non-stochastic policies breaking in POMDPs: if you let it run for enough iterations of policy improvement, it seems like it will give you something very close to the pure policy that regular policy iteration would give you anyway.
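To make the ε-step improvement concrete, here is a minimal sketch of what that update might look like. This is my own illustration, not the paper's notation or code: `pi`, `q`, and `improve_policy` are hypothetical names, and I'm assuming the policy is a per-observation distribution over actions that gets shifted a fraction ε toward the greedy action.

```python
# Hypothetical sketch (not the paper's code): move a stochastic policy a
# fraction epsilon toward the estimated best action at each observation,
# instead of locking onto it as plain policy iteration would.

def improve_policy(pi, q, epsilon=0.1):
    """pi: {obs: {action: prob}}; q: {obs: {action: estimated value}}."""
    new_pi = {}
    for obs, action_probs in pi.items():
        greedy = max(q[obs], key=q[obs].get)  # estimated optimal action at obs
        # Shrink all probabilities by (1 - epsilon), then give the freed
        # epsilon mass to the greedy action; the result still sums to 1.
        new_pi[obs] = {
            a: (1 - epsilon) * p + (epsilon if a == greedy else 0.0)
            for a, p in action_probs.items()
        }
    return new_pi

pi = {"o1": {"left": 0.5, "right": 0.5}}
q = {"o1": {"left": 0.0, "right": 1.0}}
pi = improve_policy(pi, q, epsilon=0.2)
```

Note that this sketch also illustrates my complaint above: iterating this update keeps shifting mass toward the greedy action, so in the limit it approaches the same deterministic policy that ordinary policy iteration would produce.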
