Hillcar with qPI and Weka

So I used my fitted Q-iteration code to build a policy for hillcar, which is not optimal.  When run, it took about 250 steps to reach the goal (in this particular implementation, optimal seems to be just under 100 steps).  For the experiments where the data was collected, noise was added to the position (normally distributed with μ=0, σ=0.01), because otherwise everything would be deterministic and all the trajectories that followed the policy alone would be identical.
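As a concrete sketch of what the noise injection amounts to (the function and constant names here are illustrative, not the actual simulator code):

```python
import random

POSITION_NOISE_SIGMA = 0.01  # the sigma used in the experiments above

def noisy_position(position):
    """Return the position perturbed by N(0, 0.01) noise, so trajectories
    that follow the fixed policy alone are no longer all identical."""
    return position + random.gauss(0.0, POSITION_NOISE_SIGMA)
```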

For the experiments with pi_e (for exploration), pi_e was set to repeat the last action with probability 0.97, and otherwise to switch actions (bang-bang control was used).  pi_e was used only at the beginning of a trajectory, for a number of steps uniformly distributed between 20 and 100, and that data was logged as well.

50 trajectories were recorded in each setting, where every trajectory lasted from the start until the terminal state was reached.
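A minimal sketch of that exploration scheme (the names here are my own, not from the actual code):

```python
import random

ACTIONS = (-1, +1)  # bang-bang control: push left or push right

def pi_e_steps():
    """Length of the exploratory prefix: uniform between 20 and 100 steps."""
    return random.randint(20, 100)

def pi_e(last_action, repeat_prob=0.97):
    """Repeat the previous action with probability 0.97, otherwise switch."""
    if random.random() < repeat_prob:
        return last_action
    return -last_action
```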

Hillcar without pi_e

Hillcar with pi_e

The coloring indicates the action taken: blue corresponds to a left movement, red to the right.

ARFFs are available: Hillcar without pi_e and Hillcar with pi_e.

I think overall the data actually looks pretty similar, except there is more of a blue swirl in the region immediately to the left of center in the data that uses pi_e.  Maybe I should have left pi_e on longer.  You can see part of the suboptimality of the policy is that it gets hung up going left at the leftmost end of the valley.

After that, I wrote a fitted Q-iteration algorithm using Weka.  The entire program is just 150 lines, and took just an afternoon to write.  In particular, I used their artificial neural nets.  Since a neural net is not an averager, it isn’t safe to use in algorithms like Q-iteration, and in this instance the value function seems to be doubling every iteration, so clearly something is wrong.  At any rate, this was meant more as a proof of concept than anything, and it shows that leveraging existing supervised learning frameworks can be a win.  Aside from that, Weka’s visualization tools can be very valuable for quickly inspecting data and results.  Here’s what the results of Q-iteration look like for hillcar:

Hillcar value function, blue is low value, yellow is high value
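For reference, the core loop of fitted Q-iteration can be sketched generically like this (the regressor is pluggable; I used Weka’s neural nets, but any `fit` that returns a learned function over state-action pairs would slot in, and all names here are illustrative):

```python
def fitted_q_iteration(transitions, actions, fit, gamma=0.99, iterations=50):
    """Generic fitted Q-iteration over a batch of (s, a, r, s_next, done)
    tuples.  `fit(inputs, targets)` is any supervised regressor that
    returns a learned function q(s, a)."""
    q = lambda s, a: 0.0  # start from the all-zero Q-function
    for _ in range(iterations):
        inputs, targets = [], []
        for (s, a, r, s_next, done) in transitions:
            # Bellman backup: immediate reward plus discounted best next value
            backup = r if done else r + gamma * max(q(s_next, b) for b in actions)
            inputs.append((s, a))
            targets.append(backup)
        q = fit(inputs, targets)  # regress Q onto the backed-up targets
    return q
```

Because the backup takes a max over the regressor’s own predictions, a non-averaging regressor like a neural net can let approximation errors compound from one iteration to the next, which is consistent with the doubling values mentioned above.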

So there it is.  I hope to have data from Acrobot sometime during the next week.

2 thoughts on “Hillcar with qPI and Weka”

  1. Michael Littman says:

    Thanks! Sorry I hadn’t looked at this data sooner. It’s been awhile, but I think what we were going for here was:

    * Without pi_e, a learning algorithm can fit the data, but the resulting value function is not good for policy improvement.

    * With pi_e, the data is sufficient for policy improvement.

    A similar result is in one of Parr and Koller’s papers, I think.

    I can’t tell from the plots whether they demonstrate this effect, the opposite, or something separate.

    I suspect I’m not making it clear (it’s not entirely clear to me either), so we might want to arrange for a call with Ron.

    It would be great if we could submit a paper on this stuff as well as the great learning stuff you’re doing.

  2. hut3 says:

    I see your point – I think in this case the policy learned may have “overfitted” the training data, so the policies were quite brittle. The reason I think this is the case is that simply running pi with the addition of noise led to a great deal of variance in the number of steps needed to reach the goal. I would think that a more “robust” policy would be less impacted by a small amount of stochasticity.

    But adding noise and adding pi_e are sort of analogous, and both probably help policy improvement.

    Setting up a conference call would be great; I think I would like to be more solid on the WHOOT stuff first, though, since time is in short supply right now with all the coursework. I think there was some relevant deadline in February? I think I could hit that.
