So I used my fitted Q-iteration code to build a policy for hillcar that is not optimal: when run, it took about 250 steps to reach the goal (in this particular implementation, optimal seems to be just under 100 steps). For the experiments where the data was collected, noise was added to the position (normally distributed with μ=0, σ=0.01); otherwise everything would be deterministic, and all the trajectories that followed the policy alone would be identical.
For the experiments with pi_e (an exploration policy), pi_e repeated the last action with probability 0.97 and otherwise switched actions (bang-bang control was used). pi_e was used only at the beginning of a trajectory, for a number of steps uniformly distributed between 20 and 100, and that data was logged as well.
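To make the description of pi_e concrete, here is a minimal sketch in Python (not the code I actually ran). The action encoding (-1 for left, +1 for right), the random choice of the first action, the inclusive endpoints of the 20 to 100 range, and the function names are all assumptions made for illustration.

```python
import random

LEFT, RIGHT = -1, 1  # assumed encoding of the two bang-bang actions


def make_pi_e(rng, repeat_prob=0.97, min_steps=20, max_steps=100):
    """Build one trajectory's exploration policy.

    Returns (n_steps, next_action): how many steps pi_e stays in control,
    and a function that produces the next exploratory action.
    """
    # Number of exploratory steps, uniform on [min_steps, max_steps]
    # (inclusive endpoints are an assumption).
    n_steps = rng.randint(min_steps, max_steps)
    # The first action is not specified in the post; pick it at random.
    state = {"last": rng.choice((LEFT, RIGHT))}

    def next_action():
        # Repeat the last action with probability repeat_prob, else switch.
        if rng.random() >= repeat_prob:
            state["last"] = -state["last"]
        return state["last"]

    return n_steps, next_action


def exploration_prefix(rng):
    """Return the list of actions pi_e would emit at the start of a trajectory."""
    n_steps, next_action = make_pi_e(rng)
    return [next_action() for _ in range(n_steps)]
```

With a repeat probability of 0.97, the expected run of identical actions is about 33 steps, so a prefix of 20 to 100 steps usually contains only a handful of switches.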
Fifty trajectories were recorded in each setting, each running from the start until the terminal state was reached.
Hillcar without Pi_e
Hillcar with Pi_e
The coloring shows the action taken: blue corresponds to a left movement, red to a right movement.
ARFFs are available: Hillcar without pi_e and Hillcar with pi_e.
I think overall the data actually looks pretty similar, except there is more of a blue swirl in the region immediately to the left of center in the data that uses pi_e. Maybe I should have left pi_e on longer. You can see that part of the suboptimality of the policy is that it gets hung up going left at the leftmost end of the valley.
After that, I wrote a fitted Q-iteration algorithm using Weka. The entire program is just 150 lines, and took just an afternoon to write. In particular, I used Weka’s artificial neural nets. Since a neural net is not an averager, it isn’t safe to use in algorithms like Q-iteration (convergence isn’t guaranteed), and in this instance the value function seems to be doubling every iteration, so clearly something is wrong. At any rate, this was meant more as a proof of concept than anything, and it shows that leveraging existing supervised learning frameworks can be a win. Aside from that, Weka’s visualization tools can be very valuable for quickly inspecting data and results. Here’s what the results of Q-iteration look like for hillcar:
Hillcar value function, blue is low value, yellow is high value
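For reference, the general shape of fitted Q-iteration over logged transitions can be sketched as below. This is not the Weka program described above: it is a NumPy-only illustration that swaps the neural net for a k-nearest-neighbor averager, and the function names, the array layout, the discount factor, and the reward convention are all assumptions. The point of using an averager here is that each prediction is a mean of existing targets, so with a discount below 1 the targets cannot blow up from one iteration to the next.

```python
import numpy as np


def knn_fit_predict(X_train, y_train, X_query, k=3):
    """k-nearest-neighbor averager: each prediction is a mean of training targets."""
    # Squared Euclidean distance from every query row to every training row.
    d = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    idx = np.argsort(d, axis=1)[:, :k]
    return y_train[idx].mean(axis=1)


def fitted_q_iteration(S, A, R, S2, done, actions, gamma=0.99, n_iters=50, k=3):
    """Fitted Q-iteration on logged transitions (s, a, r, s', done).

    S, S2: (n, d) arrays of states and next states; A: (n,) actions;
    R: (n,) rewards; done: (n,) booleans marking terminal transitions.
    Returns the regression inputs X = [s, a] and the final targets y.
    """
    X = np.column_stack([S, A])
    y = R.astype(float).copy()  # iteration 0: Q equals the immediate reward
    for _ in range(n_iters):
        # Estimate Q(s', a') for every candidate action from the current targets.
        q_next = np.column_stack([
            knn_fit_predict(X, y, np.column_stack([S2, np.full(len(S2), a)]), k)
            for a in actions
        ])
        # Bellman backup; terminal transitions do not bootstrap.
        y = R + gamma * np.where(done, 0.0, q_next.max(axis=1))
    return X, y
```

A quick sanity check against the doubling problem is to track the largest absolute target after each iteration: if rewards are bounded by r_max in magnitude, the targets produced by this averager-based version stay within r_max / (1 - gamma).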
So there it is. I hope to have data from Acrobot sometime during the next week.