In the last post, HOOT (HOO applied to trees) was implemented; it is an algorithm designed for discrete-state, continuous-action MDPs. The next extension to HOOT follows from the fact that the statistics computed in each node are essentially counts and means, so sample weighting can be leveraged easily. We decided to call this extended algorithm “weighted HOOT,” or WHOOT.
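To make the "counts and means" point concrete, here is a minimal sketch of what the per-node statistics might look like with weighting: the integer visit count becomes a weight sum, and the mean becomes a weighted mean updated incrementally. The class and attribute names are mine, not from the actual implementation.

```python
class WeightedNodeStats:
    """Hypothetical per-node statistics for WHOOT: a weighted count
    and a weighted running mean of observed returns."""

    def __init__(self):
        self.weight_sum = 0.0  # replaces the integer visit count
        self.mean = 0.0        # weighted mean of observed returns

    def update(self, value, weight=1.0):
        """Fold one sample with the given weight into the statistics."""
        self.weight_sum += weight
        # Standard incremental update of a weighted mean.
        self.mean += (weight / self.weight_sum) * (value - self.mean)
```

With unit weights this reduces to the ordinary count-and-mean bookkeeping HOOT already does, which is why the extension is cheap.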

To test this algorithm, I implemented a “plane world,” where the state is a 2-dimensional real vector (the (x, y) location) and the action is a 1-dimensional real (the angle of movement). At each step, the agent moves one unit length in the specified angle, and motion is clipped at the borders of the domain. For WHOOT, weighting was done with RBFs over the state space.
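The dynamics above can be sketched in a few lines; this is my own illustrative version of the step function and of a Gaussian RBF similarity for weighting samples by state, with the domain size and bandwidth as assumed parameters.

```python
import math

SIZE = 8.0  # assumed 8x8 domain, matching the experiment description

def step(x, y, angle):
    """Move one unit in the given angle, clipping at the domain borders."""
    nx = min(max(x + math.cos(angle), 0.0), SIZE)
    ny = min(max(y + math.sin(angle), 0.0), SIZE)
    return nx, ny

def rbf_weight(sx, sy, qx, qy, bandwidth=1.0):
    """Gaussian RBF similarity between a stored sample's state (sx, sy)
    and a query state (qx, qy); used to weight that sample in WHOOT."""
    d2 = (sx - qx) ** 2 + (sy - qy) ** 2
    return math.exp(-d2 / (2.0 * bandwidth ** 2))
```

A sample taken exactly at the query state gets weight 1, and the weight falls off smoothly with distance, so nearby experience dominates the node statistics.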

I was running WHOOT in this domain, but it was a bit tricky to tune the HOO parameters correctly; the classic exploration/exploitation tradeoff takes place. With parameters set too small, the algorithm will sometimes find a great policy quickly, but it may instead quickly settle on a very poor sub-optimal policy. Setting the parameters larger of course means more exploration, so policy convergence is slow. In my experience it is pretty difficult to find the “sweet spot” for the parameters, and it’s safer to leave the algorithm exploration-biased than exploitation-biased.

MichaelK was looking at the performance and suggested that it may be a good idea to use separate computations for moving in the simulated world as opposed to the “real world”: more specifically, to use the formula originally from HOO (the sample mean plus bias terms, which are derived from the counts and the algorithm’s parameters and which drive exploration) when choosing actions during the simulated rollouts, but to use only the sample means when choosing which action to take in the real world. Right now we are calling this “Greedy WHOOT.” I compared WHOOT, Greedy-WHOOT, random, and optimal policies in the plane-world environment:

Plots here are in the same style as in previous posts: one 200-step rollout is done from each state before an actual action is chosen. The domain is an 8×8 plane world with unit-length movement; the goal was at (6, 6), and the agent had to get within 1.5 units of the goal, at which point penalties ended in a terminal state. The HOO parameters used were the same for Greedy-WHOOT and WHOOT.

As you can see, WHOOT does about as well by the end of its first trajectory as it does at any later point; its performance is not bad compared to the random agent, but it shows no improvement and is quite noisy (due to the high degree of exploration being done). Greedy-WHOOT, on the other hand, is doing very well (just 2 steps off optimal) by the 10th run, and virtually all runs after that point are optimal or near-optimal.

My interpretation of this is that since the algorithm reaches very good greedy performance quickly, it is probably also settling on a very good estimate of the Q-function quickly. This means the performance metric Ali suggested, a plot of the error between the estimated value and the optimal value, would probably work quite well for this algorithm.

Although this modification pushes the algorithm further toward the planning side than the learning side, allowing the algorithm to distinguish between planning and learning does seem to have significant performance benefits. It’s something we’ll need to look into going forward, but it seems that to sell it with its best performance, it should be presented as a planning algorithm. Chris and I have been talking a bunch, so there are plenty of other ideas going around that we can discuss when we meet up next. We’ve also been doing a bunch of coding and have implemented the test domains (Chris is already working on binary action search), so things seem to be moving along nicely.

I can’t quite tell the two colors apart in the graph. Since the last meeting, have you changed how you are running these tests? What’s the status?

The general way the tests are run is the same; the parameters have changed slightly to make the domain smaller, since the continuous state/action problem is quite a bit harder (or perhaps incomparable).

Aside from the parameter changes, the only difference from last meeting is that I show the greedy performance *and* the performance of the algorithm that keeps using its regret-style action selection.

I can tell the colors apart when I zoom in, but my kids agree that the shades are pretty similar.