Followup Robocode experiment


Following the relative success of the previous experiment, which used derived features, separate decision trees for each advisor, and a limited state representation, I wanted to try the experiment again, this time replacing the derived features (distance from corner, distance from wall) with just the x, y coordinates.  I had tried this variant before, without the separate decision trees, and while there were some interesting trends in the policy it found, the overall performance wasn't great (about a 25% win rate).  The approach used in this experiment yielded the best performance so far, slightly better than the immediately previous experiment.
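
For context, the per-advisor policies below are printed in what looks like Weka's J48 tree output.  Assuming that is indeed the learner (the post doesn't name it, so treat this as a guess), a minimal sketch of training such a tree on (x, y) -> action instances could look roughly like the following; every class, dataset, and attribute name here is illustrative rather than taken from the actual experiment code:

import java.util.ArrayList;
import java.util.Arrays;
import weka.classifiers.trees.J48;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public class PolicyTreeSketch {
    public static void main(String[] args) throws Exception {
        // Two numeric features (the raw coordinates) plus a nominal action class.
        ArrayList<Attribute> attrs = new ArrayList<>();
        attrs.add(new Attribute("X"));
        attrs.add(new Attribute("Y"));
        attrs.add(new Attribute("action", Arrays.asList("up", "down")));

        Instances data = new Instances("trackfire-advisor", attrs, 0);
        data.setClassIndex(data.numAttributes() - 1);

        // Two example instances; in practice these would be the (x, y, action)
        // examples logged over many battles.
        data.add(new DenseInstance(1.0, new double[] {
            733.0, 40.0, data.attribute("action").indexOfValue("down")
        }));
        data.add(new DenseInstance(1.0, new double[] {
            750.0, 40.0, data.attribute("action").indexOfValue("up")
        }));

        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree); // prints a tree in the format shown below
    }
}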

In this case, the policy ended up being pretty simple, so as in the last post, I’ll include the policy here:

SpinBot:

: down (0.7913)

TrackFire:

Y <= 35.998488
|   X <= 764.478481
|   |   X <= 733.191059: down (18203.0/7037.0)
|   |   X > 733.191059: up (602.0/255.0)
|   X > 764.478481: down (22431.0/9031.0)
Y > 35.998488
|   Y <= 53.926373
|   |   X <= 763.040059: up (9718.0/3662.0)
|   |   X > 763.040059: down (3806.0/1424.0)
|   Y > 53.926373
|   |   Y <= 539.757751
|   |   |   X <= 35.987964: down (11186.0/4284.0)
|   |   |   X > 35.987964
|   |   |   |   X <= 53.9907: up (8506.0/3312.0)
|   |   |   |   X > 53.9907: down (10950.0/2963.0)
|   |   Y > 539.757751
|   |   |   Y <= 563.967975
|   |   |   |   Y <= 546
|   |   |   |   |   X <= 693.230218: down (1643.0/693.0)
|   |   |   |   |   X > 693.230218: up (411.0/164.0)
|   |   |   |   Y > 546
|   |   |   |   |   X <= 308.951123
|   |   |   |   |   |   X <= 31.101764: down (494.0/184.0)
|   |   |   |   |   |   X > 31.101764: up (3798.0/1692.0)
|   |   |   |   |   X > 308.951123: up (7548.0/2576.0)
|   |   |   Y > 563.967975: down (14759.0/5700.0)

Walls:

: down (0.7342)
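
To make the TrackFire tree a little easier to read, here is the same policy written out as plain conditionals.  This is purely a transcription for readability; trackFireAction is an illustrative name, not something from the actual agent:

static String trackFireAction(double x, double y) {
    if (y <= 35.998488) {
        if (x <= 764.478481) {
            return x <= 733.191059 ? "down" : "up";
        }
        return "down";
    }
    if (y <= 53.926373) {
        return x <= 763.040059 ? "up" : "down";
    }
    if (y <= 539.757751) {
        if (x <= 35.987964) {
            return "down";
        }
        return x <= 53.9907 ? "up" : "down";
    }
    if (y <= 563.967975) {
        if (y <= 546) {
            return x <= 693.230218 ? "down" : "up";
        }
        if (x <= 308.951123) {
            return x <= 31.101764 ? "down" : "up";
        }
        return "up";
    }
    return "down";
}

Notice that most of the split points sit within a body length or two of the walls, which fits the wall-oriented behavior discussed below.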

Most of my comments on this approach are the same as for the previous experiment (where distance to wall and distance to corner were used).  The behavior over epochs was quite similar, except that convergence of this policy took longer and ended with a better result.  The one interesting difference is that the agent again missed a sweet spot along an entire wall, and it didn't seem like it was ever going to find it in this case (perhaps due to a very traumatic set of runs with TrackFire along the right wall).  My hunch is that while this approach works relatively well, it does too much exploitation and not enough exploration, which could probably be improved with a simple epsilon-greedy approach, as sketched below.
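
This is the kind of epsilon-greedy wrapper I have in mind; it's a generic sketch rather than something wired into the agent, and the class and method names are made up:

import java.util.Random;

public class EpsilonGreedy {
    private final Random rng = new Random();
    private final double epsilon; // e.g. 0.1: explore 10% of the time

    public EpsilonGreedy(double epsilon) {
        this.epsilon = epsilon;
    }

    // greedy is whatever the learned tree recommends for the current state;
    // with probability epsilon we ignore it and try a random action instead.
    public String choose(String[] actions, String greedy) {
        if (rng.nextDouble() < epsilon) {
            return actions[rng.nextInt(actions.length)];
        }
        return greedy;
    }
}

Even a small epsilon would occasionally push the agent into regions like the missed wall, giving the trees a chance to learn that the sweet spot is there.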

Even with that deficiency, performance was better than with any previous method, and while the agent doesn't consistently beat WallBot, looking at the numbers it is clear that the learned policy is superior.

Update: video of policy in action
