For this experiment, I used WEKA’s implementation of Gaussian Processes to evaluate its performance in the Robocode domain. Initially, I was hopeful that performance would be good, since Gaussian Processes would make it possible to model the mechanics of the game more accurately. Specifically, I could now model exactly how much energy was lost when the agent was hit, and exactly how much damage was done when a shot hit the enemy (since Robocode allows for shots of varying power). In practice, performance was poor. When including the impact of missed shots (energy lost when shooting that is not regained), performance was, surprisingly, even worse (but I won’t say more about that experiment here, as the results were similar enough).
A large issue is the cost (in terms of both time and memory) of the GP algorithm. For these experiments, I used WEKA’s implementation, which is most likely a pretty good one. Even still, on my machine (2 GB RAM, 2 cores), it could only handle a maximum of 2,500 samples before running out of memory, and training took about 15 minutes between epochs. The J48 decision tree algorithm used for the other experiments, on the other hand, could easily handle tens of thousands of sample points.
Since the GP algorithm cannot handle a large number of sample points (I suppose it’s the O(n³) cost of the matrix operations), I had to shorten the epochs used in these experiments. In the experiments with the J48 algorithm, the epochs were 1,000 rounds each. Here, I had to reduce that to 100 rounds, because 1,000 rounds would produce more sample points than could even be used, which would mean that much of the data gathered after the first epoch contributed nothing to learning.
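To put that cost in perspective, here is a back-of-envelope sketch of why exact GP training hits a wall at a few thousand samples: the n × n kernel matrix takes O(n²) memory, and factorizing it (e.g. via Cholesky) takes O(n³) time. The double-precision dense storage and the flop count for Cholesky are standard assumptions, not measurements of WEKA’s internals.

```python
# Back-of-envelope cost of exact GP training: the n x n kernel matrix
# must be stored (O(n^2) memory) and factorized (O(n^3) time).
def gp_cost(n, bytes_per_entry=8):
    mem_bytes = n * n * bytes_per_entry  # dense double-precision kernel matrix
    flops = n ** 3 / 3                   # approximate Cholesky flop count
    return mem_bytes, flops

for n in (2_500, 25_000):
    mem, flops = gp_cost(n)
    print(f"n={n:>6}: kernel matrix ~{mem / 1e6:.0f} MB, ~{flops:.1e} flops")
```

For n = 2,500 the kernel matrix alone is about 50 MB, which is already uncomfortable once WEKA’s intermediate copies are added on a 2 GB machine; at the tens of thousands of points J48 handles, it would be on the order of 5 GB before any factorization work.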
I ran for a total of 7 epochs, because after that most of the data gathered couldn’t be used anyway (more than 2,500 sample points per advisor). When running the experiments, if there were more than 2,500 samples for an advisor, I randomly discarded samples until only 2,500 remained.
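The subsampling step can be sketched as follows; the cap of 2,500 comes from the text above, while the fixed seed is an assumption added for reproducibility.

```python
import random

def cap_samples(samples, cap=2_500, seed=0):
    """Randomly discard samples until at most `cap` remain."""
    if len(samples) <= cap:
        return list(samples)
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    return rng.sample(samples, cap)    # uniform sample without replacement

pool = list(range(4_000))              # e.g. 4,000 samples for one advisor
print(len(cap_samples(pool)))          # capped at 2,500
```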
Below are the graphs of the values WEKA’s GP algorithm found, as a function of x, y coordinates and advisor. The values shown are those learned from the training data at the final epoch.
(Click on the images for full-resolution versions. Note that the range of values on each graph is different, even though the color range is the same for all graphs.)
Using these values, the following policy is found:
As you can see, the policy found here looks somewhat “right,” but has some flaws. The general characteristic, using WallBot in the middle of the stage and then transitioning to TrackFire closer to the walls, is good. There are, however, some problems which turn out to be quite significant. Firstly, there is a relatively large area where SpinBot is the preferred advisor, and in this domain SpinBot should not be used at all. Secondly, the TrackFire sweet spot is a very thin linear band along each wall, but the regions where GP prefers TrackFire are somewhat curved, and tend to bulge in the corners. This leads to the policy generally transitioning either to the wrong advisor (SpinBot), or transitioning to the “correct” advisor (TrackFire) but doing so too early (outside of the sweet spot), which leads to very poor performance.
Basically, it was unable to find the sweet spot.
A plot of the performance metrics over epochs is below:
What seems to me to be the problem with this approach is its “fuzziness.” Based on the coloring of the value plots, it is clear there is a slow transition of values between nearby points; nearby points have similar values. The problem in this domain is that points which are very close in terms of their Cartesian coordinates can actually have very different characteristics. Also, the boundaries found here should be linear, but GP finds boundaries which are curved, which is again a manifestation of the assumption that nearby points have similar values (and is perhaps also evident in the “bulging” of TrackFire in the corners, as TrackFire goodness from both walls combines there).
Specifically, the difference between being inside or outside of the sweet spot is literally a few pixels. If “goodness” from a location in the sweet spot bleeds outside of it, it causes the agent to transition from, say, WallBot to TrackFire too far away from the wall.
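This “bleeding” falls directly out of the kernel. A toy illustration with a squared-exponential (RBF) kernel, the usual default for GP regression (the 50-pixel length-scale here is an invented value for illustration, not WEKA’s setting): to the kernel, a point 5 pixels outside the sweet spot is almost indistinguishable from a point inside it.

```python
import math

def rbf(d, length_scale):
    # Squared-exponential (RBF) kernel as a function of distance d:
    # correlation between the values at two points d pixels apart
    return math.exp(-d * d / (2 * length_scale ** 2))

LENGTH_SCALE = 50  # hypothetical length-scale, in pixels

for d in (5, 20, 100):
    print(f"{d:>3} px apart: correlation {rbf(d, LENGTH_SCALE):.3f}")
```

With these numbers, points 5 pixels apart (roughly the width of the sweet spot boundary) have a correlation of about 0.995, so the high value inside the band is inevitably smeared well outside it.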
Oddly, one of the “problems” with classic decision trees, namely that they only draw axis-aligned boundaries, actually happens to be the correct way to develop a policy in this domain. In other, more “well behaved” problems, this may turn out to be a deficiency of decision trees and an advantage for GP; the opposite is the case here, however.
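As a toy illustration of this point (the 30-pixel band width is an invented number; the advisor names follow the setup above): a single axis-aligned split of the kind J48 makes can represent a thin wall-hugging band exactly, with no bleed, whereas any smooth interpolator must ramp across the boundary.

```python
def in_sweet_spot(dist_to_wall, width=30):
    # Ground truth: a thin band along the wall (hypothetical 30 px width)
    return dist_to_wall < width

def tree_policy(dist_to_wall, threshold=30):
    # A single axis-aligned split, as a decision tree would draw it
    return "TrackFire" if dist_to_wall < threshold else "WallBot"

# The split reproduces the band exactly at every pixel distance
print(all((tree_policy(d) == "TrackFire") == in_sweet_spot(d)
          for d in range(200)))
```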
The other seemingly large problem was the efficiency of the algorithm, as mentioned above. If it could learn from more sample points, it might be possible to learn the correct policy. By the 7th epoch there were already more samples taken than could be used, and this was only the result of 700 rounds of training. For the experiments with J48, on the other hand, 700 rounds was only 7/10 of the way through the first epoch.
Playing with the parameters may also help with this problem. For example, reducing the “smoothness” of the value function for each advisor might allow for a better policy, but based on these results it seems that, at best, performance will be equivalent to that found with decision trees.