Below are two images visualizing data logged during trials against Wallbot in Robocode. The axes of both images are the X and Y axes of the actual battlefield.
The first image shows where the agent preferred to use each advisor; notably, red corresponds to trackfire. As you can see, the preferred location to use trackfire is very close to, but not exactly along, the wall, which is exactly where the sweet spot against Wallbot is. Interestingly, it didn’t seem to recognize that there is a sweet spot along the upper wall as well.
The second image indicates where the recorded (but, as we learned, not necessarily actual) damage ratio changed: blue indicates up, and red indicates down. Observe the very high correlation between the red points in the first image (where trackfire was used) and the blue points in the second (where the damage ratio was expected to move favorably for the agent). Just as the red band along the top wall is absent in the first image, no high-reward band was found there in the second.
Interestingly, the sweet spot started at the lower-left corner and “grew out” over the iterations to cover the entire lower and left walls, and then up the right wall toward the end. I think that with even more training, it would have found the upper sweet spot as well.
These results came from a policy that observed only the advisor and the current x, y coordinates as the state. Performance hovered at about a 25% win rate, and score at about 35%. It seems quite clear that the agent is finding the sweet spot in most places (except along the top wall, as mentioned). So why isn’t it doing better? I have a few ideas (still not giving up)!
- It doesn’t “know” that it can use wallbot to get out of the center. Since one decision tree is shared across all three advisors, whenever the agent asks which advisor to use in the center, the tree doesn’t differentiate between the three; it just says “bad,” and all three get the same score. This doesn’t make sense: wallbot should be used in the center precisely so the agent can get out of that dangerous region. I plan on switching to a separate decision tree for each advisor, so I can estimate how bad any point is for each advisor independently, making ties unlikely (there is a particular method I’m using to tie-break, but in practice it’s pretty close to uniform, which is not good here). In the center, trackfire and spinbot should end up with much lower scores than wallbot, since they basically hang out in that bad area.
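To make the per-advisor scoring idea concrete, here is a minimal sketch. The placeholder `score_fns` stand in for three independently trained decision trees (the real trees, field size, and score values are all assumptions, not the actual trained models); ties are broken uniformly at random, as described above.

```python
import random

ADVISORS = ["wallbot", "trackfire", "spinbot"]

def wall_distance(x, y, width=800.0, height=600.0):
    """Distance to the nearest wall (hypothetical 800x600 field)."""
    return min(x, y, width - x, height - y)

# Placeholder per-advisor value estimates: near a wall, trackfire
# is best; in the center, wallbot is the least bad option.
score_fns = {
    "wallbot":   lambda x, y: 0.5,
    "trackfire": lambda x, y: 1.0 if wall_distance(x, y) < 60 else 0.1,
    "spinbot":   lambda x, y: 0.3,
}

def choose_advisor(x, y, rng=random):
    """Score each advisor independently; break exact ties uniformly."""
    scores = {a: score_fns[a](x, y) for a in ADVISORS}
    best = max(scores.values())
    tied = [a for a, s in scores.items() if s == best]
    return rng.choice(tied)
```

With independent scores, a tie in the center only happens if the three trees genuinely agree, rather than being forced by a single shared “bad” leaf.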
- Instead of using x, y coordinates as the state, I plan on using just the distance to the nearest wall, and perhaps also the distance to the nearest corner (though my guess is that isn’t strictly necessary). That way the policy can generalize very easily, and there shouldn’t be missing bands of goodness like those that occurred in this experiment.
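The wall/corner features above can be sketched as follows (the 800×600 field dimensions are an assumption for illustration):

```python
def wall_and_corner_features(x, y, width=800.0, height=600.0):
    """Map raw (x, y) to (distance to nearest wall, distance to
    nearest corner), so positions along any wall look alike."""
    d_wall = min(x, y, width - x, height - y)
    corners = [(0, 0), (0, height), (width, 0), (width, height)]
    d_corner = min(((x - cx) ** 2 + (y - cy) ** 2) ** 0.5
                   for cx, cy in corners)
    return d_wall, d_corner
```

Because this representation collapses all four walls onto one feature, a sweet spot learned along the lower wall automatically transfers to the upper one.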
Once I get those experiments running, I’ll post the results. I think there may still be some decent results to be found with tweaks to the representation and the data analysis.