So I incorporated the ideas I mentioned at the end of my previous post, which were:
- Making a different decision tree for each advisor
- Adding two new (“derived”) variables to the state representation: distance to the nearest wall, and distance to the nearest corner.
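The two derived variables are easy to compute from the bot's position. Here is a minimal sketch, assuming Robocode's default 800×600 battlefield (the function name and field size are my assumptions, not the author's actual code):

```python
import math

# Assumed battlefield dimensions (Robocode's default is 800x600).
FIELD_W, FIELD_H = 800, 600

def derived_state(x, y):
    """Compute the two derived state variables from the bot's (x, y)."""
    # Distance to the nearest wall: the smallest gap to any of the four edges.
    dist_to_wall = min(x, y, FIELD_W - x, FIELD_H - y)
    # Distance to the nearest corner: Euclidean distance to the closest
    # of the four battlefield corners.
    corners = [(0, 0), (FIELD_W, 0), (0, FIELD_H), (FIELD_W, FIELD_H)]
    dist_to_corner = min(math.hypot(cx - x, cy - y) for cx, cy in corners)
    return dist_to_wall, dist_to_corner
```

Because both variables are symmetric in x and y, they collapse the four corners and four walls into one scalar each, which is what lets the learned trees stay so small.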
I’ll cut right to the results and then give an explanation:
Here the y-axis represents the goodness ratio for each metric for the agent relative to WallBot, and the x-axis represents training epochs (the 1st being completely random). For example, the green triangle at (2, 0.6) represents (rounds agent won)/(total rounds). After the last training epoch, the policy had converged completely, in the sense that the decision trees no longer changed.
Some interesting things to note:
- I needed to sample 3,000 matches under the random policy before Weka would generate a decision tree larger than one leaf node (all “bad”) for each advisor. All subsequent epochs lasted 1,000 matches each. All data was merged, so each epoch had all previous data available when generating its decision trees, not just the data from the previous epoch.
- On epoch #2 (as labeled in the chart), based on the decision tree, the agent believed that SpinBot was the best choice in almost all situations, which resulted in the very poor performance during that epoch.
- On epoch #3 (as labeled in the chart), based on the decision tree, it ceased to use SpinBot altogether, due to SpinBot's poor performance in the previous training epoch. SpinBot was never used again.
- The policy seemed mostly converged by the 4th epoch, but continued to change slightly up until the 6th.
- By the 6th epoch, score, damage caused, and win rate are all better than WallBot's.
- I attempted to add more state variables (such as last recorded opponent distance, which is important for TrackFire), and it actually reduced the win rate from over 50% to about 25%. I suspect this may be a case of overfitting.
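The cumulative-data scheme described above can be sketched as a short loop: epoch 1 collects 3,000 matches under the random policy, every later epoch adds 1,000 more, and the trees are always rebuilt from the merged pool. This is only an illustrative sketch; `collect_matches` is a hypothetical stand-in for running Robocode matches, and the string policy is a stand-in for Weka's tree induction:

```python
def collect_matches(policy, n):
    # Placeholder: would run n Robocode matches under `policy` and log
    # (state, advisor, outcome) samples; here it just returns n dummy records.
    return [{"policy": policy, "match": i} for i in range(n)]

def train(epochs=6):
    pool = []            # all samples ever collected, across every epoch
    policy = "random"    # epoch 1 acts completely at random
    for epoch in range(1, epochs + 1):
        n = 3000 if epoch == 1 else 1000
        pool.extend(collect_matches(policy, n))
        # Rebuild the per-advisor trees from the ENTIRE merged pool,
        # not just this epoch's data (stand-in for Weka's J48 here).
        policy = f"tree-from-{len(pool)}-samples"
    return pool, policy
```

Merging all epochs means early random-policy data is never thrown away, which helps keep the trees from overreacting to any single epoch's results.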
The behavior of the converged policy is consistent: it always uses WallBot at the beginning of the round, until it gets near or at the wall, at which point it switches to TrackFire for the duration of the round. SpinBot is not used at all. This is pretty close to what I would hand-code for a policy. Interestingly, the policy found by the decision trees is extremely short, and I will post it below:
TrackFire tree:
disToWall <= 17.98183: down (0.7172)
disToWall > 17.98183
| disToWall <= 35.515157: up (0.7242)
| disToWall > 35.515157: down (0.8641)

WallBot tree:
disToCorner <= 263.925892: down (0.7227)
disToCorner > 263.925892
| disToCorner <= 264.569543: up (0.777)
| disToCorner > 264.569543: down (0.789)
Interestingly, TrackFire only cares about the distance to the nearest wall, while WallBot only cares about the distance to the nearest corner.
The actions which resulted from this policy produce the following interesting visualization, which shows the change in damage ratio experienced as a function of x, y coordinates when using TrackFire. The “sweet spot” lights up in blue. Note that x, y coordinates were not used as part of this policy:
Similar plots for data logged using SpinBot or WallBot do not show patterns as consistent as this; those visualizations tend to look much more “random”.
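A plot like this can be produced by binning the logged (x, y, damage-ratio change) samples into a coarse grid and averaging per cell. A minimal sketch, with the field size and bin count as assumptions and the actual plotting (e.g. a heatmap) omitted:

```python
def grid_average(samples, field_w=800, field_h=600, bins=10):
    """Average damage-ratio deltas per grid cell; None for empty cells."""
    sums = [[0.0] * bins for _ in range(bins)]
    counts = [[0] * bins for _ in range(bins)]
    for x, y, delta in samples:
        # Map the coordinate into a bin index, clamping the far edge.
        i = min(int(x / field_w * bins), bins - 1)
        j = min(int(y / field_h * bins), bins - 1)
        sums[j][i] += delta
        counts[j][i] += 1
    return [[sums[j][i] / counts[j][i] if counts[j][i] else None
             for i in range(bins)] for j in range(bins)]
```

Feeding the resulting grid to any heatmap routine would reproduce the kind of picture above, with consistently positive cells (the “sweet spot”) standing out against noise.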
At any rate, we finally have an algorithm and a policy with a slight edge over WallBot, which I am fairly pleased with considering how simple the discovered policy is, and how imperfect our representation is of what is occurring in the actual game (energy lost from each shot isn’t modeled, nor is energy gained from hits). Additionally, this was done using much less data than Dan&Co. was using to craft his solution.
Tomorrow, I am going to try combining the approaches of the last experiment (using x, y location as state, as opposed to distances to the wall and corner) and this experiment (a different decision tree for each advisor) and see what the result is. I’m expecting performance as good as or worse than this, but I’ll have word soon.