Re: Followup Robocode Experiment

I wrote alot here, and I’m not sure if I focus too much on minutia.

Let me know what you are looking for: more/less details? Graphs? Insight?  I’m just trying to make sure I give you the information you are looking for, with our swamping you with minor details.

Your words are green, mine are black, I indented to show nested responses, but this is probably obvious.


Comments as I read:

> The first image shows where the agent prefered to use each advisor,
> where notably red corresponds to trackfire.

Can you describe how the agent was trained?  If we’re going to try
to tie this baby off, we need to bring the whole story together.

Sure, its pretty simple.  For the first epoch, at the beginning of the round, and every so often (to be precise, every 50 units of time in the game), an advisor would randomly be chosen.  After the first epoch, I had an actual decision tree that scored each state, and used that to decide which agent to use.

In this experiment, I was still using one decision tree for all advisors, which I eventually figured out was a problem.  As I mentioned in the post, bad areas were classified as bad for all advisors, since there was one decision tree for all, all advisors recieved the exact same score.  In that situation I had to randomly tiebreak (which was essentially uniform, even though there was a heuristic), which lead to a lot of non-optimal behavior (such as choosing spinbot or trackfire from the center, for example).  That random tie-breaking also lead to alot of what I would call “thrashing” behavior-switching back and forth between advisors pretty much with no positive effect.  This is what held performance to the ~25% win rate until I switched to independent decision trees for each advisor.

At any rate, data was logged, and the decision tree was built from data from all the epochs that occured, not only the most recent one.

> Interestingly, it didn’t seem to recognize that there is a sweet
> spot along the upper wall as well.

On the up side, it’s obvious we didn’t *force* it to learn the
sweet spot if it only learned half of it…

> Interestingly, over the iterations, the sweet spot started at the
> lower left corner and “grew out”



> After the last training epoch, the policy had converged completely,
> as determined by the decision tree.

What does that mean?  How does the decision tree record
convergence?  Or do you mean that there were two epochs in a row with
the same tree?  (With the noisiness of the data, I’m not sure that we
could prove that the thing truly converged.)

The latter – at some point the decision tree stopped changing, so if the tree didnt change after the addition of 1,000 new rounds worth of data, I considered it converged.

> By the 6th epoch, score, damage caused, and win rates are all better
> than WallBot.

Excellent.  Can we show it is statistically better?  I think we’d
need a bunch of runs and a statistical test.

Since I got better results with the next experiment (which I’m assuming you didn’t read at this point), I’ll consider this moot?  Let me know if you would like to see some further statistical tests on the results of this experiment or of the next experiment, which seems significantly better so that a rigorous statistical test may not be necessary.

> I attempted to add more state variables (such as last recorded
> opponent distance, which is important for TrackFire), and it
> actually reduced the win rate from over 50% to about 25%.  I suppose
> it may be a case of overfitting?

A likely explanation.  I’d also check to make sure the other
features didn’t get corrupted somehow.

Thanks, I did some manual inspection, and everything seemed kosher…

> Interestingly, the policy found by the decision trees is extremely
> short, and I will post it below:

Beautiful, thanks.  That’s the Q function.

Ok, cool!  That wasn’t obvious to me, since the algorithm doesn’t explicitly factor in the expected discounted future reward, or difference between expected and actual reward.

Let’s see what the
value function / policy looks like (the max, assuming I’m reading
things right):

disToWall <= 17.98183:
disToCorner <= 263.925892: down 0.7172 [TrackFire]
disToCorner <= 264.569543: up 3.4848 [Wall]
else: down 0.7172 [TrackFire]
disToWall <= 35.515157:
disToCorner <= 263.925892: up 0.7242 [TrackFire]
disToCorner <= 264.569543: up 3.4848 [Wall]
else: down up 0.7242 [TrackFire]
disToCorner <= 263.925892: down 0.7227 [Wall]
disToCorner <= 264.569543: up 3.4848 [Wall]
else: down 0.79524 [SpinBot]

Based on the visualization below, I think you can basically simplify this to:

If disToWall < 35: use trackFire
Else: use wallBot

Which is very simple, of course.

It can probably be cleaned up a bit more, but it’s pretty nice.  Are
the distance measures (disToCorner, disToWall) in the same units?
That is, is it always closer to a wall than to a corner?

Correct.  I guess to be precise the wall distance is <= corner distance always.

It might be useful to plot the policy on a grid (three colors for
the three advisors).  Can you do that?

Absolutely, here is a plot of the policy converted to the x, y space of the arena.

I’m not sure why the graph is a little off center – I checked my math and it looks right, and I dont think its weka’s fault, so I’m not sure what the issue is.  Since its based of wall and corner distances, each quadrant of the graph should be symmetric.  Basically, the distance to wall part of the state space is the dominant aspect relative to which advisor to use, it seems the distance to corner only adds a very thin stripe, which is probably superficial; I doubt it would resurface if I ran the experiment again.

At any rate, the general characteristing (including the odd stripe) can be confirmed by viewing the plot of the policy with the x-axis as distance to the wall, and y-axis as distance to the corner:

> The “sweet spot” lights up in blue.

I thought the sweet spot was near the corner, not all the way
around the wall

It can be along the wall as well.  Basically here’s the dynamics:  Wallbot will glide along a wall with its gun perpendicular to the wall.  If it happens to see the opponent as its sliding by the wall, it will fire and continue moving.  If however, it sees the opponent and Wallbot is in the corner (that is, Wallbot’s opponent is against the wall Wallbot is about to transition to), it will stop in that spot (the corner) and continue to fire until the opponent moves out of the way.  If you are Wallbot’s opponent it is a good strategy to hang out just far enough away from the wall so that Wallbot won’t just stop and fire at you.

If you are off the wall by a very small amount Wallbot will keep moving, and may even bump into you, meanwhile, you can hit it the whole time.  For this reason the sweet spot isnt only off the corners.

> The approach used in this experiment yielded the best performance so
> far (slightly better than the immediate previous experiment):

Killer!  Nice work.

I’m glad you are happy – I wasn’t sure whether to be satisfied with the results or not :)

> In this case, the policy ended up being pretty simple, so as in the
> last post, I’ll include the policy here:

Can you plot it for me?  Is it discovering the channel near the

Absolutely, here is a plot of the policy found based on this approach.  Although the dimensions of the arena are 800×600, the robot can only be located in the range from ( 18, 18 ) to ( 582, 782 ), because each half of the robot takes up 18 units of space.

As you can see, based on the runs done, it didn’t find the sweet spot along the right wall.  I think this is because after the first epoch, there is essentially all exploitation with no exploration, so something that is deemed very bad after the first epoch may never be attempted again; simple changes to the algorithm could fix this of course.

I think that if it found the sweet spot along the right wall, performance would be further improved.  Also notice that the algorithm found that using trackfire was the best option as long as you are at or closer to the wall than the sweet spot.

Here is a visalization of the performance of only trackfire under this policy:

Note that the agent only stops along the wall on the right side (where it uses wallbot until it reaches the wall, then slides down, and then transitions to trackfire.  Along all other walls, the agent tries to transition to trackfire in the sweet spot and not against the wall, which is demonstrated by the agent not going to a location lower than 26 on either axis, and to a maximum of 567 (of a possible 582) along the top, but all the way to the maximum of 782 on the right.

Sometimes it ends up stopping close to the wall because I ended up changing advisors something like every 5 clock ticks; any less than that didn’t seem to work well.

And here is a visalization of the performance of only wallbot under this policy:

Likewise, here you can see that the agent transitions out of wallbot at about 50 units away from the wall, which allows it to stop in the sweet spot with trackfire.

And a quick video of the policy in action:

As you can see, the sweet spot is very thin; its easy to be either too far away from the wall (letting wallbot slip by), or to be too close to the wall (causing walbot to stop and fire at you).  It its best to be in between those two, so that it doesn’t stop to fire at you, but bumps into you along the way.

I’d like you to put together a report for Dan.  It should detail
what eventually worked, describing the algorithm, the data, the
learned policy, the performance, and any relevant statistical tests.
In a discussion section at the end, write your thoughts about how
things ended up compared to the Lockheed work we heard about.  LaTeX
would be great, although I’d tolerate Word for something short like

Ok, I think I’ll do it in Word then for the sake of expediency.  I’ll run it by you before I send it over. Can you open/edit word files?  I can export to pdf if you like.

Any thoughts aobut publishing any aspect of this work?  Did we
learn anything worth sharing with the world?  I *think* what we’re
doing is roughly the same as what Kaelbling did in her thesis.  Can
you take a look at

(That document) Decision Tree Function Approximation in Reinforcement Learning


(That document) Input Generalization in Delayed Reinforcement Learning: An Algorithm And Performance Comparisons

and write down some thoughts about how the algorithms (and results)
relate to each other?  (Being about to do this sort of comparison is
really important and it takes some practice.)

Very basically the “G” algorithm uses a decision tree to partition the state space, and then uses the leaf nodes in the decision tree in place of the partitions that are more normally used in an algorithm such as Q-Learning, where each dimension of the state space is uniformly partitioned into a number of cells that are (generally) chosen arbitrarily.   From what I could tell both papers used essentially the same algorithm, and the second paper is mostly empirical results of using that approach.

In the approach which we used in the robocode problem, we are doing classification, and we aren’t doing any type of estimation of future discounted reward explicitly anywhere.  The score for each node ranges from -1 to +1, based on whether the leaf classifies that state as improving the agent/opponent health ratio, and then the percent of samples which are in that leaf node that agree with that classification.

It seems like what was done in the Kaelbling paper would help fix the problem of our agent “shooting itself out,” as they penalize the agent for missed shots, which we did not do.  On the other hand, they only track this, so they are not assigning negative reward for coming into contact with a ghost (If that is even possible in that domain).

Aside: I did consider tracking shot accuracy, but a bug in Robocode prevented me from doing this. Basically I should have been able to make a queue and then track where shots were taken from and how they landed, but a bug in the cleanup code at the end of the round prevented me from doing this.

At any rate, another difference is that their decision tree is built on-line, while ours is built off line.  My guess is that the difference is somewhat significant, because decsion trees that are built batch-style on a corpus of data will be able to produce more efficient trees than trees built on-line (by which I mean trees built batch-style are probably shorter, and more efficiently split the states at each node).

Also, from my understanding they are throwing out data every time a split occurs, while we never throw out any data.  In the long run, this may not be very significant, and I think the G-Learning algorithm could maintain all history with a pretty simple modification.

The other difference is that we don’t factor in any next state data in our tree, just simple classification of a fraction ranging from -1 to 1, based on what has happened in the state space that leaf represents in the entire history.  That is we have no explicit function that looks like

Q(s, a) = r(s, a) + gamma*( max_a' [Q(s', a')] - Q(s, a)).

My hunch is this is a much more significant factor than the online vs. batch mode difference, since this means that values in the leaf nodes are being computed very differently, as opposed to the tree just having a different shape.  We are doing no such computation, just more of a statistical average of what was observed in each leaf.

I’m not sure there is anything to be published here (especially given these two papers), but I don’t have a good handle on what that means exactly yet.  If you disagree let me know.

Is that the type of analysis you were looking for?


I’m glad you are thinking creatively and actively.  I was hoping to
get you thinking about Bayesian stuff a bit, as I believe that will be
the next wave of work coming out of the lab over the next few years.

Sure, I’m not particularly stuck to any problem.

Can you take a look at this paper: ?

I haven’t read it yet, although I’ve spoken to a few of the
authors.  What I’d like us to do is work on a learner that can
construct rich transition models from priors, somewhat like what these
guys do with classification problems.

Ultimately, I’d like us to be able to learn to get through a whole
bunch of screens of pitfall (instead of be limiting to just the first
one).  I think one of the problems with the current pitfall work is
that we the people need to come up with the hypothesis space for
learning and we need to make it broad enough to cover the dynamics but
not so general that the learner can’t generalize.

A Bayesian perspective might be good since priors let you create a limited
“high likelihood” hypothesis space, but to also include a broader
range of models with low probability just in case the likely models
don’t cover things well.

As a concrete example, consider a learner facing environments like
this (inspired by Carlos’ mouse-wall example):

* The goal is behind a wall.
* Each segment of the wall has a color and a texture (say).
* The likelihood that the agent can push through a segment of wall
depends (probabilistically) on features of the segment, some of
which are visible and some not.
* For concreteness, let’s say 75% of the time color is the key, 15%
of the time texture is the key, and 10% of the time, hidden
features are the key.
* In the “classical” KWIK setting, we’re basically screwed.  In the
worst case, no generalization is possible, and since KWIK is a
worst-case framework, the agent is stuck checking every segment.
* A Bayesian algorithm ought to be smarter and try to generalize by
color first, then texture, then fall back to checking every

Sorry, but I’m feeling a little obtuse here.  What does this probabilistic DNF grammar have to do with this?  Are you saying we have priors over things like:

* what % color counts for
* what % texture counts for
* what % hidden features count for?

And that perhaps these priors are initially bad (perhaps uniform) and are then updated during the course of training?  Is it that simple, or does it go deeper than that?

I’m not aware of published algorithms that can reason like this, so
it seems like a fertile area to explore.

Ok, in general I don’t like Bayesian stuff where good priors are assumed a priori – it seems like cheating to me.  I also don’t like algorithms that assume a particular distribution over the data (such as Gaussian).

But, what is done in the paper seems very reasonable, since they assume all priors uniform, and then modify them as training occurs (if I understand it correctly), and only are dealing with a distribution over productions, which are modified during training.

Although I’m not particularly interested in whether the results match human performance, the paper is interesting because the mathematics behind it makes sense and is very reasonable.  I like it.

There are also some great opportunities for ‘transfer’ work in this
setting, as a Bayesian approach can modify its prior based on example
environments, allowing it to learn in a new environment much faster
than something that is always starting from scratch.  So, instead of
telling it that colors matters 75% of the time, this sort of bias can
come out of trying a bunch of different environments.

Ok, FYI I’m not more interested in transfer learning than anything else, I just thought of something that might be useful.  If its a good fit and its novel, I’ll be happy to look into it though.

I’m hoping we can start a dialog and settle on something concrete
to try over the summer to get some experience with this setting.

Cool, I’m all ears (eyes?).  I’m going to work on the writeup for Dan and then we can discuss this more.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: