## Randomly beating Pitfall!

Got the gold.

So doing lots of random rollouts where the backward/left actions aren’t allowed lets you get through pitfall pretty easily.  Almost every screen can be finished in under 1000 rollouts.

I tried using simple hillclimbing, but at Sergiu’s recommendation I turned it off.  Turns out this works better as hillclimbing seemed to get stuck in local optima.  Not sure how to present this for the paper, and whether I should play with other optimization methods, but for now I’m pretty happy.

## Pitfall without model learners

While I think model building in Pitfall is probably the most interesting aspect of the problem, it occurred to me this morning that it may not even be necessary in order to solve the actual problem.

The reason is that since we are running Pitfall in an emulator, we have the ability to store and load game states at any point (assuming I haven’t nuked that behavior by accident with all the edits I’ve made).  What this means is that although we do not have a true generative model, we do have everything any forward searching algorithm requires already.

For example, it should be simple to do the exhaustive type of rollout where we try every action from every state and find every possible state at some horizon by building a tree of depth equal to the horizon length, and fan out equal to the number of actions (just for arguments sake; I know this in particular isn’t practical).  All it is is a tree with the start state at the root, an action executed at each branch, and then the resulting true game state (which we actually have access to) in the following node, and so on.

From my perspective at this point it seems there are four interesting parts to the whole problem:

• Object detection: Recognize a wall, ladder, etc as such, in order to reduce the complexity of the problem for the learners.  This should be done totally automatically with no input about the number of objects in the game, what objects in the game look like, etc.
• Model learning: Building an actual model of the game dynamics.  This could be done without object detection, but my expectation is that without object detection this would be extremely difficult.  These predictions are use by the planner (possibly through a merger).
• Merging: If we have many learners, how do we take input from all of them and turn that into a coherent picture for the planner?
• Planning: How to get the reward.  My proposal is that right now we exploit the nice pieces of functionality an emulator provides to play the role of model building, but in the future actual learned models could be substituted.

Being that each of those three components by themselves are probably pretty challenging, I suppose it would be useful to think about the order in which to do them.  Although doing rollout and storing true game states somewhat obviates the need to have a nice representation, I think it would be nice to have it anyway (this would also allow us to snap in real learners with less difficulty later on).  Therefore, I think the order of work should be as follows:

1. Object Detection: Should make the work of everything else easier (or at least not any harder)
2. Planning: If we can’t get a planner to work with true predictions, the learners certainly aren’t going to be useful at all, so we need to lick this problem up front.
3. Learning: The icing on the cake, allows us to throw away a dependency on the save/load function of emulators, and leaves us with a very nice story if it all works out.
4. Merging: Only necessary in the case that we end up with multiple learners, and the learners themselves work.  Certainly not necessary if we are leveraging the emulator at first, since its always right.

This is may be a lot at once, does it make sense?

## Using entropy to find the order of a process

In a previous post, I wrote about using regression to infer the order of a process, which was motivated by trying to find an approach to solving Pitfall.

The idea was that the same approach could be used to not only find the order of a process, but also the causal structure, as would be described in a temporal Bayesian network.

But, after thinking about it, it seems like there are more principled ways of doing the same thing, especially because everything in Pitfall is integer (not real) valued, and deterministic given the right model.  In line with my attempts to sneak information theory into model building, it seemed like we could use entropy to decide whether or not to include a variable in the prediction of a target value.

In a gross simplification, I considered a simulation of the crocodile where the data is just a list of  {0,1}*, where 0 indicates a closed mouth and 1 an open mouth.  Assume that the sequence actually looks like {0^k, 1^k}*.  In this case, we know from inspection that k is the order, but how can entropy help us?

Consider the case where k = 5, so our sequnce looks like: [0,0,0,0,0,1,1,1,1,1,0,…]

If we are given no data, and need to predict the next state, basically we’re hosed.  We know from the data (imagine its a very long sequence, or that the length is some multiple of 10) that there is an equal probability of the next state being a 0 or 1.  In this case, as with an even coin, the entropy of prediction is = 1.

Now consider being given data which is the state at the present.  Consider the case it is 0.  If we are looking at one of the “first” 4 zeros in the sequence above, the next value will be 0.  If we are looking at the “fifth” zero in the sequence, then the next value will be 1.  In the case where the data is split (4/5), (1/5), the entropy is approximately 0.722.

You can imagine that as more history of the trajectory is given, the entropy will decrease further until it is 0.

I wrote up a quick experiment of the above scenario, and this is what the result looks like:

```Order: 0 Entropy: 1.0
Order: 1 Entropy: 0.721928094887
Order: 2 Entropy: 0.649022499567
Order: 3 Entropy: 0.550977500433
Order: 4 Entropy: 0.4
Order: 5 Entropy: 0.0
Order: 6 Entropy: 0.0```

Just as you would expect.  I think using information gain instead of the direct entropy will give you basically the same result (just with differences instead).

Similarly, consider another gross simplification of the vine swinging.  Think of this as a process that ranges back and forth between some minimum and maximum values.  In this case, it modeled it as a process that looks like: [0,1,2,3,4,3,2,1,0,…]

In this case, we know its a 2-order Markov process because if we know only the current value, we don’t know if it is increasing or decreasing, but with 2 steps of history we know both.  Running the same experiment on this process yields:

```Order: 0 Entropy: 2.25
Order: 1 Entropy: 0.75
Order: 2 Entropy: 0.0
Order: 3 Entropy: 0.0```

Note that the entropy in a discrete case is bounded by log_2(#possible values).  In this case, log_2(5) = 2.32.  The 2.25 is slightly lower because 0 and 4 occur less often than 1,2,3, so it is almost, but not quite uniform.

I think thats pretty cool, because it gives some more justification to basically the same thing I was doing earlier in a very ad-hoc manner.  Also, this is computationally cheaper than doing most forms of regression.  Of course, this can be used in an OO manner by attempting to measure the entropy not only by the order of the object itself, but also by the including the state of other objects.

## JStellAI Mark 1

So I’ve been working on improving the efficiency of the screen scraping for JStellAI.  Its been brought up to about 1/4 real time speed on my laptop, so on a powerful machine like Dice, my guess it will be somewhere between 1/2 speed to real time.  This is much faster than Stella AI, but of course the object recognition here is much more primitive.

I don’t think there is any way to make the screen scraping much more efficient than it currently is.  It currently just makes one copy of the video buffer, iterates over it once, from each location starting a bounding box if there isnt already one of the color at that point.  The only other thing I can think of is using the boolean arrays for each color, and making threads that do the bounding boxes on all the squares, but being that I expect the other cores to be busy with other duties for Moola, I think there are diminishing returns to be found there.  Aside from that, I’m going to look into other parts of the emulator and see if there is anything that is artificially slowing the clock or anything like that.

Additionally, I added some Pitfall-specific code to screen out the black border on the bottom and left side of the screen.  This is necessary since otherwise a black bounding box is made over the entire screen which makes it impossible to identify holes.   Aside from that there is nothing specifically for pitfall.

I’ve changed the display code to draw just the outlines of the rectangles because its the easiest way to see whats going on.  Another video below:

There are still a few frames that look a little funny – I need to inspect the log to see if its the java video tearing or an actual problem in the logging.  Aside from that I think this is more or less done.

As I’ve said before, I’m a bit concerned about the type of features this is producing. For example, I think that with this implementation, the ladder and the floor look like the same object, so learning the difference between the two seems like it may be tough, but I guess we can address that when/if it is actually an issue.

Next step is to get the input hacked up so I can actually control JStellAI programatically and output the proper logs that contain the actual actions.  And then once that is together I have the task of getting all of this together under Moola.

That being said, there is a pretty practical question to ask here, which is “Are we going to use this instead of the other stuff?”  I’d hate to keep putting time into this only to have it not being used in the end.

## JStellAI first video

Yesterday I finished implementing the basic pieces to have JStella ouput traces, yesterday I recorded a trace of going through the first 3 screens of the game.  Objects are rendered in the color they appear as in the game, and the rectangles are rendered with alpha transparency and also have a solid border of the same color:

The Good:

• The implementation is simple; uses only Java source, and standard Java libraries.  This means people should be able to get this running much more easily (why I started this in the first place).

• Seems to be about as slow as Stella AI (I’m trying to optimize the code now).
• There is lag time at the beginning before the controls become responsive; it seems to be a period where the game is loading.
• There are a few frames that are empty aside from the black background – I need to investigate why thats happening.
• Hangs after Harry dies?

Notice that the holes don’t show up.  That is simply because there is a black border on the left and bottom of the screen, so a black bounding rectangle is drawn over the entire screen so no other rectangles are drawn inside that.  The simple fix (I tried it already) is just to crop out the black border on the screen.

Also, the logs and vines are hard to see because of their close color to the other colors around/behind them, but they are there.  I specifically added the solid borders to make them easier to see.

So more work needs to be done, but at the very least I think we can end up with an implementation that is at least as fast, and much easier to run, when compared to the C/Lua Stella AI, and I also think it won’t have the issue of dropping objects that earlier versions of that seemed to have.  I think with a bit of work I can get it faster.

## Lets try this

Crude screen dump of 3rd screen in pitfall

So its been quite difficult for me, John, or Carlos to get Stella AI running on dice, or any of our laptops.  A big part of this is a number of dependencies on external libraries, and silly differences in naming of libraries etc etc etc…

Since Michael was interested in using a more honest (thats my word, not his) image segmenter that just cuts out bounding boxes around colors, I decided to change course a bit.  This approach will probably give us poorer features, but if we can get learning working with it, it will be more impressive and general for other games later.

Stella AI is written in C++ (from what I understand), a language I won’t use.  Another issue that seems to prevent Stella AI from playing nicely seems to be the use of Lua, which I don’t know anything about (apparently, there are scripting languages that aren’t named Python).  At any rate, I found a version of Stella implemented in Java (JStella).

Just today I downloaded the JStella source, added a few lines to the source, and am dumping out video memory information.  The picture above is super crude because video memory just has values of indices to the palette, and I just dumped those numbers out and drew a chart in Excel.  Regardless, you can see whats going on.  Yes I know it seems oddly tall, not sure why the buffer is like that, but its there; if i was going out of the buffer id get an out of bounds exception, of course.

So, I’m going to go ahead with this.  JStella is very nice because it needs no libraries aside from the stanard Java libraries, which should keep headaches to a minimum, which is pretty important in a project that has so many different pieces to keep track of.

## 2 Videos of Stella AI’s object segmentation

Note, there are still a few bugs:  The second screen in the first video should have 3 black boxes corresponding to holes instead of just one, and in the second video notice that the first few frames have missing floor.

Color code is as follows:
Green: trees,
Dark Green: Crocs
Dark Grey: Ground
Lavender: Scorpion
Pink: Harry
Brown: Vine
Red: Wall
Black: Hole
Tan: Text
Blue: Log

## Estimating the order with other regression algorithms

When you asked if the property that seemed to show up in the last experiment was something that I thought was particular to GPs, I said no, that I figured that other regression techniques should also perform similarly.

Well, thanks to Weka, by just changing one line of code I was able to perform that experiment with a number of different regression algorithms, and the results are mixed.

Just for reference, this was the result with GPs (all plots are mean absolute error):

GP error

Which looks very clean, and almost parabolic.  Now here are the corresponding plots for other regression algorithms:

#### Linear Regression:

Linear regression error, order 1 to 20

The order/error graph for LinReg is pretty well behaved, but oddly the error spikes at 2.  There seems to be some flat regions and a then high degree of change in a short period of time with some noise.  The minimum error is at order 9.

#### Perceptron:

Perceptron error vs order

This is the noisiest graph produced out of any of the regression algorithms I tried.  Even still there is a noticeable gain in error as order increases.  Like GPs, Perceptrons had the smallest error at order 2.

#### SVM Regression:

SVM Regression error, order 1 to 6

This one is interesting, the error is very low and flat for all values aside from at 1.  Even still, the error at order 2 is smallest than all other values I tried by a small amount.  I don’t know much about SVM regression though, so its tough for me to interpret this at all.

I also tried Isotonic Regression (I don’t know what that is either), and the results were almost perfectly flat and consistent, with error minimized at order 1, the error for order 1 to 14 were all between 13.7 and 13.9, so Isotonic Regression seems to give you nothing for this approach.

#### Thoughts:

Like GPs, SVM Regression and Perceptrons had error minimized at 2.  Isotonic had minimum error at order 1.  LinReg was quite far off at 9.  In terms of the lowest observed error over all algorithms at any one point Perceptrons actually performed the best.

SVM and Isotonic Regression had very slowly growing error as a function of order, so the fact that they perform well in those cases actually makes it poorly suited for this approach.

The general shape of the graph with LinReg and Perceptrons seemed to be similar to that demonstrated by GPs, but with a higher degree of noise, which I suppose also makes them less well suited for the task.

Based *only* on this experiment it seems like GPs may be best for this approach, as the graph looks the best behaved, and seemed to find the “right” value.  This is a highly qualified statement, in that the results here seem weak.  If this experiment could be reproduced with similar results in another domain, I would be more confident of that statement.

## Estimating the order of an infinite state MDP with Gaussian Processes

This idea came to me the evening after our meeting on Wednesday. For some reason I was thinking about modeling the ghost movement in Pac-Man when I thought of it, but it seemed like it may be well suited to Pitfall! (and hopefully many other things as well).

Here’s the idea in a few sentences: Given that you are observing state, action transitions of some MDP with no knowledge of the underlying MDP, how would you go about estimating the order of the MDP? Maybe its reasonable to try using standard regression techniques, and try regressing a number of functions, and see which does best. What I mean by a number of functions is build each regressor to use a different amount of history (the most recent time steps), and then whichever regressor performs best tells us an estimated order of the MDP.

Yes, using past frames is definitely a way to make better predictions.

Cool – is this general idea already a well-known / explored one?

Using history to make up for missing state information?  Sure.

Lets go back to Pitfall! for a second – at a couple of meetings I was at we were at we discussing how to model jumping, and the more we thought about it, the more complicated the idea got. As I recall the basic thing we were saying was, well if the emulator gives you one snapshot of the guy in the air, how do you know where he’ll be when he lands?

It seems to me like thats a very restricted way of looking at the problem, and that if you look at not just the most recent snapshot, but also a number of previous snapshots, it should be pretty simple to figure out where the trajectory is going to take Pitfall Harry when he lands. Thats because you can’t estimate velocity with just one sample – if you see Pitfall Harry in the air he could just as easily be jumping straight up or forward – there’s no way of knowing.

Right? So more generally, for some moving body, consider that you are given snapshots every tick, and the goal is to guess where its going to be next. The way I see things, given the data we have, position is an order 1 variable, but velocity is an order 2 variable. So given the most recent snapshot, and the one before, we know both where it is and where its going. Would having a history of the past 10 snapshots help? Not really if the dynamics are simple (I’m using this as a concrete example, but I think this is more generally applicable).

So, I wanted to test the idea out, and since it seemed like it lended itself well to the Pitfall! work, I figured I’d hack up a simple physics engine and track the movement of a body which has dynamics basically as follows:

• It can “run” to the left or the right, doing so repeatedly will keep increasing velocity until some maximum velocity is reached.
• There is friction, so if its going quickly to the right and tries to go left, it will continue going right, just slower. Exerting still more force to the left will eventually make it move left.
• Simple gravity, it can “jump” if it is currently on the floor. While in the air, trying to run or jump don’t work.
• There are walls, if it hits the wall it just stops there.

Cool.  Although, since you hacked it up, it violates the RL3 prime directive.  Nonetheless, it makes it easier to collect data fast at this stage, so I’m not objecting.

Fair enough – right I did this more as a proof of concept, I’d be happy to try it on things not of my own doing in the future, I was thinking aside from pitfall, stocks might also be an interesting application?  Just an idea, could be anything.

I’d be more inclined to some other game.  Stocks are a bit of a

Haha – fair enough.

I had the agent behave randomly, with the following behavior: Each tick go left, right, or jump each with 5% probability. This actually gives pretty good and varied movement. I logged the position and action at each time tick, where the action is the force exerted to the left, right, or up (even if it was in the air and that force did nothing). I have it all nicely animated to keep me entertained.

Once I had all this logged, I assembled GP regressors, one to predict the X position and one to predict the Y position. I made a number of them, from what I will call “order” 0 to order 5 (although I have a feeling I’m abusing this term). Anyway, the value representation for each order goes as follows:

• Order 0: you know nothing
• Order 1: current values of [x, y, force_x, force_y]
• Order 2: everything in order 1, and also those values for the step preceding whats in order 1
• Order 3: everything in order 2, and also those values for the step preceding whats in order 2
• etc

So, how much data do you need to most accurately predict where the agent will be next? Clearly in order 0 you are hosed. As I explained above, order 1 is pretty tough, because you don’t have any idea of velocity (but you do know your change in velocity from the action), and order 2 gives you what you need. My prediction was that eventually more isn’t better, so there should be some bowl shaped mean error function over order – too little isn’t good because there’s no way to do the estimates, but too much is also bad because your training points may not have something close to what you just saw (imagine an order 50 regressor, good luck finding a training point similar to that), making estimation tough.

Or “data sparsity”, more generally.

So, here’s the show me the money section:

The x component is red, the y component is green. Both the train set and the test set were 2000 samples large.

mean absolute error, x-axis is order of regressor, y is error

mean squared error, x-axis is order, y-axis is error^2

Ok, that order 0 regressor sucks. Lets re-plot that starting at order 1 so we can actually see whats going on.

mean absolute error, order 1 to 5

mean squared error, order 1 to 5

Now lets make it pop by summing both x and y errors:

x + y mean absolute error

x + y mean squared error

So that pretty clearly shows that order 2 is the best way to go in terms of both mean absolute error and MSE. I claim success.

And to tie it up, the variance:

mean GP variance over sample set

Which I interpret as the higher the order, the more difficult it is to find similar points, which makes sense because the state space is exploding. This explains why you can’t just throw the kitchen sink at the problem and call every MDP an order 1000 MDP and just have it work. With an order too small you cant do the estimate, but with an order too high the variance just causes lots of trouble. That is evident in the apparent superlinear growth of error as the actual order of the MDP is exceeded.

Great job—really interesting insight, intriguing results.  Do you really think the GP part is helping here, or would any learning algorithm have done more or less the same thing?

Thank you very much.  I originally did think that GP would be special for this, and that I would be able to use the variance to plot these curves instead of the error, but that was due to a poor understanding of GPs on my part.  It turns out the variance has nothing to do with the variability of the value you are trying to estimate, but rather just the distance of the query point from what its already seen, which is why the variance for both the x and y GPs are identical, even though the x value is more difficult to predict than the y variable (because gravity makes the agent spend most of the time on the floor).

I’ll try this out with other regression algorithms just to confirm.

I left a comment on the page suggesting that maybe it’s time to get some data from Pitfall! and try running on that.  It might also be worth comparing to some other approach so we can see if GP is
responsible for the success or if you just set up the problem so well that any method would succeed.

I think both would be great.  I mentioned the general idea to Carlos before I had results and he seemed interested to see what the results would be.  I’ll try some other applications in domains I didn’t craft up and see how things go.

I’m wondering if we should try again to form a Pitfall! team and really do something dramatic…

I would be totally pumped.  I think I’m gonna try and work on the Pitfall logs in simulation next and see if it finds that Harry’s location is not useful to help predictions, in simulation for now, but hopefully I can get the emulator running (or at least some data) and then try it there.

I’d think Carlos or Andre could give you traces from the crocodiles or jumping or vine swinging—something with non-Markovian behavior you could try to model.

But aren’t these things Markovian if represented by an MC with high enough order?  For example, if i just look at the crocodiles now, I can’t really figure much of anything out, but given enough history I should be able to estimate almost perfectly how the mouths open and close – I just need the order of the MC to be long enough to capture one period of the mouth opening and closing?

Sorry, I was speaking informally.  I just meant fully observable, Markov order 0.

Yes, there are several levels at work here, although I’m not sure
quite what to call them:

* non-Markov, non-stationary: transitions can’t be captured modeled using “states”.
* infinite-dimensional POMDP: A weird class that can be modeled by PSRs but probably not learnable or useful.
* finite-state POMDP: Can be modeled by an underlying Markov process that is not completely observable.
* finite-order Markov: Special kind of POMDP where the most recent k observations can be used as state.
* Markov, completely observable: Most recent observation can be used as state (k=1).

Something else I think is cool is that if you can learn these models, John and Istvan are thinking about how to do the planning part.  We could glue all the pieces together and do something really neat.

That sounds excellent.

Update:

I should also point out I believe this will solve other problems in Pitfall, such as the ability to figure that log movement isn’t conditional on player movement, and that scorpion movement is.  Its basically just a technique that says what variables help predict a function and what don’t.  Additionally, I thought the ability of the GP to report variance at a point would help, but it really doesn’t, since its the same for both x and y trained GPs.