## Estimating the order of an infinite state MDP with Gaussian Processes

This idea came to me the evening after our meeting on Wednesday. For some reason I was thinking about modeling the ghost movement in Pac-Man when I thought of it, but it seems like it might be well suited to Pitfall! (and hopefully many other things as well).

Here’s the idea in a few sentences: given that you are observing state–action transitions of some MDP, with no knowledge of the underlying MDP, how would you go about estimating the order of the MDP? Maybe it’s reasonable to try standard regression techniques: train a number of regressors and see which does best. By “a number of regressors” I mean build each one to use a different amount of history (the k most recent time steps); whichever regressor performs best gives us an estimate of the order of the MDP.
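For concreteness, here is a minimal sketch of that procedure (my own toy construction – plain linear regressors stand in for whatever regression method is used): simulate an order-2 process, fit one regressor per history length, and compare held-out error.

```python
# Hypothetical sketch: estimate the order of an unknown process by fitting
# one regressor per history length and comparing held-out error.
# A plain linear regressor stands in for the GP here.
import numpy as np

rng = np.random.default_rng(0)

# Simulate an order-2 process: position integrates a randomly drifting
# velocity, so a single snapshot cannot reveal where the body is headed.
T = 600
pos = np.zeros(T)
vel = 0.0
for t in range(1, T):
    vel += rng.normal(0.0, 0.1)   # random "force" each tick
    pos[t] = pos[t - 1] + vel

def history_features(x, k):
    """Stack the k most recent values as features for predicting x[t]."""
    X = np.column_stack([x[k - i - 1 : len(x) - i - 1] for i in range(k)])
    return X, x[k:]

errors = {}
for k in (1, 2, 3):
    X, y = history_features(pos, k)
    n = len(y) // 2               # first half trains, second half tests
    w, *_ = np.linalg.lstsq(np.c_[X[:n], np.ones(n)], y[:n], rcond=None)
    pred = np.c_[X[n:], np.ones(len(y) - n)] @ w
    errors[k] = float(np.mean(np.abs(pred - y[n:])))

print(errors)  # error drops sharply from k=1 to k=2, then levels off
```

The k=1 regressor cannot see the velocity, so its error is dominated by how far the velocity has drifted; from k=2 on, the dynamics are fully determined up to the injected noise.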

Yes, using past frames is definitely a way to make better predictions.

Cool – is this general idea already a well-known / explored one?

Using history to make up for missing state information?  Sure.

Let’s go back to Pitfall! for a second – at a couple of meetings I was at, we discussed how to model jumping, and the more we thought about it, the more complicated the idea got. As I recall, the basic question was: if the emulator gives you one snapshot of the guy in the air, how do you know where he’ll be when he lands?

It seems to me like that’s a very restricted way of looking at the problem, and that if you look at not just the most recent snapshot but also a number of previous snapshots, it should be pretty simple to figure out where the trajectory is going to take Pitfall Harry when he lands. That’s because you can’t estimate velocity from just one sample – if you see Pitfall Harry in the air, he could just as easily be jumping straight up or forward – there’s no way of knowing.

Right? So more generally, for some moving body, consider that you are given snapshots every tick, and the goal is to guess where it’s going to be next. The way I see things, given the data we have, position is an order-1 variable, but velocity is an order-2 variable. So given the most recent snapshot and the one before, we know both where it is and where it’s going. Would having a history of the past 10 snapshots help? Not really, if the dynamics are simple (I’m using this as a concrete example, but I think it’s more generally applicable).
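As a trivial worked example (numbers made up): two snapshots give a finite-difference velocity estimate, and with it a constant-velocity guess at the next position – something one snapshot alone can never provide.

```python
# Toy illustration: one snapshot can't distinguish jumping straight up from
# jumping forward, but two snapshots pin down the velocity.
p_prev = (3.0, 5.0)    # position one tick ago
p_now = (4.0, 6.0)     # current position

# Finite-difference velocity estimate from the two samples.
v = (p_now[0] - p_prev[0], p_now[1] - p_prev[1])

# Constant-velocity prediction for the next tick.
p_next = (p_now[0] + v[0], p_now[1] + v[1])
print(p_next)  # (5.0, 7.0)
```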

So, I wanted to test the idea out, and since it seemed to lend itself well to the Pitfall! work, I figured I’d hack up a simple physics engine and track the movement of a body whose dynamics are basically as follows:

• It can “run” to the left or the right; doing so repeatedly keeps increasing its velocity until some maximum velocity is reached.
• There is friction, so if it’s moving quickly to the right and tries to go left, it will continue going right, just slower. Exerting still more force to the left will eventually make it move left.
• Simple gravity: it can “jump” if it is currently on the floor. While in the air, trying to run or jump does nothing.
• There are walls; if it hits a wall, it just stops there.
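A minimal sketch of dynamics along these lines (the constants and names are my own illustrative guesses, not the engine’s actual values):

```python
# Hypothetical re-creation of the toy dynamics described above; friction,
# gravity, and bounds are made-up constants, not the original engine's.
FRICTION, GRAVITY, RUN_FORCE, JUMP_FORCE = 0.9, 1.0, 1.0, 10.0
MAX_VX, FLOOR, LEFT_WALL, RIGHT_WALL = 5.0, 0.0, 0.0, 100.0

def step(state, action):
    """One tick. state = (x, y, vx, vy); action in {'left','right','jump',None}."""
    x, y, vx, vy = state
    on_floor = y <= FLOOR
    if on_floor:
        vy = 0.0                            # landing kills vertical speed
        if action == 'left':                # repeated runs build speed...
            vx = max(vx - RUN_FORCE, -MAX_VX)
        elif action == 'right':             # ...up to the velocity cap
            vx = min(vx + RUN_FORCE, MAX_VX)
        elif action == 'jump':
            vy = JUMP_FORCE
        vx *= FRICTION                      # friction: reversing is gradual
    else:
        vy -= GRAVITY                       # airborne: actions do nothing
    x, y = x + vx, max(y + vy, FLOOR)
    if x <= LEFT_WALL:                      # walls simply stop the body
        x, vx = LEFT_WALL, 0.0
    elif x >= RIGHT_WALL:
        x, vx = RIGHT_WALL, 0.0
    return (x, y, vx, vy)
```

One design note: because friction is applied after the run force, repeatedly running right saturates the speed at a fixed point below MAX_VX, which is one concrete way to get the “keeps increasing up to a maximum” behavior.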

Cool.  Although, since you hacked it up, it violates the RL3 prime directive.  Nonetheless, it makes it easier to collect data fast at this stage, so I’m not objecting.

Fair enough – right, I did this more as a proof of concept. I’d be happy to try it on things not of my own doing in the future. I was thinking that aside from Pitfall!, stocks might also be an interesting application? Just an idea – could be anything.

I’d be more inclined to some other game.  Stocks are a bit of a

Haha – fair enough.

I had the agent behave randomly, as follows: each tick it goes left, right, or jumps, each with 5% probability. This actually gives pretty good and varied movement. I logged the position and action at each tick, where the action is the force exerted to the left, right, or up (even if it was in the air and that force did nothing). I have it all nicely animated to keep me entertained.

Once I had all this logged, I assembled GP regressors: one to predict the x position and one to predict the y position. I made a number of them, from what I will call “order” 0 up to order 5 (although I have a feeling I’m abusing the term). Anyway, the input representation for each order goes as follows:

• Order 0: you know nothing
• Order 1: current values of [x, y, force_x, force_y]
• Order 2: everything in order 1, and also those values for the step preceding what’s in order 1
• Order 3: everything in order 2, and also those values for the step preceding what’s in order 2
• etc
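In code, assembling those order-k inputs is just stacking the k most recent log rows. A sketch (the [x, y, force_x, force_y] row layout follows the description above; the helper name is mine):

```python
# Sketch of building "order k" feature vectors from the logged trace.
import numpy as np

def order_k_features(log, k):
    """log: one row [x, y, force_x, force_y] per tick.
    Returns (X, y): X[t] stacks rows t-1, ..., t-k, and y[t] is the
    (x, y) position at tick t, the regression target."""
    log = np.asarray(log, dtype=float)
    X = np.hstack([log[k - i - 1 : len(log) - i - 1] for i in range(k)])
    y = log[k:, :2]
    return X, y

# Toy trace: the body slides right one unit per tick under constant force.
trace = [[t, 0.0, 1.0, 0.0] for t in range(6)]
X2, y2 = order_k_features(trace, 2)
print(X2.shape, y2.shape)  # (4, 8) (4, 2): order 2 doubles the input width
```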

So, how much history do you need to most accurately predict where the agent will be next? Clearly at order 0 you are hosed. As I explained above, order 1 is pretty tough, because you don’t have any idea of velocity (though you do know your change in velocity from the action), and order 2 gives you what you need. My prediction was that eventually more isn’t better, so there should be some bowl-shaped mean error curve over order – too little is no good because there’s no way to do the estimates, but too much is also bad because your training points may not contain anything close to what you just saw (imagine an order-50 regressor – good luck finding a similar training point), making estimation tough.

Or “data sparsity”, more generally.

So, here’s the “show me the money” section:

The x component is red, the y component is green. Both the train set and the test set were 2000 samples large.

mean absolute error, x-axis is order of regressor, y is error

mean squared error, x-axis is order, y-axis is error^2

Ok, that order-0 regressor sucks. Let’s re-plot starting at order 1 so we can actually see what’s going on.

mean absolute error, order 1 to 5

mean squared error, order 1 to 5

Now let’s make it pop by summing the x and y errors:

x + y mean absolute error

x + y mean squared error

So that pretty clearly shows that order 2 is the best way to go in terms of both mean absolute error and MSE. I claim success.

And to tie it up, the variance:

mean GP variance over sample set

Which I interpret as: the higher the order, the harder it is to find similar points, which makes sense because the state space is exploding. This explains why you can’t just throw the kitchen sink at the problem, call every MDP an order-1000 MDP, and have it work. With an order too small you can’t do the estimate, but with an order too high the variance just causes lots of trouble. That is evident in the apparent superlinear growth of error as the actual order of the MDP is exceeded.

Great job—really interesting insight, intriguing results.  Do you really think the GP part is helping here, or would any learning algorithm have done more or less the same thing?

Thank you very much.  I originally did think that GPs would be special for this, and that I would be able to use the variance to plot these curves instead of the error, but that was due to a poor understanding of GPs on my part.  It turns out the variance has nothing to do with the variability of the value you are trying to estimate, but rather just the distance of the query point from what it has already seen, which is why the variances for the x and y GPs are identical, even though the x value is more difficult to predict than the y value (because gravity makes the agent spend most of the time on the floor).
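That is easy to verify directly from the GP equations: the predictive variance k(x*, x*) − k*ᵀ(K + σ²I)⁻¹k* contains no y term at all. A small self-contained check (plain NumPy, RBF kernel, made-up hyperparameters):

```python
# Check: GP predictive variance depends only on the inputs, never on targets.
import numpy as np

def gp_posterior(X, y, Xq, noise=1e-3, ell=1.0):
    """Vanilla GP regression with an RBF kernel; returns (mean, variance)."""
    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / ell ** 2)
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xq)
    mean = Ks.T @ np.linalg.solve(K, y)
    # Note: the variance computation never touches y.
    var = 1.0 + noise - np.einsum('ij,ij->j', Ks, np.linalg.solve(K, Ks))
    return mean, var

X = np.linspace(0.0, 5.0, 20)[:, None]
Xq = np.array([[2.5], [7.0]])             # one query near data, one far away
_, var_hard = gp_posterior(X, np.sin(X[:, 0]), Xq)   # wiggly target
_, var_easy = gp_posterior(X, np.zeros(20), Xq)      # constant target

print(np.allclose(var_hard, var_easy))  # True: variance ignores the targets
print(var_hard[1] > var_hard[0])        # True: it grows far from the data
```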

I’ll try this out with other regression algorithms just to confirm.

I left a comment on the page suggesting that maybe it’s time to get some data from Pitfall! and try running on that.  It might also be worth comparing to some other approach so we can see if GP is responsible for the success or if you just set up the problem so well that any method would succeed.

I think both would be great.  I mentioned the general idea to Carlos before I had results and he seemed interested to see what the results would be.  I’ll try some other applications in domains I didn’t craft up and see how things go.

I’m wondering if we should try again to form a Pitfall! team and really do something dramatic…

I would be totally pumped.  I think I’m gonna work on the Pitfall! logs next and see if it finds that Harry’s location is not useful for predictions – in simulation for now, but hopefully I can get the emulator running (or at least get some data) and then try it there.

I’d think Carlos or Andre could give you traces from the crocodiles or jumping or vine swinging—something with non-Markovian behavior you could try to model.

But aren’t these things Markovian if represented by an MC of high enough order?  For example, if I just look at the crocodiles now, I can’t really figure much of anything out, but given enough history I should be able to estimate almost perfectly how the mouths open and close – I just need the order of the MC to be long enough to capture one period of the mouth opening and closing, right?
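A quick toy check of this intuition (hypothetical numbers – the “crocodile” is just a periodic open/closed bit): a context window long enough to identify the phase of the cycle makes prediction deterministic.

```python
# Toy version of the crocodile mouths: a periodic open/closed signal that
# looks stochastic at order 1 but is perfectly predictable once the history
# window identifies the phase of the cycle.
from collections import Counter, defaultdict

PERIOD = 6
mouth = [1 if t % PERIOD < 3 else 0 for t in range(600)]  # open 3, shut 3

def prediction_accuracy(seq, k):
    """Predict the next symbol from the last k via majority vote per context."""
    table = defaultdict(Counter)
    for t in range(k, len(seq)):
        table[tuple(seq[t - k:t])][seq[t]] += 1
    hits = sum(
        table[tuple(seq[t - k:t])].most_common(1)[0][0] == seq[t]
        for t in range(k, len(seq))
    )
    return hits / (len(seq) - k)

print(prediction_accuracy(mouth, 1))  # ambiguous contexts: imperfect
print(prediction_accuracy(mouth, 3))  # phase identified: exactly 1.0
```

Here a window of 3 already suffices, shorter than the full period of 6, because every length-3 context pins down the phase uniquely; in the worst case the window would need to cover the whole period.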

Sorry, I was speaking informally.  I just meant fully observable, Markov order 0.

Yes, there are several levels at work here, although I’m not sure quite what to call them:

* non-Markov, non-stationary: transitions can’t be modeled using “states”.
* infinite-dimensional POMDP: A weird class that can be modeled by PSRs but probably not learnable or useful.
* finite-state POMDP: Can be modeled by an underlying Markov process that is not completely observable.
* finite-order Markov: Special kind of POMDP where the most recent k observations can be used as state.
* Markov, completely observable: Most recent observation can be used as state (k=1).

Something else I think is cool is that if you can learn these models, John and Istvan are thinking about how to do the planning part.  We could glue all the pieces together and do something really neat.

That sounds excellent.

Update:

I should also point out that I believe this will solve other problems in Pitfall!, such as figuring out that log movement isn’t conditional on player movement, and that scorpion movement is.  It’s basically just a technique that says which variables help predict a function and which don’t.  Additionally, I thought the ability of the GP to report variance at a point would help, but it really doesn’t, since it’s the same for both the x- and y-trained GPs.

## 5 thoughts on “Estimating the order of an infinite state MDP with Gaussian Processes”

1. István says:

Cool! It’s a little strange that going from order 1 to 2 does not reduce the x-position error. Do you have an idea why?

Random idea: when you apply this kind of prediction in Pitfall!, you will probably need to do predictions multiple steps ahead (like “if I jump now, where will I fall?”). So it would make sense to train/test on multiple-step predictions, too. What do you think of that? I suspect that in that case, going up to order 3 (or maybe more) would still help.

István

2. Get some pitfall data! let’s see what it does. Andre? Carlos? Can you get stuff off one of the dynamic screens for Ari to play with?

3. hut3 says:

István: that is pretty strange now that you point it out; I don’t have a good explanation, to be honest. I feel like I threw enough data at it for 2-step prediction to work well. Do you have any guess?

As for the real Pitfall environment, I’m sure you are right that more will need to be done. I’m hoping, however, to be able to get something like this to work fairly generally so that it won’t have to relearn dynamics on every new screen. It would be cool to learn the dynamics of jumping and walking into a hole once and to then be able to remember that later on.

Surely, some good planning will need to be used, because as it is, it would be useful only for very local planning, such as avoiding obstacles.

4. István says:

I have no idea either why order 1 is better than order 2 for the x dimension… but it is probably not a very important question, anyway :-)

As for the Pitfall environment, I was also thinking of something general (not screen-specific), like predicting the whole parabola of the jump. There is a need for something like this: a jump is a single action, extending to multiple time steps, and we’d like to predict its outcome. For multi-step predictions, higher-order observations may help.

Another silly question for your next blog post (comments are closed there :-): both SVM and the perceptron seem to reach lower MSE than GP, so I did not quite get your reasoning – why is GP your favorite? OK, the oscillations of the perceptron are a bit strange, but it still stays below GP. And I would say that the slowly growing error of SVM is a positive thing :-)

5. hut3 says:

Both you and Michael brought up the last point.

I think I used a bit of strange logic in the next post, in that I liked GPs because they seemed to highlight exactly the right amount of history.

But you are right: since SVMs exhibited similar behavior (in that they at least demonstrated when there was too little data), the fact that the error grows slowly is certainly a good thing. I may switch over to that for future experiments, in fact.

Still, I think whichever regression algorithm is used (and it certainly could be something aside from GPs or svmReg), it’s probably good to use only the minimal amount of history necessary, if only so that the processing goes faster. Of course, for some algorithms (such as LinReg), increasing the order doesn’t slow them down much. But for anything that does matrix inversions, the extra size really causes pain.