This idea came to me the evening after our meeting on Wednesday. For some reason I was thinking about modeling the ghost movement in Pac-Man when I thought of it, but it seemed like it might be well suited to Pitfall! (and hopefully many other things as well).
Here’s the idea in a few sentences: given that you are observing the state/action transitions of some MDP, with no knowledge of the underlying MDP, how would you go about estimating the order of the MDP? Maybe it’s reasonable to try standard regression techniques: build a number of regressors, each using a different amount of history (the most recent time steps), and see which does best. Whichever regressor performs best gives us an estimate of the order of the MDP.
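A minimal sketch of that procedure – everything here is invented for illustration: a synthetic order-2 time series stands in for the MDP observations, and plain least-squares stands in for a fancier regressor:

```python
import numpy as np

def windowed_data(series, order):
    """Features = the `order` most recent values; target = the next value."""
    X = np.array([series[t - order:t] for t in range(order, len(series))])
    y = series[order:]
    return X, y

def regression_error(train, test, order):
    """Fit a linear predictor on `order` steps of history; return test MSE."""
    if order == 0:
        # No history at all: best you can do is predict the training mean.
        return np.mean((test - train.mean()) ** 2)
    Xtr, ytr = windowed_data(train, order)
    Xte, yte = windowed_data(test, order)
    A = np.hstack([Xtr, np.ones((len(Xtr), 1))])  # add a bias column
    w, *_ = np.linalg.lstsq(A, ytr, rcond=None)
    pred = np.hstack([Xte, np.ones((len(Xte), 1))]) @ w
    return np.mean((yte - pred) ** 2)

# Synthetic process whose true order is 2: each value depends on the last two.
rng = np.random.default_rng(0)
x = np.zeros(3000)
x[1] = 1.0
for t in range(2, len(x)):
    x[t] = 1.6 * x[t - 1] - 0.8 * x[t - 2] + 0.05 * rng.standard_normal()

train, test = x[:2000], x[2000:]
errors = {k: regression_error(train, test, k) for k in range(5)}
estimated_order = min(errors, key=errors.get)
```

On this process the test error drops sharply from order 0 to 1 to 2 and then flattens out, so the argmin lands at (or just above) the true order.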
Yes, using past frames is definitely a way to make better predictions.
Cool – is this general idea already well known / explored?
Using history to make up for missing state information? Sure.
Let’s go back to Pitfall! for a second – at a couple of meetings I was at, we discussed how to model jumping, and the more we thought about it, the more complicated the idea got. As I recall, the basic question was: if the emulator gives you one snapshot of the guy in the air, how do you know where he’ll be when he lands?
It seems to me like that’s a very restricted way of looking at the problem, and that if you look at not just the most recent snapshot but also a number of previous snapshots, it should be pretty simple to figure out where the trajectory is going to take Pitfall Harry when he lands. That’s because you can’t estimate velocity from just one sample – if you see Pitfall Harry in the air, he could just as easily be jumping straight up or forward; there’s no way of knowing.
Right? So, more generally, for some moving body, suppose you are given snapshots every tick, and the goal is to guess where it’s going to be next. The way I see things, given the data we have, position is an order-1 variable, but velocity is an order-2 variable. So given the most recent snapshot and the one before it, we know both where the body is and where it’s going. Would having a history of the past 10 snapshots help? Not really, if the dynamics are simple (I’m using this as a concrete example, but I think it’s more generally applicable).
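To make the order-1 vs. order-2 point concrete (the numbers are made up): with only the snapshot at time t you can’t separate “jumping straight up” from “jumping forward,” but with two snapshots a constant-velocity extrapolation is trivial:

```python
# One snapshot of the body mid-air is ambiguous; two pin down its velocity.
p_prev, p = (4.0, 10.0), (6.0, 11.0)        # positions at ticks t-1 and t
v = (p[0] - p_prev[0], p[1] - p_prev[1])    # finite-difference velocity
p_next = (p[0] + v[0], p[1] + v[1])         # constant-velocity guess for t+1
```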
So, I wanted to test the idea out, and since it seemed to lend itself well to the Pitfall! work, I figured I’d hack up a simple physics engine and track the movement of a body whose dynamics are basically as follows:
- It can “run” to the left or the right, doing so repeatedly will keep increasing velocity until some maximum velocity is reached.
- There is friction, so if it’s going quickly to the right and tries to go left, it will continue going right, just more slowly. Exerting still more force to the left will eventually make it move left.
- Simple gravity: it can “jump” if it is currently on the floor. While in the air, trying to run or jump does nothing.
- There are walls; if it hits a wall it just stops there.
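A sketch of what such an engine could look like – the class name, the constants, and the string actions are all my own guesses at a plausible parameterization, not the actual code:

```python
class Body:
    """Point body with the dynamics listed above (constants are invented)."""
    MAX_VX, ACCEL, FRICTION = 3.0, 0.5, 0.1
    GRAVITY, JUMP_V = 0.4, 3.0
    FLOOR, LEFT_WALL, RIGHT_WALL = 0.0, 0.0, 1000.0

    def __init__(self):
        self.x, self.y, self.vx, self.vy = 50.0, 0.0, 0.0, 0.0

    def on_floor(self):
        return self.y <= self.FLOOR

    def step(self, action):
        """action is "left", "right", "jump", or None; ignored while mid-air."""
        if self.on_floor():
            if action == "left":
                self.vx = max(self.vx - self.ACCEL, -self.MAX_VX)
            elif action == "right":
                self.vx = min(self.vx + self.ACCEL, self.MAX_VX)
            elif action == "jump":
                self.vy = self.JUMP_V
            # Friction only acts on the ground, bleeding off horizontal speed.
            if self.vx > 0:
                self.vx = max(self.vx - self.FRICTION, 0.0)
            elif self.vx < 0:
                self.vx = min(self.vx + self.FRICTION, 0.0)
        else:
            self.vy -= self.GRAVITY
        self.x += self.vx
        self.y += self.vy
        if self.y <= self.FLOOR:          # land: clamp to the floor
            self.y, self.vy = self.FLOOR, 0.0
        if self.x <= self.LEFT_WALL or self.x >= self.RIGHT_WALL:
            self.x = min(max(self.x, self.LEFT_WALL), self.RIGHT_WALL)
            self.vx = 0.0                 # walls stop the body dead

body = Body()
for _ in range(30):                       # run right until speed saturates
    body.step("right")
```

With these made-up constants the body saturates at a steady rightward speed (acceleration capped at MAX_VX, then friction bleeds a little off each tick), which gives the varied-but-learnable motion the experiment needs.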
Cool. Although, since you hacked it up, it violates the RL3 prime directive. Nonetheless, it makes it easier to collect data fast at this stage, so I’m not objecting.
Fair enough – right, I did this more as a proof of concept, and I’d be happy to try it on things not of my own doing in the future. Aside from Pitfall!, I was thinking stocks might also be an interesting application? Just an idea – it could be anything.
I’d be more inclined toward some other game. Stocks are a bit of a loaded concept right now.
Haha – fair enough.
I had the agent behave randomly, with the following behavior: each tick it goes left, right, or jumps, each with 5% probability. This actually gives pretty good, varied movement. I logged the position and action at each tick, where the action is the force exerted to the left, right, or up (even if the body was in the air and that force did nothing). I have it all nicely animated to keep me entertained.
Once I had all this logged, I assembled GP regressors: one to predict the x position and one to predict the y position. I made a number of them, from what I will call “order” 0 to order 5 (although I have a feeling I’m abusing this term). Anyway, the value representation for each order goes as follows:
- Order 0: you know nothing
- Order 1: current values of [x, y, force_x, force_y]
- Order 2: everything in order 1, and also those values for the step preceding what’s in order 1
- Order 3: everything in order 2, and also those values for the step preceding what’s in order 2
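In code, the order-k representation might be assembled like this (the function name and the tiny fake log are my own; the log is assumed to hold one [x, y, force_x, force_y] tuple per tick):

```python
import numpy as np

def order_features(log, order):
    """log: one (x, y, force_x, force_y) row per tick.
    Features = the `order` most recent rows flattened; target = next (x, y)."""
    log = np.asarray(log, dtype=float)
    X = np.array([log[t - order:t].ravel() for t in range(order, len(log))])
    y = log[order:, :2]          # we only predict the next position
    return X, y

# Tiny fake log: 10 ticks of drifting right with no force applied.
fake_log = [(float(t), 0.0, 0.0, 0.0) for t in range(10)]
X2, y2 = order_features(fake_log, 2)     # order-2 design matrix
```

Each bump in order adds four more columns (one more historical snapshot) to the design matrix, which is exactly the state-space explosion discussed below.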
So, how much data do you need to most accurately predict where the agent will be next? Clearly with order 0 you are hosed. As I explained above, order 1 is pretty tough, because you don’t have any idea of velocity (though you do know your change in velocity from the action), and order 2 gives you what you need. My prediction was that eventually more isn’t better, so there should be some bowl-shaped mean-error function over order – too little is no good because there’s no way to do the estimates, but too much is also bad because your training points may not contain anything close to what you just saw (imagine an order-50 regressor: good luck finding a similar training point), making estimation tough.
Or “data sparsity”, more generally.
So, here’s the show me the money section:
The x component is red, the y component is green. Both the train set and the test set were 2000 samples large.
Ok, that order-0 regressor sucks. Let’s re-plot starting at order 1 so we can actually see what’s going on.
Now let’s make it pop by summing the x and y errors:
So that pretty clearly shows that order 2 is the best way to go in terms of both mean absolute error and MSE. I claim success.
And to tie it up, the variance:
Which I interpret as: the higher the order, the more difficult it is to find similar points, which makes sense because the state space is exploding. This explains why you can’t just throw the kitchen sink at the problem, call every MDP an order-1000 MDP, and have it work. With too small an order you can’t do the estimate, but with too high an order the variance causes lots of trouble. That is evident in the apparently superlinear growth of error as the actual order of the MDP is exceeded.
Great job—really interesting insight, intriguing results. Do you really think the GP part is helping here, or would any learning algorithm have done more or less the same thing?
Thank you very much. I originally did think that GP would be special here, and that I would be able to use the variance to plot these curves instead of the error, but that was due to a poor understanding of GPs on my part. It turns out the variance has nothing to do with the variability of the value you are trying to estimate, but only with the distance of the query point from what the GP has already seen. That is why the variances for the x and y GPs are identical, even though the x value is more difficult to predict than the y value (because gravity makes the agent spend most of the time on the floor).
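That property is visible right in the GP posterior-variance formula, k(x*, x*) − k*ᵀ(K + σ²I)⁻¹k*: the training targets never appear. A small numpy sketch makes the point (RBF kernel; the lengthscale and data are made up) – notice the function doesn’t even take targets as an argument:

```python
import numpy as np

def gp_variance(X_train, X_query, lengthscale=1.0, noise=1e-6):
    """Posterior variance of an RBF-kernel GP. The training targets are not
    a parameter: variance depends only on where you have data."""
    def k(a, b):
        sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-0.5 * sq / lengthscale ** 2)
    K = k(X_train, X_train) + noise * np.eye(len(X_train))
    k_star = k(X_train, X_query)
    return 1.0 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)

X_train = np.array([[0.0], [1.0], [2.0]])
v_near = float(gp_variance(X_train, np.array([[1.0]])))   # on a training point
v_far = float(gp_variance(X_train, np.array([[10.0]])))   # far from all data
```

Train x-GP and y-GP on the same inputs and they report the same variance everywhere, regardless of how differently the two targets behave.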
I’ll try this out with other regression algorithms just to confirm.
I left a comment on the page suggesting that maybe it’s time to get some data from Pitfall! and try running on that. It might also be worth comparing to some other approach, so we can see if GP is responsible for the success or if you just set up the problem so well that any method would succeed.
I think both would be great. I mentioned the general idea to Carlos before I had results and he seemed interested to see what the results would be. I’ll try some other applications in domains I didn’t craft up and see how things go.
I’m wondering if we should try again to form a Pitfall! team and really do something dramatic…
I would be totally pumped. I think I’m going to work on the Pitfall! logs next and see if it finds that Harry’s location is not useful for predictions – in simulation for now, but hopefully I can get the emulator running (or at least some data) and then try it there.
I’d think Carlos or Andre could give you traces from the crocodiles or jumping or vine swinging—something with non-Markovian behavior you could try to model.
But aren’t these things Markovian if represented by a Markov chain of high enough order? For example, if I just look at the crocodiles now, I can’t really figure much of anything out, but given enough history I should be able to estimate almost perfectly how the mouths open and close – I just need the order of the MC to be large enough to capture one period of the mouth opening and closing.
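As a toy version of that (the “mouth” sequence here is invented, not from the game): a period-4 cycle where one step of history is ambiguous, but two steps pin down the phase exactly.

```python
from collections import defaultdict

def history_table_accuracy(seq, order):
    """Predict seq[t] from the previous `order` symbols with a lookup table
    (majority vote); evaluated on the same cycle it was built from."""
    table = defaultdict(list)
    for t in range(order, len(seq)):
        table[tuple(seq[t - order:t])].append(seq[t])
    hits = 0
    for t in range(order, len(seq)):
        votes = table[tuple(seq[t - order:t])]
        if max(set(votes), key=votes.count) == seq[t]:
            hits += 1
    return hits / (len(seq) - order)

# Stand-in for a croc mouth: shut -> half -> open -> half -> shut -> ...
mouth = [0, 1, 2, 1] * 50
acc1 = history_table_accuracy(mouth, 1)   # "half" is ambiguous with one step
acc2 = history_table_accuracy(mouth, 2)   # two steps disambiguate the phase
```

Seeing only “half open” you can’t tell whether the mouth is opening or closing; seeing the previous state too makes the sequence fully deterministic.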
Sorry, I was speaking informally. I just meant fully observable, Markov order 0.
Yes, there are several levels at work here, although I’m not sure quite what to call them:
* non-Markov, non-stationary: transitions can’t be modeled using “states”.
* infinite-dimensional POMDP: a weird class that can be modeled by PSRs but is probably not learnable or useful.
* finite-state POMDP: Can be modeled by an underlying Markov process that is not completely observable.
* finite-order Markov: Special kind of POMDP where the most recent k observations can be used as state.
* Markov, completely observable: Most recent observation can be used as state (k=1).
Something else I think is cool is that if you can learn these models, John and Istvan are thinking about how to do the planning part. We could glue all the pieces together and do something really neat.
That sounds excellent.
I should also point out that I believe this will solve other problems in Pitfall!, such as figuring out that log movement isn’t conditional on player movement while scorpion movement is. It’s basically just a technique that says which variables help predict a function and which don’t. Additionally, I thought the GP’s ability to report variance at a point would help, but it really doesn’t, since the variance is the same for both the x- and y-trained GPs.