Category Archives: MDPs

Masters Thesis: Automatic Decomposition of Continuous Action and State Spaces in Simulation-Based Planning. Colin Schepers


  1. Main goal of the work is to “interchange” the discretization components of:
    1. Tree Learning Search, which discretizes according to a tree of trees where each tree further decomposes the action space
      1. Underlying component is Incremental Regression Tree Induction (IRTI)
    2. HOLOP, where the tree decomposes a space that corresponds to a sequence of actions
      1. Underlying component is HOO
  2. TLS has the issue of throwing away data when new information arrives and trees must be discarded
  3. Also include the idea of transposition tables
  4. Work examines behavior, computation time, computational complexity, and memory of both algs
  5. There are also some new algorithms introduced that extend these algorithms that have better performance in certain situations


Ch 1: Introduction

  1. In terms of doing the actual action planning, talks of two possible options:
    1. Meta Tree Learning (MTL) is the more traditional approach to MCTS, called meta tree because it constructs a tree of trees
    2. Sequence Tree Learning (STL) explodes the sequence into a large space where each dimension corresponds to one step in the sequence (what HOLOP does)
  2. IRTI and HOO can be combined with either MTL or STL to get 4 planning algorithms (originially TLS was coupled with IRTI and HOLOP with STL)
    1. IRTI x MTL = Regression-based Meta Tree Learning (RMTL), very similar to TLS
    2. STL x HOO = Hierarchical Optimistic Sequence Tree Learning (HOSTL), very similar to HOLOP
    3. IRTI x STL = Regression-based Sequence Tree Learning (RSTL), introduced in this work
    4. MTL x HOO = Hierarchical Optimistic Meta Tree Learning (HOMTL), also introduced in this work
  3. The novelty of using transposition tables here is due to the fact that the spaces considered here are continuous, not discrete as is the case with most transposition tables. Two methods are proposed to do this
  4. The Basic Research Questions for the Thesis are:
    1. Can the IRTI component of TLS be replaced by the HOO component of HOLOP
    2. Can the HOO component of HOLOP be replaced by the IRTI component of TLS
    3. How can the information retreived from simulations be reused for TLS
    4. Can a transposition tree increase the performance of simulation based systems in continuous environments
    5. Which combination of techniques is best
    6. What is the memory cost of these algorithms
    7. What is the computational time complexity of the algorithms

Ch 2: Preliminary Knowledge

  1. Discusses use of Zobrish hashing for high-performance transition tables.  I’ve never heard of it.

Ch 3: Automatic Decomposition

  1. IRTI checks all possible tests (splits) in a leaf node, and takes the one with the highest information gain.  If that split yields two different leaves that are statistically significantly different (F-test), the split is made
  2. <p. 14 the rule used by HOO to calculate b-scores is very reminiscent of UCB as well>
  3. <p. 14 in an extended version of the HOO paper on Arxiv, a version of the algorithm is presented where n_0 doesn’t have to be recalculated at each step, if the total number of pulls is known before sampling starts.  This can make the algorithm much more efficient (log n per step instead of n)>
  4. Discussion of scaling of the exploration factor (commonly called C)
    1. The propose scaling it according to the range of values seen from each node, then modifying it by a factor which is called k
    2. <p. 16, in practice, I definitely buy that this helps, but in certain situations this rule will cause problems (when all rewards observed in a node are the same, there will be no bias, for example)>
    3. But they also mention that if vmin, vmax are known for the problem it can be used, so thats all fine.  If you don’t know that you need to resort to something like above
  5. Discuss ways of conserving memory, such as not allowing leaves to split that have too few samples.  Capping tree depth isn’t mentioned, but is also a reasonable option

Ch 4: Multi-step Optimization in Continuous Environments

  1. In normal MCTS, an edge represents 1 action, and a node represents 1 state. In Regression-based Meta Tree Learning, an edge represents a range of actions, and a node represents a range of states
  2. When selecting an action, UCT rules are used (basically just ignore that edges and nodes represent a range and use the samples)
  3. If a sample causes a split in a leaf in an internal leaf, the trees below that point must be discarded
    1. Replay can be used once trees are discarded, but this is somewhat expensive.  It is cheaper to do sequence planning and pass values into the children on splits (sequence planning, though, has the drawback that it can’t form closed-loop plans)
  4. <p. 21 It isn’t the case that the given splitting rule for HOLOP/HOSTL that prioritizes early actions has to be used, but it is one method that intuitively makes sense, and also doesn’t ruin the original proofs of HOO>
  5. <In general, the way the material is presented respects the original algorithms this paper builds on too much.  It basically talks about TLS and HOLOP (which are both unique in both aspects of how to do decomposition as well as plan over sequences) and then crosses them over.  It would be easier for most readers not already familiar with the history of previous publications to present for example the Meta Tree Learning algorithms and then the Sequence Tree Learning algorithms, or something like that.  It starts with the corners of a 2×2 table describing the attributes of the algorithms instead of starting with a row or column>

Ch 5: Continuous Transposition Tree

  1. Two methods for constructing transposition tables are introduced
  2. In Level-based Transposition Trees, there is a clear distinction between the discretizations over the state and action spaces
  3. <p. 24 “This process [of descending a decision tree over the state space] is expected to be less computationally expensive because it does not require any complex computations like the UCT formula.”  Is there really any significant difference between the costs of both aside from constant level operations which are very cheap anyway?>
  4. One important feature of using the transposition table is that it allows planning to be stored – otherwise planning always has to be redone from every current state (of course it can also make planning cheaper further down the tree)
  5. In LTT a decision tree decomposes the state space.  From each leaf in that tree over the state space, another decision tree is rooted that decomposes the action space
    1. Still suffers from the need to discard action trees on splits in the state-tree leaves, but its easy to do replay in this setting
  6. In a Mixed Transposition Tree (MTT), the tree is built both in terms of states and actions (as opposed to states and then actions, as in LTT); a node represents both spaces and an edge from parent to child represents a split in either the state or action dimension
  7. When MTTs are used, from root to leaf states are traversed according to the query state (naturally), while actions are followed according to those that has the highest value according to a particular computation, I think it is basically like UCB, which allows for exploration.
  8. The advantage MTT has over LTT is that trees do not have to be discarded and rebuilt
  9. <In general, this thesis has a lot of verbiage, but is short on diagrams or other items that would express points succinctly, sometimes the material is presented in a way that is a bit vague and the only way to resolve the issue is to go back to pseudocode and grok it>

Ch 6: Experiments and Results

  1. UCT here isn’t the normal version of UCT, somewhat between HOSTL and IRTI
  2. A couple of the test domains are regular 1-step optimization problems.  Two more are multi step problems.  One is navigating in a circle (agent chooses angle to move in), and the other is cart-pole
  3. For the 1-step optimization problems, comparisons are also made with a random agent, a vanilla MC agent (random but choses best found), and a version of UCT that uses unifom discretization
  4. In 1-step optimization, UCT learns a good policy quickly, but HOO eventually outperforms it with near optimal performance; the regret plots are most illustrative of performance.  IRTI is worse than HOO in both cases (in one case better than UCT by the end of the experiment and in one case worse, but anyway as more time is given IRTI will beat UCT anyway)
  5. Good parameterizations are found for each problem separately.
  6. When constraining the algorithms on time instead of samples, HOO performs worst (due to polynomial time complexity) <as mentioned, an nlogn version can be implemented, though>.  When time is the main constraint IRTI performs best, vanilla-MC actually outperfoms UCT
  7. In multistep experiments, all four algorithms (2×2 grid plus random and vanilla mc) are examined
  8. In donut world, when samples are constrained, HOSTL (HOLOP) performs best, IRTI x STL is 2nd best
  9. In stark contrast, in the cart-pole domain, HOSTL is worst besides random (even vanilla MC is better).
    1. Here, RMTL (original tree-learning search) performs best by a wide margin
    2. The domain used here is more challenging than the one I’ve used as it doesn’t allow for much deviation of the pole
    3. This is under time constraints
  10. <p. 35They test MTT by itself which I’m not exactly groking because I thought it was a technique for allowing tranposition tables to be used and not a policy, but I’m missing something>
  11. Have very nice experiments of sample distributions on 1-step optimization problems
  12. The HOO-family of algorithms have by far the worst memory use.  The regression-tree algorithms use the least memory <I guess cause they are constantly throwing trees out?  Even if they are doing replay, the trees are probably much more compact than the HOO versions because it requires a test confirming statistical significance to split>
  13. In terms of measured computation time, not surprisingly HOO is slowest, but what is interesting is that IRTI is faster than UCT <Impressive.  It is because UCT ultimately ends up building a larger tree>

Ch 7: Experiments and Results

  1. It is actually unclear whether transposition tables in the manner used are helpful (sometimes they help and sometimes they do not)
  2. HOSTL (HOLOP) is best when the domain isn’t time constrained, but RMTL(TLS) is best when there are time constraints as it is very efficient because trees are built fairly compactly
  3. While HOO-based algorithms used the most memory, there were no issues of exhausting memory during experiments or anything like that

Ideas from Decision Theoretic Planning

Decision Theoretic Planning, Boutilier, 1999:

  • Fully Observable MDPs can be solved in time poly to |state|*|action|*|inputs|, most commonly used are value, policy iteration (p. 24).
    • Finite-horizon requires O(|S|^2 |A|), computation per iteration, infinite horizon is O(|S|^2 |A| + |S|^3) per iteration. Both converge in a polynomial number of iterations.
  • POMDPs are exponentially hard in |state| and |horizon| (p.25).
    • The problem of finding a policy that maximizes (or approximately maximizes) expected discounted reward over an infinite horizon is undecidable.
    • Littman (1996) examines a special case of POMDP where boolean rewards are given and the goal is to find an infinite horizon policy with nonzero total reward given that the rewards associated with all states are non-negative.
  • In probabilistic goal-oriented planning for POMDPs, search is usually done for a probability distribution over states. Even simplest problem in this domain is undecidable. Why? The set of states may be finite, but the set of distributions over the states is not, and worst case the agent may have to search an infinite number of plans before being able to determine whether a solution exists (p. 25)
  • Policies under finite-horizon problems are generally non-stationary (even if the MDP is), whereas policies under infinite-horizon problems are stationary (p.29)
  • In value iteration, the sequence of estimated value functions converges linearly to the optimum (p. 29).
  • Policy iteration can converge quadratically, but in practice converges faster than value iteration (p.30).
  • To best make decisions in POMDPs, it is necessary to look not only at the most recent observation, but the entire history of the current trajectory. History dependent policies grow in size exponential to the horizon length. Can instead use a probability distribution over states (belief states), and from this distribution a policy can be computed (p. 31).
    • When using belief states, the problem becomes one that can be viewed as a fully observable MDP.
  • RTDP (Barto et al 1995) is an online asynchronous value iteration algorithm (p. 36).
  • In a DBN, diachronic arcs go across time (from t-1 to t), and synchronic arcs stay in the same time frame (from t to t).
  • It is possible to restructure the graphs so that a process that required synchronic arcs can be modeled without synchronic arcs, but this is basically just graph restructuring trickery (p. 50).
  • Aggregation techniques can be used to group regions of the state space together to allow for more efficient (less granular) computation to be done.  Worst-case costs for value function calculation is cubic in the size of the largest cluster (p. 58).
  • Goal regression is the process of achieving subgoals, each of which brings the state of the system closer to the goal state, without undoing the work a previous subgoal accomplished (p. 60).
  • A technique called model minimization attempts to find a minimal (or at least smaller) representation of the MDP that is equivalent.  Takes idea from state machines (p. 70).
  • Problem decomposition takes the MDP and breaks it up into a number of independent MDPs, each of which has a state which solves some aspect of the original MDP (p. 71).
  • There is an algorithm that identifies recurrent classes of an MDP as represented by a 2DBN (2-step dynamic bayes net, Boutilier, Puterman 1995) (p. 75).
  • Can break up an MDP into a set of parallel MDPs that run together (p. 79).
Tagged ,

Ideas from Learning the Structure of Dynamic Probabilistic Networks

From Learning the Structure of Dynamic Probabilistic Networks by Freidman, Murphy, Russell:

scores combine the likelihood of the data according to the network … with some penalty relating to the complexity of the network. When learning the structure of PNs, a complexity penalty is essential since the maximum-likelihood network is usually the completely connected network.

Well, one nice thing about my post on Estimating the order of an infinite state MDP with Gaussian Processes is that because of the regression, variables that don’t help predict the desired value increase the error.  This error itself acts as a regularizer that metrics for scoring learned DBNs need to have added artificially.

Paper also discusses Friedman’s (Learning belief networks in the presence of
missing values and hidden variables, 1997)  Structural EM (SEM) algorithm, which can modify the structure of the DBN being learned during the M-step.

Estimating the order with other regression algorithms

When you asked if the property that seemed to show up in the last experiment was something that I thought was particular to GPs, I said no, that I figured that other regression techniques should also perform similarly.

Well, thanks to Weka, by just changing one line of code I was able to perform that experiment with a number of different regression algorithms, and the results are mixed.

Just for reference, this was the result with GPs (all plots are mean absolute error):

GP error

GP error

Which looks very clean, and almost parabolic.  Now here are the corresponding plots for other regression algorithms:

Linear Regression:

Linear regression error, order 1 to 20

Linear regression error, order 1 to 20

The order/error graph for LinReg is pretty well behaved, but oddly the error spikes at 2.  There seems to be some flat regions and a then high degree of change in a short period of time with some noise.  The minimum error is at order 9.


Perceptron error vs order

Perceptron error vs order

This is the noisiest graph produced out of any of the regression algorithms I tried.  Even still there is a noticeable gain in error as order increases.  Like GPs, Perceptrons had the smallest error at order 2.

SVM Regression:

SVM Regression error, order 1 to 6

SVM Regression error, order 1 to 6

This one is interesting, the error is very low and flat for all values aside from at 1.  Even still, the error at order 2 is smallest than all other values I tried by a small amount.  I don’t know much about SVM regression though, so its tough for me to interpret this at all.

I also tried Isotonic Regression (I don’t know what that is either), and the results were almost perfectly flat and consistent, with error minimized at order 1, the error for order 1 to 14 were all between 13.7 and 13.9, so Isotonic Regression seems to give you nothing for this approach.


Like GPs, SVM Regression and Perceptrons had error minimized at 2.  Isotonic had minimum error at order 1.  LinReg was quite far off at 9.  In terms of the lowest observed error over all algorithms at any one point Perceptrons actually performed the best.

SVM and Isotonic Regression had very slowly growing error as a function of order, so the fact that they perform well in those cases actually makes it poorly suited for this approach.

The general shape of the graph with LinReg and Perceptrons seemed to be similar to that demonstrated by GPs, but with a higher degree of noise, which I suppose also makes them less well suited for the task.

Based *only* on this experiment it seems like GPs may be best for this approach, as the graph looks the best behaved, and seemed to find the “right” value.  This is a highly qualified statement, in that the results here seem weak.  If this experiment could be reproduced with similar results in another domain, I would be more confident of that statement.

Estimating the order of an infinite state MDP with Gaussian Processes

This idea came to me the evening after our meeting on Wednesday. For some reason I was thinking about modeling the ghost movement in Pac-Man when I thought of it, but it seemed like it may be well suited to Pitfall! (and hopefully many other things as well).

Here’s the idea in a few sentences: Given that you are observing state, action transitions of some MDP with no knowledge of the underlying MDP, how would you go about estimating the order of the MDP? Maybe its reasonable to try using standard regression techniques, and try regressing a number of functions, and see which does best. What I mean by a number of functions is build each regressor to use a different amount of history (the most recent time steps), and then whichever regressor performs best tells us an estimated order of the MDP.

Yes, using past frames is definitely a way to make better predictions.

Cool – is this general idea already a well-known / explored one?

Using history to make up for missing state information?  Sure.

Lets go back to Pitfall! for a second – at a couple of meetings I was at we were at we discussing how to model jumping, and the more we thought about it, the more complicated the idea got. As I recall the basic thing we were saying was, well if the emulator gives you one snapshot of the guy in the air, how do you know where he’ll be when he lands?

It seems to me like thats a very restricted way of looking at the problem, and that if you look at not just the most recent snapshot, but also a number of previous snapshots, it should be pretty simple to figure out where the trajectory is going to take Pitfall Harry when he lands. Thats because you can’t estimate velocity with just one sample – if you see Pitfall Harry in the air he could just as easily be jumping straight up or forward – there’s no way of knowing.

Right? So more generally, for some moving body, consider that you are given snapshots every tick, and the goal is to guess where its going to be next. The way I see things, given the data we have, position is an order 1 variable, but velocity is an order 2 variable. So given the most recent snapshot, and the one before, we know both where it is and where its going. Would having a history of the past 10 snapshots help? Not really if the dynamics are simple (I’m using this as a concrete example, but I think this is more generally applicable).

So, I wanted to test the idea out, and since it seemed like it lended itself well to the Pitfall! work, I figured I’d hack up a simple physics engine and track the movement of a body which has dynamics basically as follows:

  • It can “run” to the left or the right, doing so repeatedly will keep increasing velocity until some maximum velocity is reached.
  • There is friction, so if its going quickly to the right and tries to go left, it will continue going right, just slower. Exerting still more force to the left will eventually make it move left.
  • Simple gravity, it can “jump” if it is currently on the floor. While in the air, trying to run or jump don’t work.
  • There are walls, if it hits the wall it just stops there.

Cool.  Although, since you hacked it up, it violates the RL3 prime directive.  Nonetheless, it makes it easier to collect data fast at this stage, so I’m not objecting.

Fair enough – right I did this more as a proof of concept, I’d be happy to try it on things not of my own doing in the future, I was thinking aside from pitfall, stocks might also be an interesting application?  Just an idea, could be anything.

I’d be more inclined to some other game.  Stocks are a bit of a
loaded concept right now.

Haha – fair enough.

I had the agent behave randomly, with the following behavior: Each tick go left, right, or jump each with 5% probability. This actually gives pretty good and varied movement. I logged the position and action at each time tick, where the action is the force exerted to the left, right, or up (even if it was in the air and that force did nothing). I have it all nicely animated to keep me entertained.

Once I had all this logged, I assembled GP regressors, one to predict the X position and one to predict the Y position. I made a number of them, from what I will call “order” 0 to order 5 (although I have a feeling I’m abusing this term). Anyway, the value representation for each order goes as follows:

  • Order 0: you know nothing
  • Order 1: current values of [x, y, force_x, force_y]
  • Order 2: everything in order 1, and also those values for the step preceding whats in order 1
  • Order 3: everything in order 2, and also those values for the step preceding whats in order 2
  • etc

So, how much data do you need to most accurately predict where the agent will be next? Clearly in order 0 you are hosed. As I explained above, order 1 is pretty tough, because you don’t have any idea of velocity (but you do know your change in velocity from the action), and order 2 gives you what you need. My prediction was that eventually more isn’t better, so there should be some bowl shaped mean error function over order – too little isn’t good because there’s no way to do the estimates, but too much is also bad because your training points may not have something close to what you just saw (imagine an order 50 regressor, good luck finding a training point similar to that), making estimation tough.

Or “data sparsity”, more generally.

So, here’s the show me the money section:

The x component is red, the y component is green. Both the train set and the test set were 2000 samples large.

mean absolute error, x-axis is order of regressor, y is error

mean absolute error, x-axis is order of regressor, y is error

mean squared error, x-axis is order, y-axis is error^2

mean squared error, x-axis is order, y-axis is error^2

Ok, that order 0 regressor sucks. Lets re-plot that starting at order 1 so we can actually see whats going on.

mean absolute error, order 1 to 5

mean absolute error, order 1 to 5

mean squared error, order 1 to 5

mean squared error, order 1 to 5

Now lets make it pop by summing both x and y errors:

x+y mean absolute error

x + y mean absolute error

x + y mean squared error

x + y mean squared error

So that pretty clearly shows that order 2 is the best way to go in terms of both mean absolute error and MSE. I claim success.

And to tie it up, the variance:

mean GP variance over sample set

mean GP variance over sample set

Which I interpret as the higher the order, the more difficult it is to find similar points, which makes sense because the state space is exploding. This explains why you can’t just throw the kitchen sink at the problem and call every MDP an order 1000 MDP and just have it work. With an order too small you cant do the estimate, but with an order too high the variance just causes lots of trouble. That is evident in the apparent superlinear growth of error as the actual order of the MDP is exceeded.

Great job—really interesting insight, intriguing results.  Do you really think the GP part is helping here, or would any learning algorithm have done more or less the same thing?

Thank you very much.  I originally did think that GP would be special for this, and that I would be able to use the variance to plot these curves instead of the error, but that was due to a poor understanding of GPs on my part.  It turns out the variance has nothing to do with the variability of the value you are trying to estimate, but rather just the distance of the query point from what its already seen, which is why the variance for both the x and y GPs are identical, even though the x value is more difficult to predict than the y variable (because gravity makes the agent spend most of the time on the floor).

I’ll try this out with other regression algorithms just to confirm.

I left a comment on the page suggesting that maybe it’s time to get some data from Pitfall! and try running on that.  It might also be worth comparing to some other approach so we can see if GP is
responsible for the success or if you just set up the problem so well that any method would succeed.

I think both would be great.  I mentioned the general idea to Carlos before I had results and he seemed interested to see what the results would be.  I’ll try some other applications in domains I didn’t craft up and see how things go.

I’m wondering if we should try again to form a Pitfall! team and really do something dramatic…

I would be totally pumped.  I think I’m gonna try and work on the Pitfall logs in simulation next and see if it finds that Harry’s location is not useful to help predictions, in simulation for now, but hopefully I can get the emulator running (or at least some data) and then try it there.

I’d think Carlos or Andre could give you traces from the crocodiles or jumping or vine swinging—something with non-Markovian behavior you could try to model.

But aren’t these things Markovian if represented by an MC with high enough order?  For example, if i just look at the crocodiles now, I can’t really figure much of anything out, but given enough history I should be able to estimate almost perfectly how the mouths open and close – I just need the order of the MC to be long enough to capture one period of the mouth opening and closing?

Sorry, I was speaking informally.  I just meant fully observable, Markov order 0.

Yes, there are several levels at work here, although I’m not sure
quite what to call them:

* non-Markov, non-stationary: transitions can’t be captured modeled using “states”.
* infinite-dimensional POMDP: A weird class that can be modeled by PSRs but probably not learnable or useful.
* finite-state POMDP: Can be modeled by an underlying Markov process that is not completely observable.
* finite-order Markov: Special kind of POMDP where the most recent k observations can be used as state.
* Markov, completely observable: Most recent observation can be used as state (k=1).

Something else I think is cool is that if you can learn these models, John and Istvan are thinking about how to do the planning part.  We could glue all the pieces together and do something really neat.

That sounds excellent.


I should also point out I believe this will solve other problems in Pitfall, such as the ability to figure that log movement isn’t conditional on player movement, and that scorpion movement is.  Its basically just a technique that says what variables help predict a function and what don’t.  Additionally, I thought the ability of the GP to report variance at a point would help, but it really doesn’t, since its the same for both x and y trained GPs.