## Adaptive Submodularity: A New Approach to Active Learning and Stochastic Optimization. Glovin, Krause, Ray. Talk (NIPS 2010)

From http://videolectures.net/nipsworkshops2010_golovin_asm/

1. Basic idea is consider submodular functions, they they are stochastic and unfold over time.  So instead of making all decisions in one shot (before execution begins), better to wait to see how the first decision unfolds before making the second, etc.
1. Gives the example of influence in social networks – if you want to get people to plug a product you give to a few people for free, see how influence is spreading from the first person before deciding the second
2. The setting described here is a generalization of standard suboptimality
3. Analysis here allows many results from standard submodular optimization to the adaptive setting
4. Instead of maximizing margin, maximizes expected margin based on current state, but does this only for the first selection, and the results of the first selection on state then become conditions for the next choice
5. Gets standard bounds of 1-1/e
6. Because adaptive, can now do things like “take the minimal number of actions needed to achieve X”, which you couldn’t do with standard submodularity (there, you can just say “maximize the benefit of N choices”
1. This is a log-approximation ln(n)+1 for stochastic set cover
7. Another application optimal decision trees
1. “Generalized binary search” which is equivalent to max info gain
2. This problem is adaptive submodular <I thought things like xor in decision trees arent submodular, but maybe because we are talking w.r.t. information gain it is?>
3. Give bounds for this approximation
4. “Results require that tests are exact (no noise)!”
8. If trying to do decision trees in noisy setting, can do a Bayesian version
1. But generalized binary search, maxing info gain, and maxing value of info aren’t adaptive submodular in noisy setting.  In noisy settings they require exponentially more tests than optimum <Should really read this paper>
9. Trick is to transform noisy problem into normal problem by making outcome part of the hypothesis
1. Only need to distinguish between noisy hypotheses that lead to different decisions
10. Gives bounds for this Bayesian active learning for noisy domains
11. Give an example of testing different psyhological/economic models of how people perform decision making
1. For this setting, turns out many common approaches actually do much worse than random

## Masters Thesis: Automatic Decomposition of Continuous Action and State Spaces in Simulation-Based Planning. Colin Schepers

Abstract

1. Main goal of the work is to “interchange” the discretization components of:
1. Tree Learning Search, which discretizes according to a tree of trees where each tree further decomposes the action space
1. Underlying component is Incremental Regression Tree Induction (IRTI)
2. HOLOP, where the tree decomposes a space that corresponds to a sequence of actions
1. Underlying component is HOO
2. TLS has the issue of throwing away data when new information arrives and trees must be discarded
3. Also include the idea of transposition tables
4. Work examines behavior, computation time, computational complexity, and memory of both algs
5. There are also some new algorithms introduced that extend these algorithms that have better performance in certain situations

cool.

Ch 1: Introduction

1. In terms of doing the actual action planning, talks of two possible options:
1. Meta Tree Learning (MTL) is the more traditional approach to MCTS, called meta tree because it constructs a tree of trees
2. Sequence Tree Learning (STL) explodes the sequence into a large space where each dimension corresponds to one step in the sequence (what HOLOP does)
2. IRTI and HOO can be combined with either MTL or STL to get 4 planning algorithms (originially TLS was coupled with IRTI and HOLOP with STL)
1. IRTI x MTL = Regression-based Meta Tree Learning (RMTL), very similar to TLS
2. STL x HOO = Hierarchical Optimistic Sequence Tree Learning (HOSTL), very similar to HOLOP
3. IRTI x STL = Regression-based Sequence Tree Learning (RSTL), introduced in this work
4. MTL x HOO = Hierarchical Optimistic Meta Tree Learning (HOMTL), also introduced in this work
3. The novelty of using transposition tables here is due to the fact that the spaces considered here are continuous, not discrete as is the case with most transposition tables. Two methods are proposed to do this
4. The Basic Research Questions for the Thesis are:
1. Can the IRTI component of TLS be replaced by the HOO component of HOLOP
2. Can the HOO component of HOLOP be replaced by the IRTI component of TLS
3. How can the information retreived from simulations be reused for TLS
4. Can a transposition tree increase the performance of simulation based systems in continuous environments
5. Which combination of techniques is best
6. What is the memory cost of these algorithms
7. What is the computational time complexity of the algorithms

Ch 2: Preliminary Knowledge

1. Discusses use of Zobrish hashing for high-performance transition tables.  I’ve never heard of it.

Ch 3: Automatic Decomposition

1. IRTI checks all possible tests (splits) in a leaf node, and takes the one with the highest information gain.  If that split yields two different leaves that are statistically significantly different (F-test), the split is made
2. <p. 14 the rule used by HOO to calculate b-scores is very reminiscent of UCB as well>
3. <p. 14 in an extended version of the HOO paper on Arxiv, a version of the algorithm is presented where n_0 doesn’t have to be recalculated at each step, if the total number of pulls is known before sampling starts.  This can make the algorithm much more efficient (log n per step instead of n)>
4. Discussion of scaling of the exploration factor (commonly called C)
1. The propose scaling it according to the range of values seen from each node, then modifying it by a factor which is called k
2. <p. 16, in practice, I definitely buy that this helps, but in certain situations this rule will cause problems (when all rewards observed in a node are the same, there will be no bias, for example)>
3. But they also mention that if vmin, vmax are known for the problem it can be used, so thats all fine.  If you don’t know that you need to resort to something like above
5. Discuss ways of conserving memory, such as not allowing leaves to split that have too few samples.  Capping tree depth isn’t mentioned, but is also a reasonable option

Ch 4: Multi-step Optimization in Continuous Environments

1. In normal MCTS, an edge represents 1 action, and a node represents 1 state. In Regression-based Meta Tree Learning, an edge represents a range of actions, and a node represents a range of states
2. When selecting an action, UCT rules are used (basically just ignore that edges and nodes represent a range and use the samples)
3. If a sample causes a split in a leaf in an internal leaf, the trees below that point must be discarded
1. Replay can be used once trees are discarded, but this is somewhat expensive.  It is cheaper to do sequence planning and pass values into the children on splits (sequence planning, though, has the drawback that it can’t form closed-loop plans)
4. <p. 21 It isn’t the case that the given splitting rule for HOLOP/HOSTL that prioritizes early actions has to be used, but it is one method that intuitively makes sense, and also doesn’t ruin the original proofs of HOO>
5. <In general, the way the material is presented respects the original algorithms this paper builds on too much.  It basically talks about TLS and HOLOP (which are both unique in both aspects of how to do decomposition as well as plan over sequences) and then crosses them over.  It would be easier for most readers not already familiar with the history of previous publications to present for example the Meta Tree Learning algorithms and then the Sequence Tree Learning algorithms, or something like that.  It starts with the corners of a 2×2 table describing the attributes of the algorithms instead of starting with a row or column>

Ch 5: Continuous Transposition Tree

1. Two methods for constructing transposition tables are introduced
2. In Level-based Transposition Trees, there is a clear distinction between the discretizations over the state and action spaces
3. <p. 24 “This process [of descending a decision tree over the state space] is expected to be less computationally expensive because it does not require any complex computations like the UCT formula.”  Is there really any significant difference between the costs of both aside from constant level operations which are very cheap anyway?>
4. One important feature of using the transposition table is that it allows planning to be stored – otherwise planning always has to be redone from every current state (of course it can also make planning cheaper further down the tree)
5. In LTT a decision tree decomposes the state space.  From each leaf in that tree over the state space, another decision tree is rooted that decomposes the action space
1. Still suffers from the need to discard action trees on splits in the state-tree leaves, but its easy to do replay in this setting
6. In a Mixed Transposition Tree (MTT), the tree is built both in terms of states and actions (as opposed to states and then actions, as in LTT); a node represents both spaces and an edge from parent to child represents a split in either the state or action dimension
7. When MTTs are used, from root to leaf states are traversed according to the query state (naturally), while actions are followed according to those that has the highest value according to a particular computation, I think it is basically like UCB, which allows for exploration.
8. The advantage MTT has over LTT is that trees do not have to be discarded and rebuilt
9. <In general, this thesis has a lot of verbiage, but is short on diagrams or other items that would express points succinctly, sometimes the material is presented in a way that is a bit vague and the only way to resolve the issue is to go back to pseudocode and grok it>

Ch 6: Experiments and Results

1. UCT here isn’t the normal version of UCT, somewhat between HOSTL and IRTI
2. A couple of the test domains are regular 1-step optimization problems.  Two more are multi step problems.  One is navigating in a circle (agent chooses angle to move in), and the other is cart-pole
3. For the 1-step optimization problems, comparisons are also made with a random agent, a vanilla MC agent (random but choses best found), and a version of UCT that uses unifom discretization
4. In 1-step optimization, UCT learns a good policy quickly, but HOO eventually outperforms it with near optimal performance; the regret plots are most illustrative of performance.  IRTI is worse than HOO in both cases (in one case better than UCT by the end of the experiment and in one case worse, but anyway as more time is given IRTI will beat UCT anyway)
5. Good parameterizations are found for each problem separately.
6. When constraining the algorithms on time instead of samples, HOO performs worst (due to polynomial time complexity) <as mentioned, an nlogn version can be implemented, though>.  When time is the main constraint IRTI performs best, vanilla-MC actually outperfoms UCT
7. In multistep experiments, all four algorithms (2×2 grid plus random and vanilla mc) are examined
8. In donut world, when samples are constrained, HOSTL (HOLOP) performs best, IRTI x STL is 2nd best
9. In stark contrast, in the cart-pole domain, HOSTL is worst besides random (even vanilla MC is better).
1. Here, RMTL (original tree-learning search) performs best by a wide margin
2. The domain used here is more challenging than the one I’ve used as it doesn’t allow for much deviation of the pole
3. This is under time constraints
10. <p. 35They test MTT by itself which I’m not exactly groking because I thought it was a technique for allowing tranposition tables to be used and not a policy, but I’m missing something>
11. Have very nice experiments of sample distributions on 1-step optimization problems
12. The HOO-family of algorithms have by far the worst memory use.  The regression-tree algorithms use the least memory <I guess cause they are constantly throwing trees out?  Even if they are doing replay, the trees are probably much more compact than the HOO versions because it requires a test confirming statistical significance to split>
13. In terms of measured computation time, not surprisingly HOO is slowest, but what is interesting is that IRTI is faster than UCT <Impressive.  It is because UCT ultimately ends up building a larger tree>

Ch 7: Experiments and Results

1. It is actually unclear whether transposition tables in the manner used are helpful (sometimes they help and sometimes they do not)
2. HOSTL (HOLOP) is best when the domain isn’t time constrained, but RMTL(TLS) is best when there are time constraints as it is very efficient because trees are built fairly compactly
3. While HOO-based algorithms used the most memory, there were no issues of exhausting memory during experiments or anything like that

## What can 20,000 models teach us? Lessons learned from a large-scale empirical comparison of learning algorithms Alexandru Niculescu-Mizil. Talk

1. This talk covered the speaker’s experience in implementing a large number of supervised learning methods and his observations of performance.
2. There were some parts that some members of the audience disagreed with, such as the way probabilities were inferred from the classifiers (not all of them are designed to allow this, so making this happen is a bit off-label), and the way regression algorithms were made to perform classification is a problem not restricted to this talk, but overall there was definitely a large amount of useful information.  Its also worth pointing out the speaker was very open about where we should draw conclusions from results and where not to, which was refreshing.
1. Also worth pointing out that he found new parameterizations for each validation fold, which I thought is normally not done, I think this helps brittle algorithms too much
2. He had many different metrics and combined them in a way I thought could be improved upon (although it was reasonable).  Things were normalized between performance probability levels and the best algorithm; I would have used z-scores or something similar.
3. The speaker pointed out the problems he worked on were of medium size (for classification, binary, 10-200 features, 5000 (I think I dropped a zero samples)), so this is not big data, but definitely medium-sized
4. 10 learning algorithms with numerous parameterizations tested, which is where the 20,000 models comes from
5. The performance of the algorithms as initially tested was surprising.   The order from best to worst was:
1. Bagged Trees
2. Random Forests
3. ANNs (I was very surprised to see them this high)
4. Boosted Trees (I expected these above bagged trees and random forests)
5. KNN (I thought this would be lower)
6. SVMs (This low?  I feel like something may have been funny with the implementation or parameter search)
7. Boosted Stumps
8. Decision Trees (This was a shocker)
9. Logistic Regression
10. Naive Bayes (Naturally)
6. Boosted trees in particular was #1 in the most categories, but was particularly bad in a few
7. Also, almost every algorithm was best in at least one dataset and/or metric overall, so the speaker was clear in pointing out that this list should only inform you of the order to try methods on your particular problem until you find one that works
8. Discussed the use of Platt scaling and Isotonic regression to correct some problems in boosted trees, after which the algorithms performed better than anything else.  I had not heard of these approaches before.
9. There was a lot of focus on calibration in the talk
10. The punchline?  Use ensemble methods.  The ensembles by far had consistently the best performance, unlike the best non-ensemble methods, that were outperformed depending on the task.
11. Overall, he said the best performance metric is cross entropy.  I am familiar with this phrase in terms of optimization only.

## Update on Cached Planning

In my qual, I discussed an approach to RL that was focused on developing a system which could deployed successfully on robotic platforms.  The system presented there seemed to be quite effective and efficient in the experiments I ran, but one piece didn’t work.

The algorithm which I called “TCP” for Tree-based cached planning is designed to allow the algorithm to react to queries for policy very quickly even if planning takes much longer.  The idea is to decompose the state space by using a tree, and to then store a policy recommendation for an action to take in the region represented by a leaf.  The method which I first tried (splitting in the same manner used by MRE/KD-Trees) was successful when a generative model was already available, and also worked reasonably with epslion-greedy exploration.  When using MRE, the results tended to be unstable, varying dramatically between pretty good and terrible behavior.

After thinking about the issue for a while, I came up with a reason that may describe why the previous method was not successful.  We use a method of splitting in KD-Trees because we believe we have enough samples in a region that we can partition that region into a smaller volume and still have enough samples to inform an accurate decision about the function being approximated.

Using this method of splitting to represent the policy makes less sense because the policy is computed by performing rollouts using the estimated transition and reward functions (along with Smaxes thrown in).  Therefore, we want to split a cell representing the policy only when we have enough information to make an accurate decision as to what the policy looks like.  That is, we need precision for a string of T(s,a),R(s,a), as opposed to just one.

Based on this idea, the algorithm is changed slightly.  The policy tree is informed of the percent of times Smax was encountered during planning (doing rollouts) from a particular state.  If that percent drops below a predetermined value, then we decide that we know a part of the region well enough to partition it up to make more refined decisions, and we make a cut in that region.

Here’s a video of the algorithm running in the Double Integrator domain.  Here the threshold for cutting was set to be below 5%, so that if visits to Smax was less than 5% of all the state visits during planning, a cut is made.

Also, a graph of the performance: The performance here is not terribly far from the performance when not using cached planning.  With TCP the average cumulative reward seems to converge to something around -4.0.  If we replan at every step with the same number of trajectories per planning step (100) the performance is around -2.8, so there is a performance hit, as we would expect, but I think it is within reason.  Although it is not directly a fair comparison to make, if we replan at every step, the performance converges after about 10 trials, whereas here it seems to happen after about 75 trials.

I have to test this out in larger domains, but if it holds up ok (not a trivial thing to hope for) we can move on to trying it on a robot.

## Ideas from papers on Random Decision Trees

Is random model better? On its accuracy and efficiency, Wei Fan, Haixun Wang, Philip S. Yu, and Sheng Ma (ICDM 03?)

• Basic idea is that instead of trying to build one optimal classifier, build many unsophisticated ones and average the predictions over all unsophisticated classifiers
• The specific proposal here is to use random decision trees, which are trees that terminate at a predetermined depth and branch on a random feature at a random boundary at each branch
• Can be modified to accept missing features
• This method doesn’t fall into the classic xor problem that most incremental decision tree learners fall into – RDTs don’t even look at the data at the time the tree is built
• Argument that it is optimal to maximize tree diversity.  For example, we could build trees so high that each sample is recorded exactly in the tree, but this isn’t what we are looking for because we want to average over many trees, and when the trees are all too deep they will all give basically the same prediction
• Based on simple combinatorics, if there are k features, the largest number of possible ways of combining a subset (of size i) of those features is  i choose k, which is maximal at i = k/2
• This means maximal diversity is reached when the tree depth is capped at k/2.  Empirical data backs this up
• In many cases as few as 10 RDTs are sufficient, and in almost all 30 are.  In the worst case, the lower confidence bound on the prediction of a RDT is 1 – (1 – 1/k)^N, where N is the number of RDTs used.   Empirical data backs this up.
• The training cost is extremely low; the construction of one tree itself is o(k), and the population of proper statistics across all trees can be done in a single sweep across the data.
• This method also lends itself in a straightforward manner to online learning
• Demonstrate that in a number of domains (a few used in classification/regression contests), RDTs outperform pruned, unpruned, and calibrated C4.5 trees.
• In other tests, RDTs perform comparably (albeit slightly worse) than boosted decision trees, and almost exactly equal to bagged decision trees.
• From my experience, the running time of training traditional decision trees is generally not an issue, and neither is the cost of boosting, but RDTs are still much faster, and can function online, which decision trees can’t do in a simple manner (although there are algorithms to do this such as ITI)
• These are in both 0-1 loss and cost-sensitive loss settings

On the Optimality of Probability Estimation by Random Decision Trees, Wei Fan, (AAAI 04?)

• Learning an optimal decision tree is NP-Hard in general (I think this is due to the xor problem at least), so decision tree learners generally use some heuristic such as information gain of 1 feature at a time.  This gets rid of guarantees of optimality, but makes the problem tractable
• Most decision tree learners must perform a sweep over the entire data set at each branch creation
• In RDTs multiple splits can be made on the same feature as branches are made further down in the tree.  Clearly this doesn’t make sense in the case of binary variables
• In actual implementations with RDTs it may be desirable to actually look at all the data while creating the tree, so that branches that would have lead to no data in the leaves can be pruned out early
• Empirically, each random tree tends to be about 3x as large as a learned decision tree
• Generally want at least 30 trees because that is the point where the T-distribution becomes highly similar to the Normal distribution.  For skewed data, up to 50 RDTs have been useful
• Data sets used here are same as in Is random model better? On its accuracy and efficiency
• RDTs do not seem to overfit, especially when compared to learned decision trees
• Point out that learned decision trees do not output results that vary smoothly as data changes, which RDTs do more or less

Effective Estimation of Posterior Probabilities: Explaining the Accuracy of Randomized Decision Tree Approaches, Wei Fan, Ed Greengrass, Joe McCloskey, Philip S. Yu, Kevin Drummey (ICDM 05?)

• RDTs estimate a function using a subset of all possible decision trees s.t. the subset used are all consistent with the training data
• Is claimed RDTs are a Bayesian Optimal Classifier (read up on this)
• Goal of paper is to demonstrate RDTs approximate true probabilities better than learned decision trees
• Use a metric called improved squared error that has the property of being nonzero only when the wrong decision boundary is used
• In binary 0-1 loss, this means if you made the wrong decision, your prediction for the probability of the correct label was less than 0.5 when it should have been greater.  How much less that 0.5 your prediction was is how much error will be recorded (squared)
• Cross entropy is undefined when there are 0 probabilities, which can make it problematic
• On a synthetic dataset (used so that actual probabilities can be computed and compared), RDTs outperform learned decision trees by a significant margin.  Random forests (RF), which are a group of perturbed learned decision trees, have performance comparable to RDTs (although slightly worse)
• RDTs and RFs demonstrate a significant reduction in variance as compared to  learned decision trees (LDT)
• In datasets previously used for RDTs, RDTs beat RFs

## Comparing size and accuracy of decision trees with and without synchronic arcs

Right, well when predicting x’, the feature vector is (x, y, z, y’, z’), but not x’, so all the variables aside from the one we’re trying to predict with that decision tree are available as inputs.

oh, I see… so, that’s a bit stronger than needed, but if that
doesn’t produce much smaller trees, then I guess it’s harder than I
thought

In response to this, I reran the taxi/decision tree experiment with and without synchronic arcs, and compared the size of the decision trees and their accuracy in both cases.  In all cases, the accuracy of the decision trees with synchronic arcs was better (or at least as good) as the decision tree without.  As far as size goes, in some cases the trees with synchronic arcs are larger and in some cases they are not.

In order to avoid divide by zero problems, both classifiers are said to have at least one error; for some attributes the decision trees can perfectly predict the value with no error at all.

```Comparing: @attribute taxi_x-0 {0,1,2,3,4}
(Tree size w/o current / Tree size w/ current): 1.0
(Num leaves w/o current / Num leaves w/ current): 1.0
(Error rate w/o current / Error rate w/ current): 1.0

Comparing: @attribute taxi_y-0 {0,1,2,3,4}
(Tree size w/o current / Tree size w/ current): 1.0
(Num leaves w/o current / Num leaves w/ current): 1.0
(Error rate w/o current / Error rate w/ current): 1.0

Comparing: @attribute taxi_delta_x-0 {-1,0,1}
(Tree size w/o current / Tree size w/ current): 2.625
(Num leaves w/o current / Num leaves w/ current): 3.3636363636363638
(Error rate w/o current / Error rate w/ current): 1.0

Comparing: @attribute taxi_delta_y-0 {-1,0,1}
(Tree size w/o current / Tree size w/ current): 2.625
(Num leaves w/o current / Num leaves w/ current): 3.3636363636363638
(Error rate w/o current / Error rate w/ current): 1.0

Comparing: @attribute passenger_x-0 {0,1,2,3,4}
(Tree size w/o current / Tree size w/ current): 0.46153846153846156
(Num leaves w/o current / Num leaves w/ current): 0.5
(Error rate w/o current / Error rate w/ current): 1.7999999999999998

Comparing: @attribute passenger_y-0 {0,1,2,3,4}
(Tree size w/o current / Tree size w/ current): 0.3
(Num leaves w/o current / Num leaves w/ current): 0.3333333333333333
(Error rate w/o current / Error rate w/ current): 19.0

Comparing: @attribute destination_x-0 {0,1,2,3,4}
(Tree size w/o current / Tree size w/ current): 1.0
(Num leaves w/o current / Num leaves w/ current): 1.0
(Error rate w/o current / Error rate w/ current): 1.0

Comparing: @attribute destination_y-0 {0,1,2,3,4}
(Tree size w/o current / Tree size w/ current): 1.0
(Num leaves w/o current / Num leaves w/ current): 1.0
(Error rate w/o current / Error rate w/ current): 1.0

Comparing: @attribute action-0 {U,D,W,E,PickUp,DropOff,Refuel}
(Tree size w/o current / Tree size w/ current): 0.9667341934108113
(Num leaves w/o current / Num leaves w/ current): 0.9798368971938611
(Error rate w/o current / Error rate w/ current): 1.0632673515409672

Comparing: @attribute fuel-0 {-1,0,1,2,3,4,5,6,7,8,9,10,11,12,13}
(Tree size w/o current / Tree size w/ current): 0.995260663507109
(Num leaves w/o current / Num leaves w/ current): 0.8274111675126904
(Error rate w/o current / Error rate w/ current): 25.999999999999996

Comparing: @attribute fuelDelta-0 {-1,0,1,2,3,4,5,6,7,8,9,10,11,12,13}
(Tree size w/o current / Tree size w/ current): 3.3157894736842106
(Num leaves w/o current / Num leaves w/ current): 2.6
(Error rate w/o current / Error rate w/ current): 25.999999999999996

Comparing: @attribute reward-0 {20,-1,-10,-20}
(Tree size w/o current / Tree size w/ current): 2.022727272727273
(Num leaves w/o current / Num leaves w/ current): 1.605263157894737
(Error rate w/o current / Error rate w/ current): 21.999999999999996

Comparing: @attribute hasPassenger-0 {true,false}
(Tree size w/o current / Tree size w/ current): 0.9135802469135802
(Num leaves w/o current / Num leaves w/ current): 0.7971014492753623
(Error rate w/o current / Error rate w/ current): 20.0

Comparing: @attribute terminal-0 {true,false}
(Tree size w/o current / Tree size w/ current): 2.4
(Num leaves w/o current / Num leaves w/ current): 2.25
(Error rate w/o current / Error rate w/ current): 8.999999999999998

Comparing: @attribute wallN-0 {true,false}
(Tree size w/o current / Tree size w/ current): 1.2027027027027026
(Num leaves w/o current / Num leaves w/ current): 1.25
(Error rate w/o current / Error rate w/ current): 1.5913978494623655

Comparing: @attribute wallS-0 {true,false}
(Tree size w/o current / Tree size w/ current): 1.0704225352112675
(Num leaves w/o current / Num leaves w/ current): 1.117283950617284
(Error rate w/o current / Error rate w/ current): 2.5555555555555554

Comparing: @attribute wallW-0 {true,false}
(Tree size w/o current / Tree size w/ current): 2.15311004784689
(Num leaves w/o current / Num leaves w/ current): 2.33125
(Error rate w/o current / Error rate w/ current): 5.446808510638299

Comparing: @attribute wallE-0 {true,false}
(Tree size w/o current / Tree size w/ current): 3.0174418604651163
(Num leaves w/o current / Num leaves w/ current): 3.111940298507463
(Error rate w/o current / Error rate w/ current): 6.925925925925926
```

## Using decision trees for modelling Taxi, 2

> I’m not sure I follow what you mean.  Do you mean using features
> from both the previous and current time step to predict the current
> time step?  Could you give an example?

Right!  In dynamic Bayes net terminology, they are called

Take a look at

If you search for synchronic, you can find some discussions of the
importance of this kind of representation in stochatic domains.  There
isn’t much in terms of deterministic domains, so here’s a quick
example.

Imagine data that looks like this (three binary variables):

000 -> 001
001 -> 010
010 -> 011
011 -> 100
100 -> 101
101 -> 110
110 -> 111
111 -> 000

(It’s a binary counter with wraparound.)

Let’s say x,y,z are the values before the transition and x’,y’,z’
are the values afterwards.  Consider learning x’ from x,y,z.  The
decision tree looks like

if x = 0
if y = 1
if z = 1, then x’ = 1
else, x’ = 0
else, x’ = 0
else,
if y = 1,
if z = 1, x’ = 0
else, x’ = 1
else, x’ = 1

But, what if y’ is part of the input as well?  Then, we can ignore z
and make a tree like:

if x = 0,
if y = 1,
if y’ = 0, x’ = 1
else, x’ = 0
else, x’ = 0
else,
if y = 1,
if y’ = 0, x’ = 0
else, x’ = 1
else, x’ = 1

Admittedly, that’s not much better, but a longer bit counting
example would should an advantage, because every bit would just need
to look at its value and the value of the neighbor and its previous
value.  There’s also some more advantage if we’re predicting delta x:

if y = 1,
if y’ = 0, delta x = 1
else, delta x = 0
else, delta x = 0

It was pretty easy to rerun the experments with synchronic links.  The decision trees for all variables is uploaded here.  I fixed the mistake with the fuel not changing on a failed location change, and changed the rewards from before (as far as the rewards go, its not totally clear to me which is right from the paper, but this seems better).

Note there are some instances where a prediction does not make any sense (isn’t possible given the environment), and is followed by (0.0) for the confidence.  This is just because if weka has no data on a given state, it makes one up so there is a leaf for all possible values in the tree when using class instead of scalar data, which is how I had to set it up here.

Each variable has a suffix -X where X is the number of time steps before the present step predicted.  So foo-1 means attribute foo on the previous time step, and foo-0 means the attribute foo at the current time step.

Some outtakes are:

```Decision tree for @attribute taxi_delta_x-0 {-1,0,1}
------------------
action-1 = U: 0 (3740.0)
action-1 = D: 0 (3744.0)
action-1 = W
| wallW-1 = true: 0 (1814.0)
| wallW-1 = false
| | wallE-0 = true: 0 (211.0)
| | wallE-0 = false: -1 (1726.0)
action-1 = E
| wallE-1 = true: 0 (1864.0)
| wallE-1 = false
| | wallW-0 = true: 0 (210.0)
| | wallW-0 = false: 1 (1747.0)
action-1 = PickUp: 0 (3657.0)
action-1 = DropOff: 0 (3716.0)
action-1 = Refuel: 0 (3651.0)
```
```Decision tree for @attribute fuelDelta-0 {-1,0,1,2,3,4,5,6,7,8,9,10,11,12,13}
J48 pruned tree
------------------
action-1 = U: -1 (3740.0)
action-1 = D: -1 (3744.0)
action-1 = W: -1 (3751.0)
action-1 = E: -1 (3821.0)
action-1 = PickUp: 0 (3657.0)
action-1 = DropOff: 0 (3716.0)
action-1 = Refuel: 0 (3651.0/135.0)
Decision tree for @attribute reward-0 {20,-1,-10,-20}
------------------
terminal-0 = true: -20 (1500.0/13.0)
terminal-0 = false
| fuelDelta-0 = -1: -1 (13569.0)
| fuelDelta-0 = 0
| | action-1 = U: -10 (0.0) [note these are wrong, but have...
| | action-1 = D: -10 (0.0) ... no supporting evidence either way]
| | action-1 = W: -10 (0.0)
| | action-1 = E: -10 (0.0)
| | action-1 = PickUp
| | | hasPassenger-0 = true
| | | | hasPassenger-1 = true: -10 (109.0)
| | | | hasPassenger-1 = false: -1 (86.0)
| | | hasPassenger-0 = false: -10 (3462.0)
| | action-1 = DropOff: -10 (3703.0)
| | action-1 = Refuel: -1 (3516.0)
| fuelDelta-0 = 1: -1 (9.0)
| fuelDelta-0 = 2: -1 (7.0)
...
| fuelDelta-0 = 12: -1 (10.0)
| fuelDelta-0 = 13: -1 (0.0)
Decision tree for @attribute terminal-0 {true,false}
------------------
reward-0 = 20: true (13.0)
reward-0 = -1: false (17306.0)
reward-0 = -10: false (7274.0)
reward-0 = -20: true (1487.0)
```

## Using decision trees for modelling Taxi

In an attempt to construct a model for the non-fickle Taxi (The MAXQ Method for Hierarchical Reinforcement Learning, Dietterich), I made a rich feature vector and then attempted to train decision trees based on the traces of random actions.

The feature vector I used was as follows:

1. taxi (x,y)
2. taxi delta (x,y)
3. passenger (x,y)
4. destination (x,y)
5. action
6. fuel
7. fuel delta
8. reward
9. has passenger?
10. terminal?
11. wall to North?
12. wall to South?
13. wall to West?
14. wall to East?

From there I trained a J48/C4.5 decision tree which was the result of 1,500 trajectories through taxi when using a (uniform) random policy.

Some of the results are below.  When there is some value listed as <attribute name>-1, it means that it happened one step before the attribute that is being predicted.  For all integer values attributes, I had to have weka treat them as classes instead of numbers so that it could predict those values as well (which means there are no generalizations between numbers, a different decision tree classifier that performs regression and not just classification wouldn’t have this limitation).  Even still things seem to be quite good.  Here some of them are:

```Decision tree for @attribute taxi_delta_x-0 {-1,0,1}
------------------
action-1 = U: 0 (3129.0)
action-1 = D: 0 (3133.0)
action-1 = W
|   wallW-1 = true: 0 (1569.0)
|   wallW-1 = false: -1 (1607.0)
action-1 = E
|   wallE-1 = true: 0 (1595.0)
|   wallE-1 = false: 1 (1642.0)
action-1 = PickUp: 0 (3351.0)
action-1 = DropOff: 0 (3547.0)
action-1 = Refuel: 0 (3426.0)

Decision tree for @attribute taxi_delta_y-0 {-1,0,1}
------------------
action-1 = U
|   wallN-1 = true: 0 (692.0)
|   wallN-1 = false: -1 (2437.0)
action-1 = D
|   wallS-1 = true: 0 (780.0)
|   wallS-1 = false: 1 (2353.0)
action-1 = W: 0 (3176.0)
action-1 = E: 0 (3237.0)
action-1 = PickUp: 0 (3351.0)
action-1 = DropOff: 0 (3547.0)
action-1 = Refuel: 0 (3426.0)

Decision tree for @attribute fuel-0 {0,1,2,3,4,5,6,7,8,9,10,11,12,13}
------------------
...
fuel-1 = 1: 1 (1548.0/3.0)
fuel-1 = 2
|   action-1 = U
|   |   wallN-1 = true: 2 (74.0)
|   |   wallN-1 = false: 1 (303.0)
|   action-1 = D
|   |   wallS-1 = true: 2 (86.0)
|   |   wallS-1 = false: 1 (267.0)
|   action-1 = W
|   |   wallW-1 = true: 2 (214.0)
|   |   wallW-1 = false: 1 (195.0)
|   action-1 = E
|   |   wallE-1 = true: 2 (147.0)
|   |   wallE-1 = false: 1 (229.0)
|   action-1 = PickUp: 2 (361.0)
|   action-1 = DropOff: 2 (384.0)
|   action-1 = Refuel: 2 (349.0/6.0)
[although the fuel = 1 is weird, the rest of
them are fine, but replicate for each value]
...

Decision tree for @attribute fuelDelta-0 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,-1}
------------------
action-1 = U
|   wallN-1 = true: 0 (692.0)
|   wallN-1 = false: -1 (2437.0)
action-1 = D
|   wallS-1 = true: 0 (780.0)
|   wallS-1 = false: -1 (2353.0)
action-1 = W
|   wallW-1 = true: 0 (1569.0)
|   wallW-1 = false: -1 (1607.0)
action-1 = E
|   wallE-1 = true: 0 (1595.0)
|   wallE-1 = false: -1 (1642.0)
action-1 = PickUp: 0 (3351.0)
action-1 = DropOff: 0 (3547.0)
action-1 = Refuel: 0 (3426.0/96.0)

Decision tree for @attribute terminal-0 {true,false}
------------------
fuel-1 = 0: false (0.0)
fuel-1 = 1
|   action-1 = U
|   |   wallN-1 = true: false (127.0)
|   |   wallN-1 = false: true (440.0)
|   action-1 = D
|   |   wallS-1 = true: false (124.0)
|   |   wallS-1 = false: true (453.0)
|   action-1 = W
|   |   wallW-1 = true: false (241.0)
|   |   wallW-1 = false: true (291.0)
|   action-1 = E
|   |   wallE-1 = true: false (235.0)
|   |   wallE-1 = false: true (307.0)
|   action-1 = PickUp: false (560.0)
|   action-1 = DropOff: false (559.0/1.0)
|   action-1 = Refuel: false (515.0)
fuel-1 = 2: false (3769.0)
fuel-1 = 3: false (3835.0/1.0)
fuel-1 = 4: false (4127.0/1.0)
fuel-1 = 5: false (4286.0/1.0)
fuel-1 = 6: false (3600.0/2.0)
fuel-1 = 7: false (3241.0/1.0)
fuel-1 = 8: false (2810.0)
fuel-1 = 9: false (2196.0)
fuel-1 = 10: false (1693.0/1.0)
fuel-1 = 11: false (1250.0)
fuel-1 = 12: false (808.0/1.0)
fuel-1 = 13: false (0.0)

Decision tree for @attribute reward-0 {19,-1,-11,-21}
------------------
...
[Dynamics for each direction looks basically same
except for the wall adjacency test is different of course]
action-1 = E
|   fuel-1 = 0: -1 (0.0)
|   fuel-1 = 1
|   |   wallE-1 = true: -1 (235.0)
|   |   wallE-1 = false: -21 (307.0)
|   fuel-1 = 2: -1 (559.0)
|   fuel-1 = 3: -1 (586.0)
|   fuel-1 = 4: -1 (603.0)
|   fuel-1 = 5: -1 (582.0)
|   fuel-1 = 6: -1 (481.0)
|   fuel-1 = 7: -1 (452.0)
|   fuel-1 = 8: -1 (415.0)
|   fuel-1 = 9: -1 (302.0)
|   fuel-1 = 10: -1 (251.0)
|   fuel-1 = 11: -1 (180.0)
|   fuel-1 = 12: -1 (118.0)
|   fuel-1 = 13: -1 (0.0)
action-1 = PickUp: -11 (5057.0/103.0)
action-1 = DropOff: -11 (5173.0/9.0)
action-1 = Refuel: -1 (5048.0)
...
```

So those are the most interesting ones, others are less pretty.  For full disclosure I tried uploading all of the resulting trees, but something is wrong with wordpress right now.

Anyway, I think thats a good/interesting start to the feature construction discussion.  I did force feed a number of variables to make this all easier, but its tough for the decision trees to function without them.  Other regression/classification algorithms may not need then, but the nice thing about the decision tree approach is that it is very clear which variables are found to contribute and which are not.