Ideas from papers on Random Decision Trees

Is random model better? On its accuracy and efficiency, Wei Fan, Haixun Wang, Philip S. Yu, and Sheng Ma (ICDM 03?)

  • Basic idea is that instead of trying to build one optimal classifier, build many unsophisticated ones and average the predictions over all unsophisticated classifiers
  • The specific proposal here is to use random decision trees, which are trees that terminate at a predetermined depth and branch on a random feature at a random boundary at each branch
    • Can be modified to accept missing features
    • This method doesn’t fall into the classic xor problem that most incremental decision tree learners fall into – RDTs don’t even look at the data at the time the tree is built
    • Argument that it is optimal to maximize tree diversity.  For example, we could build trees so high that each sample is recorded exactly in the tree, but this isn’t what we are looking for because we want to average over many trees, and when the trees are all too deep they will all give basically the same prediction
      • Based on simple combinatorics, if there are k features, the largest number of possible ways of combining a subset (of size i) of those features is  i choose k, which is maximal at i = k/2
      • This means maximal diversity is reached when the tree depth is capped at k/2.  Empirical data backs this up
      • In many cases as few as 10 RDTs are sufficient, and in almost all 30 are.  In the worst case, the lower confidence bound on the prediction of a RDT is 1 – (1 – 1/k)^N, where N is the number of RDTs used.   Empirical data backs this up.
      • The training cost is extremely low; the construction of one tree itself is o(k), and the population of proper statistics across all trees can be done in a single sweep across the data.
        • This method also lends itself in a straightforward manner to online learning
        • Demonstrate that in a number of domains (a few used in classification/regression contests), RDTs outperform pruned, unpruned, and calibrated C4.5 trees.
          • In other tests, RDTs perform comparably (albeit slightly worse) than boosted decision trees, and almost exactly equal to bagged decision trees.
          • From my experience, the running time of training traditional decision trees is generally not an issue, and neither is the cost of boosting, but RDTs are still much faster, and can function online, which decision trees can’t do in a simple manner (although there are algorithms to do this such as ITI)
          • These are in both 0-1 loss and cost-sensitive loss settings

On the Optimality of Probability Estimation by Random Decision Trees, Wei Fan, (AAAI 04?)

  • Learning an optimal decision tree is NP-Hard in general (I think this is due to the xor problem at least), so decision tree learners generally use some heuristic such as information gain of 1 feature at a time.  This gets rid of guarantees of optimality, but makes the problem tractable
    • Most decision tree learners must perform a sweep over the entire data set at each branch creation
    • In RDTs multiple splits can be made on the same feature as branches are made further down in the tree.  Clearly this doesn’t make sense in the case of binary variables
    • In actual implementations with RDTs it may be desirable to actually look at all the data while creating the tree, so that branches that would have lead to no data in the leaves can be pruned out early
    • Empirically, each random tree tends to be about 3x as large as a learned decision tree
    • Generally want at least 30 trees because that is the point where the T-distribution becomes highly similar to the Normal distribution.  For skewed data, up to 50 RDTs have been useful
    • Data sets used here are same as in Is random model better? On its accuracy and efficiency
    • RDTs do not seem to overfit, especially when compared to learned decision trees
    • Point out that learned decision trees do not output results that vary smoothly as data changes, which RDTs do more or less

Effective Estimation of Posterior Probabilities: Explaining the Accuracy of Randomized Decision Tree Approaches, Wei Fan, Ed Greengrass, Joe McCloskey, Philip S. Yu, Kevin Drummey (ICDM 05?)

  • RDTs estimate a function using a subset of all possible decision trees s.t. the subset used are all consistent with the training data
  • Is claimed RDTs are a Bayesian Optimal Classifier (read up on this)
  • Goal of paper is to demonstrate RDTs approximate true probabilities better than learned decision trees
  • Use a metric called improved squared error that has the property of being nonzero only when the wrong decision boundary is used
    • In binary 0-1 loss, this means if you made the wrong decision, your prediction for the probability of the correct label was less than 0.5 when it should have been greater.  How much less that 0.5 your prediction was is how much error will be recorded (squared)
    • Cross entropy is undefined when there are 0 probabilities, which can make it problematic
    • On a synthetic dataset (used so that actual probabilities can be computed and compared), RDTs outperform learned decision trees by a significant margin.  Random forests (RF), which are a group of perturbed learned decision trees, have performance comparable to RDTs (although slightly worse)
    • RDTs and RFs demonstrate a significant reduction in variance as compared to  learned decision trees (LDT)
    • In datasets previously used for RDTs, RDTs beat RFs
Tagged , , , , , , , ,

3 thoughts on “Ideas from papers on Random Decision Trees

  1. Michael Littman says:

    “Generally want at least 30 trees because that is the point where the T-distribution becomes highly similar to the Normal distribution.”

    I must not be thinking about these trees the right way. What does the number of trees have to do with the law of large numbers?

  2. hut3 says:

    The idea is that in order to increase estimation diversity (which is suggested to be useful in the stochastic setting), the trees do not even make one split on each dimension. For that reason, the goal is to have many diverse predictions combined so that in the end the ensemble result factors in splits from various locations in the state space and along each dimension at least once.

  3. Michael Littman says:

    That sounds correct, but that wasn’t quite what I was asking. But, maybe that’s not so important right now.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: