What can 20,000 models teach us? Lessons learned from a large-scale empirical comparison of learning algorithms (talk by Alexandru Niculescu-Mizil)


  1. This talk covered the speaker’s experience implementing a large number of supervised learning methods and his observations about their performance.
  2. There were some parts that members of the audience disagreed with, such as the way probabilities were inferred from the classifiers (not all of them are designed to produce probabilities, so making this happen is a bit off-label) and the way regression algorithms were made to perform classification, though that problem is not restricted to this talk. Overall there was definitely a large amount of useful information.  It’s also worth pointing out that the speaker was very open about where we should and should not draw conclusions from the results, which was refreshing.
    1. It is also worth pointing out that he searched for new parameterizations on each validation fold, which I thought is normally not done; I think this helps brittle algorithms too much.
    2. He had many different metrics and combined them in a way I thought could be improved upon (although it was reasonable).  Each metric was normalized between baseline performance and that of the best algorithm; I would have used z-scores or something similar (see the first sketch after this list).
  3. The speaker pointed out that the problems he worked on were of medium size (binary classification, 10-200 features, about 5000 samples, though I think I may have dropped a zero), so this is not big data, but definitely medium-sized.
  4. Ten learning algorithms with numerous parameterizations were tested, which is where the 20,000 models come from.
  5. The performance of the algorithms as initially tested was surprising.   The order from best to worst was:
    1. Bagged Trees
    2. Random Forests
    3. ANNs (I was very surprised to see them this high)
    4. Boosted Trees (I expected these above bagged trees and random forests)
    5. KNN (I thought this would be lower)
    6. SVMs (This low?  I feel like something may have been funny with the implementation or parameter search)
    7. Boosted Stumps
    8. Decision Trees (This was a shocker)
    9. Logistic Regression
    10. Naive Bayes (Naturally)
  6. Boosted trees in particular were #1 in the most categories, but were particularly bad in a few.
  7. Also, almost every algorithm was best on at least one dataset and/or metric, so the speaker was clear in pointing out that this list should only inform the order in which to try methods on your particular problem until you find one that works.
  8. He discussed the use of Platt scaling and isotonic regression to correct calibration problems in boosted trees, after which they performed better than anything else (see the calibration sketch after this list).  I had not heard of these approaches before.
  9. There was a lot of focus on calibration in the talk
  10. The punchline?  Use ensemble methods.  The ensembles consistently had the best performance by far, unlike the best non-ensemble methods, which were outperformed depending on the task.
  11. Overall, he said the best performance metric is cross entropy.  I had only been familiar with this term as an optimization objective (see the log-loss sketch after this list).
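
On the metric combination in item 2.2: here is a minimal sketch, entirely my own illustration with made-up numbers rather than the talk's actual scheme, contrasting the baseline-to-best normalization as I understood it with the z-score alternative I had in mind.

```python
import numpy as np

# Hypothetical raw scores of several algorithms on one metric (made-up numbers).
scores = np.array([0.61, 0.74, 0.79, 0.83, 0.86])
baseline = 0.50  # e.g. chance-level performance on a balanced binary problem

# Normalization as I understood it from the talk: baseline maps to 0,
# the best observed algorithm maps to 1.
normalized = (scores - baseline) / (scores.max() - baseline)

# The alternative I would have tried: z-scores relative to the other algorithms.
z_scores = (scores - scores.mean()) / scores.std()

print(normalized)
print(z_scores)
```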
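On Platt scaling and isotonic regression (item 8): since I had not seen them before, here is a minimal sketch of how one might apply them today with scikit-learn's CalibratedClassifierCV. The synthetic dataset and the gradient-boosted base model are my own choices for illustration, not the speaker's setup.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic binary problem roughly in the "medium size" range mentioned in the talk.
X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base = GradientBoostingClassifier(random_state=0)

# Platt scaling fits a sigmoid to the classifier's scores; isotonic regression
# fits a monotone step function. Both are wrapped by CalibratedClassifierCV.
platt = CalibratedClassifierCV(base, method="sigmoid", cv=3).fit(X_train, y_train)
isotonic = CalibratedClassifierCV(base, method="isotonic", cv=3).fit(X_train, y_train)

for name, model in [("Platt", platt), ("isotonic", isotonic)]:
    probs = model.predict_proba(X_test)[:, 1]
    print(name, "log loss:", log_loss(y_test, probs))
```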
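On cross entropy (item 11): as a performance metric it is the same quantity as the log loss used as an optimization objective, just evaluated on held-out predictions. A minimal sketch, my own illustration with made-up numbers, computing it directly from the definition and via scikit-learn:

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predicted probabilities of the positive class.
y_true = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Cross entropy / log loss computed directly from its definition ...
manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# ... and via scikit-learn; the two should agree.
print(manual, log_loss(y_true, p))
```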