Large-scale Video Classification with Convolutional Neural Networks. Karpathy, Toderici, Shetty, Leung, Sukthankar, Fei-Fei


  1. Applies convolutional NNs to the classification of 1 million YouTube videos
  2. Propose a “multiresolution, foveated architecture” to speed up training
  3. Created a new dataset for this paper: 1 million sports videos belonging to 487 classes of sports (about 1k-3k vids/class)
  4. They are interested in video (so need to factor in temporal information, unlike more common image classification)
  5. Because networks for learning from video have to be so large, they propose a 2-part system to make learning tractable: a context stream on low-res input and a fovea stream on high-res input.  This is about 3x as fast as the naive approach and just as accurate
  6. They also use the features learned from this dataset to then improve performance on a smaller corpus (going from 41.3% to 65.4%)
  7. A common approach to video classification involves extracting features, quantizing them into something analogous to a bag-of-words representation, and then feeding that into an SVM
  8. Previous activity recognition benchmarks are small in terms of sample size, which doesn’t work well with NNs, so they propose a huge dataset
  9. Unlike images that can somewhat easily be rescaled to the same format, there is more variability in video, including the addition of temporal length
  10. Here they treat videos as a bag of short fixed-length clips
  11. They considered at least 4 different classes of topologies for the task, all fundamentally different (structurally, not in terms of how/where pooling is done for example)
    1. Basically, this comes down to where you deal with the temporal aspect: is it at the input layer, or is each frame processed as a still and then merged with temporal data further up?
  12. Different architectures
    1. Single-Frame: just process stills alone
    2. Early-Fusion: the first convolutional layer's filters extend over some number of consecutive frames in time
    3. Late-Fusion: starts with 2 single-frame networks applied to two frames some distance apart in time (15 frames in the paper) and merges their outputs in “the first fully connected layer”
    4. Slow-Fusion: A more graduated version of merging data (lower levels have less temporal data, higher levels have more)
  13. To speed up training they tried reducing the number of weights, but that made classification worse.  Lowering the image resolution sped things up too but also hurt perf.  Then they tried the 2-part system (context, fovea), working from 178 x 178 frames
    1. Context stream is the whole frame, downsampled to 89 x 89
    2. Fovea stream is the 89 x 89 center crop at the original resolution
  14. Optimize NN with “Downpour Stochastic Gradient Descent”
  15. Use data augmentation to prevent overfitting
  16. Training took a month, although they say their perf isn’t optimized and could run better on GPUs
  17. They also tried to run the features generated by their NN through linear classifiers, but letting the NN do the classification worked better
  18. Lots of incorrectly labeled videos, still performance is said to be good
  19. Slow fusion works the best (although the difference from the other architectures isn’t enormous)
  20. Camera motion can mess up the classifiers
  21. Errors tend to be between related classes (e.g. hiking vs backpacking)
  22. Then transfer learning to the smaller UCF-101 dataset
    1. Worked best when retraining the top 3 layers of the net
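
To make the context/fovea split from item 13 concrete, here is a minimal NumPy sketch of how the two stream inputs could be built from one frame. It assumes, as I understand the paper, a 178 x 178 input frame, with the context stream seeing the whole frame downsampled to 89 x 89 and the fovea stream seeing the 89 x 89 center crop at the original resolution. The function names and the 2x2 average-pooling downsampler are my own stand-ins, not the authors' code.

```python
import numpy as np

FRAME_SIZE = 178   # side of the full input frame (assumed from the notes)
STREAM_SIZE = 89   # side of the input each stream receives


def context_stream_input(frame):
    """Whole frame at half resolution.

    Uses 2x2 average pooling as a stand-in downsampler; the paper's
    actual resizing method is not specified in these notes.
    """
    h, w, c = frame.shape
    assert h == FRAME_SIZE and w == FRAME_SIZE
    # Split each spatial axis into (blocks, 2) and average over the 2x2 blocks.
    return frame.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))


def fovea_stream_input(frame):
    """Center crop at the original resolution."""
    h, w, _ = frame.shape
    top = (h - STREAM_SIZE) // 2
    left = (w - STREAM_SIZE) // 2
    return frame[top:top + STREAM_SIZE, left:left + STREAM_SIZE]


if __name__ == "__main__":
    frame = np.random.rand(FRAME_SIZE, FRAME_SIZE, 3)
    ctx = context_stream_input(frame)
    fov = fovea_stream_input(frame)
    print(ctx.shape, fov.shape)
    # Together the two streams carry half as many input values as the full frame.
    print((ctx.size + fov.size) / frame.size)
```

The point of the design is that the two 89 x 89 inputs together hold half as many values as one 178 x 178 frame, while the center of the frame (where the subject usually is in these videos) keeps full detail.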