ImageNet Classification with Deep Convolutional Neural Networks. Krizhevsky, Sutskever, Hinton.

State-of-the-art image recognition at the time of publication.  Discussed in Hinton’s videolectures.

  1. Describes the system that won the ImageNet LSVRC-2012 competition with a 15.3% top-5 error rate, compared to 26.2% for the next-best entry
    1. Top-5 scoring: the algorithm may guess the 5 most likely classes for the item shown in the image, and counts as correct if any of them matches
  2. Network has 60 million parameters, 650,000 neurons, 5 convolutional layers with additional max-pooling layers, and a final 1,000 way <1000 possible categories> softmax
  3. GPU implemented
  4. Training with dropout (randomly remove nodes while training)
  5. The scale of this task makes it much more challenging than classical MNIST digit classification
  6. “To learn about thousands of objects from millions of images, we need a model with a large learning capacity.  However, the immense complexity of the object recognition task means that this problem cannot be specified even by a dataset as  large as ImageNet, so our model should also have lots of prior knowledge to compensate for all the data we don’t have.”
  7. Convolutional neural networks (CNNs) have far fewer connections and parameters than standard networks with similarly sized layers, because the same weights are replicated at many positions. This shrinks the parameter space, making them “… easier to train, while their theoretically-best performance is likely to be only slightly worse.” <Really need a citation on a statement like that>
  8. “In the end, the network’s size is limited mainly by the amount of memory available on current GPUs and by the amount of training time we are willing to tolerate… All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.”
  9. The ImageNet database has over 15 million high-res images in roughly 22,000 categories; each ImageNet competition uses a subset of 1,000 categories
  10. “The architecture of our network… contains eight learned layers — five convolutional and three fully-connected.”
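  11. The standard neuron activation is a tanh function, but it turns out that even in small networks this leads to very slow training.  Here rectified linear units (ReLUs) are used instead: f(x) = max(0, x), i.e. linear units whose output is clipped below at 0
  12. The network is split across 2 GPUs because a single GPU did not have enough memory to hold it. Half of the kernels/neurons sit on each GPU, the GPUs can read and write each other’s memory directly (and fast), and they communicate only in certain layers to keep cross-GPU traffic low
    1. The two-GPU network actually trains slightly faster than the single-GPU network it is compared against. That single-GPU net has half as many kernels in each convolutional layer, except that the final convolutional layer is not halved, so that the parameter counts stay comparable
  13. Normalization of activity is also done (“local response normalization”): each neuron’s output is divided by a term that grows with the squared activity of neurons at the same spatial position in adjacent kernel maps, a form of lateral inhibition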
  14. Does pooling which “summarize[s] the outputs of neighboring groups of neurons in the same kernel map <don’t know what that means>.”
    1. Pooling here is overlapping: each unit summarizes a 3×3 neighborhood, and the units are spaced 2 pixels apart
  15. Exact details of the network topology are complex (4 paragraphs or so) <so not taking notes on it here>
  16. Get more mileage out of the training set by
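    1. Extracting random 224×224 patches (and their horizontal reflections) from the original 256×256 images.  Increases training set size by a factor of 2048. Even though the resulting examples are highly interdependent, this keeps the network from overfitting, which it otherwise does without this trick
    2. Adding noise to the data, by altering the intensities of the RGB channels along their principal components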
  17. Reduce overfitting by dropout, which means that the output of each hidden neuron will be 0 (effectively deleting it) with probability 0.5
    1. “So every time an input is presented, the neural network samples a different architecture… This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons.  It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.”
    2. At test time, use all neurons but multiply their outputs by 0.5, which approximates averaging (as a geometric mean) the predictions of the many sampled networks
  18. Training uses stochastic gradient descent with batch size 128 (momentum 0.9, weight decay 0.0005)
  19. Performance is far ahead of other methods (15.3% vs. 26.2% top-5 error for the runner-up on LSVRC-2012)
  20. <Moves on to qualitative evaluations>
  21. Interestingly, the kernels learned on GPU1 are largely color-agnostic, while those on GPU2 are largely color-specific
    1. <Not surprisingly, at least for black and white they look like Gabors.  The colored features may be equivalent, but with a very low frequency?>
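  22. Can also use the network to find “similar” images: two images count as similar when their activation vectors in the last hidden layer (4096-dimensional) are close in Euclidean distance.  <The results are surprisingly good>
  23. Removing any single middle convolutional layer causes about a 2% drop in top-1 performance
  24. Didn’t do any unsupervised pre-training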
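
Items 11 and 17 above (rectified linear units and dropout) are easy to sketch in plain Python. This is only a toy illustration on a list of numbers, not the paper’s GPU implementation; the function names relu, dropout_train and dropout_test are made up for this note.

```python
import random

def relu(x):
    """Rectified linear unit: linear for positive inputs, 0 otherwise."""
    return max(0.0, x)

def dropout_train(activations, p_drop=0.5, rng=random):
    """Training pass: zero each hidden output independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else a for a in activations]

def dropout_test(activations, p_drop=0.5):
    """Test pass: keep every neuron, but scale outputs by the keep probability."""
    keep = 1.0 - p_drop
    return [a * keep for a in activations]

# Hypothetical pre-activations of five hidden neurons.
hidden = [relu(v) for v in [-1.0, 0.5, 2.0, -0.3, 4.0]]

# A different random subset of neurons is dropped on every call.
print(dropout_train(hidden, rng=random.Random(0)))

# With p_drop = 0.5 every output is simply halved.
print(dropout_test(hidden))  # [0.0, 0.25, 1.0, 0.0, 2.0]
```

The scaling in dropout_test is what makes the expected input to the next layer match between training and test.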

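The crop-and-flip trick from item 16 can be sketched the same way. Again a toy version under my own assumptions: the “image” is a single-channel list of rows rather than an RGB array, and random_crop_flip is a made-up name. The factor of 2048 corresponds to 32 × 32 crop offsets times 2 reflections.

```python
import random

def random_crop_flip(image, crop=224, rng=random):
    """Return a random crop x crop patch of a 2-D image (list of rows),
    mirrored left-to-right with probability 0.5."""
    height, width = len(image), len(image[0])
    # Note: this sketch allows every valid offset (33 per axis for 256 -> 224).
    top = rng.randrange(height - crop + 1)
    left = rng.randrange(width - crop + 1)
    patch = [row[left:left + crop] for row in image[top:top + crop]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]
    return patch

# Toy 256x256 image in which each pixel stores its own (row, col) position.
image = [[(r, c) for c in range(256)] for r in range(256)]
patch = random_crop_flip(image, rng=random.Random(0))
print(len(patch), len(patch[0]))  # 224 224

# 32 offsets per axis, times 2 for the horizontal reflection.
augmentation_factor = (256 - 224) ** 2 * 2
print(augmentation_factor)  # 2048
```

A fresh patch would be drawn each time an image is shown to the network, so the augmented images never need to be stored.
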