Very Deep Convolutional Networks for Large-scale Image Recognition. Simonyan, Zisserman. ICLR 2015


  1. Describes the approach that took 1st place in localization and 2nd place in classification in the ImageNet challenge (ILSVRC) 2014
  2. Basic idea is to use very small convolutions (3×3) and a deep network (16-19 layers)
  3. Made the implementation public
  4. Works well on other data sets as well
  5. In the preceding year, people had moved toward smaller receptive windows, smaller strides, and using the training data more thoroughly (at multiple scales)
  6. Input is a fixed 224×224 RGB image; the only preprocessing is subtracting the mean RGB value (computed on the training set) from each pixel
  7. “Local response normalization” didn’t help performance and consumed more memory
  8. Earlier state of the art used 11×11 convolutions w/stride 4 (or 7×7 stride 2)
    1. Here they only did 3×3 with stride 1
    2. A stack of three 3×3 layers covers the same 7×7 receptive field as a single 7×7 layer, but has 3 non-linear rectification layers instead of 1, so the decision function is more discriminative
  9. Their smaller convolutions have a much smaller number of parameters, which can be seen as a form of regularization
  10. Training optimizes the multinomial logistic regression objective using mini-batch gradient descent (batch size 256, gradients from backprop) with momentum
  11. “The training was regularised by weight decay (the L2 penalty multiplier set to 5 · 10⁻⁴) and dropout regularisation for the first two fully-connected layers (dropout ratio set to 0.5).”
    1. <How does this weight decay work exactly?  Need to check it out>
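  12. Needs fewer epochs to converge than the network of Krizhevsky et al. (2012), despite being deeper, because some layers are pre-initialized and because greater depth with smaller filters acts as implicit regularization
    1. The first 4 convolutional layers and the last 3 fully connected layers of the deeper nets are initialized from a shallower net that was trained first; the remaining layers are initialized randomly
    2. They found out later that pretraining wasn’t really needed if they used a particular random initialization procedure (that of Glorot & Bengio, 2010)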
  13. Implementation based on Caffe, including very efficient multi-GPU parallelization
  14. With 4 Titan GPUs, a single net took 2-3 weeks to train
  15. Performance saturated at 19 layers; adding further layers didn’t improve it, although they say it might have if the data set were even larger
  16. “scale jittering helps” <I guess this has to do with how images are cropped and scaled to fit in 224×224, and that randomizing this process a bit helps>
  17. “Notably, we did not depart from the classical ConvNet architecture of LeCun et al. (1989), but improved it by substantially increasing the depth.”
  18. Method was simpler than a number of other near state-of-the-art approaches
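
A quick sanity check of items 8 and 9: stacking three 3×3 convolutions (stride 1) sees the same 7×7 window as one 7×7 convolution, but with fewer weights. This is only a sketch; the helper names and the channel count of 256 are picked for illustration and do not come from the paper's code. Biases are ignored, and every layer is assumed to have the same number of input and output channels.

```python
def conv_params(kernel, channels, layers=1):
    """Weight count (no biases) for `layers` stacked kernel x kernel
    convolutions, each with `channels` input and output channels."""
    return layers * kernel * kernel * channels * channels


def receptive_field(kernel, layers, stride=1):
    """Receptive field (in input pixels) of `layers` stacked convolutions
    that all share the same kernel size and stride."""
    rf, jump = 1, 1
    for _ in range(layers):
        rf += (kernel - 1) * jump  # each layer widens the window
        jump *= stride             # spacing between adjacent outputs
    return rf


if __name__ == "__main__":
    channels = 256  # illustrative value, not tied to a specific layer

    stacked = conv_params(3, channels, layers=3)  # three 3x3 layers
    single = conv_params(7, channels)             # one 7x7 layer

    print("receptive field, 3 x (3x3):", receptive_field(3, 3))
    print("receptive field, 1 x (7x7):", receptive_field(7, 1))
    print("weights, 3 x (3x3):", stacked)
    print("weights, 1 x (7x7):", single)
    print("ratio:", round(stacked / single, 3))
```

In general the stack has 27·C² weights against 49·C² for the single 7×7 layer with C channels, so roughly 55% of the parameters for the same receptive field, plus two extra non-linearities.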