Building High-level Features Using Large Scale Unsupervised Learning. Le, Ranzato, Monga, Devin, Chen, Corrado, Dean, Ng. ICML 2012

<Notes based on arXiv version>

<I have a somewhat strange observation: the input space considered is 200×200 RGB images, so each image is 200×200×3 = 120,000 values (about 30 million if each value is multiplied out over 256 shades), but the model has 1 billion connections (although perhaps some are held fixed?). Still, it seems like the solution to the problem is larger than the problem itself?>
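The arithmetic behind the observation above can be checked quickly. This is only a back-of-the-envelope sketch; the variable names are made up for illustration, and the 256 shades per channel is the assumption stated in the note. Note that the number of distinct possible images (256 raised to the number of values) is far larger than either count below.

```python
import math

# Figures taken from the notes: 200x200 RGB input, 1 billion connections.
height, width, channels = 200, 200, 3
shades = 256  # assumed 8-bit intensity per channel

values_per_image = height * width * channels     # input dimensionality
values_times_shades = values_per_image * shades  # the "about 30 million" figure
connections = 1_000_000_000

# Connections per input value: how much larger the model is than one input.
connections_per_value = connections / values_per_image

# Number of decimal digits in the count of distinct possible images,
# 256 ** values_per_image, computed via logarithms to avoid a huge integer.
digits_in_image_count = math.floor(values_per_image * math.log10(shades)) + 1
```

Under these assumptions the model has several thousand connections per input value, which is the sense in which it is "larger than the problem".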

  1. Concerned with building class-specific feature detectors from unlabeled data (ex/ face detector from random images)
  2. Attempted w/ “… 9-layered locally connected sparse autoencoder with pooling and local contrast normalization…” Network has 1 billion connections
    1. Trained on 10 million 200×200 images
  3. Trained the network with stochastic gradient descent across 16,000 cores (1,000 machines); training took 3 days
  4. “Contrary to what appears to be a widely-held intuition, our experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not.”
  5. Robust to all sorts of image transformations
  6. Because net isn’t trained to recognize faces, it is also sensitive to all sorts of other things – a few identified are human bodies and cat faces
  7. “Starting with these learned features, we trained our network to obtain 15.8% accuracy in recognizing 22,000 object categories from ImageNet, a leap of 70% relative improvement over the previous state-of-the-art.”
    1. <In contrast, Krizhevsky et al.'s model gets a 15.3% top-5 error rate after about a week of training on a single desktop machine with GPUs, although that is on the simpler 1,000-category ILSVRC task. Still, I think the future will be in GPUs here. I think that approach only uses the labeled data in the dataset though, so this and that are a little apples and oranges>
  8. The approach of building a class identifier from unlabeled images is inspired by class-specific, so-called “grandmother neurons” that exist in the brain
    1. Evidence that these respond either to classes such as faces or hands, or even particular people
  9. Almost all vision approaches for class identification rely on labelled data – “Although approaches that make use of inexpensive unlabeled data are often preferred, they have not been shown to work well for building high-level features.”
  10. 2 motivations for this work
    1. Is it possible to develop features for a classifier from unlabeled data?
    2. Can grandmother neurons possibly be learned from unlabeled data? “Informally, this would suggest that it is at least in principle possible that a baby learns to group faces into one class because it has seen many of them and not because it is guided by supervision or rewards.”
  11. Previous work on unlabeled data has mostly learned low-level feature detectors such as edge or blob detectors, as opposed to the high-level features explored here that code for particular classes
    1. One possible reason high-level feature detectors haven’t been found from training on unlabeled data is the computational requirements: it’s extremely rare that someone tries an experiment on this scale, and that may be what is necessary to actually develop such feature detectors.
  12. Images are pulled from YouTube
  13. Uses “local receptive fields” <not sure what that is> to reduce cross-machine communication
  14. The OpenCV face detector found faces in less than 3% of the training images. The network trained on that corpus still developed an accurate face detector, even though faces are a very small part of it
  15. Cites Olshausen & Field’s paper on sparse coding <In paper queue> “… sparse coding can be trained on unlabeled images to yield receptive fields akin to V1 simple cells (…)”
  16. Early sparse coding work used shallow architectures that learn low-level features like Gabor filters and simple invariances
  17. Network “… can be viewed as a sparse deep autoencoder with three important ingredients: local receptive fields, pooling and local contrast normalization.”
  18. Uses local receptive fields <as mentioned above>.  “This biologically inspired idea proposes that each feature in the autoencoder can connect only to a small region of the lower layer.”
  19. L2 pooling to “… achieve invariance to local deformations… allows the learning of invariant features (…).”
  20. Network has 3 major layers, with each major layer containing single layers for local filtering, local pooling, and local contrast normalization (so altogether the network has depth 3×3=9)
  21. “The first and second sublayers are often known as filtering (or simple) and pooling (or complex) respectively.  The third sublayer performs local subtractive and divisive normalization and it is inspired by biological and computational models (…).”  It’s argued this structure exists in the brain
    1. 1st sublayer has receptive fields over 18×18 pixels, 2nd pools over 5×5 overlapping neighborhoods of features (pooling size)
    2. Neurons on 1st sublayer connect to pixels in all input channels/maps.  Neurons in 2nd sublayer connect to pixels of only one channel/map
    3. 1st sublayer outputs are linear; 2nd sublayer output is the square root of the sum of its squared inputs, hence L2 pooling
  22. Weights of the 2nd (pooling) sublayers are fixed, and the encoding and decoding weights of the 1st (filtering) sublayers are learned. The objective includes a parameter that trades off sparsity against reconstruction.  All learned parameters are trained jointly.
  23. Optimization through asynchronous stochastic gradient descent
  24. On to results
  25. The best <single> neuron in the network has 81.7% accuracy detecting faces; guessing all negative gives 64.8%.  Best performance from a single-layer network is 71% <It would be nice to see other metrics like specificity or whatever>
    1. Removing local contrast normalization makes accuracy drop to 78.5%
  26. When images with faces (as identified by OpenCV) were removed from the training set, the face-detection accuracy of the best neuron fell to 72.5%, which is as bad as a simple linear filter
  27. <I’m not clear on how they did the unsupervised training, but since it’s said to be like an autoencoder it’s probably based on the error of reconstructing the data itself.>
  28. To do the ImageNet task, logistic classifiers were added on top of the network, and then both the lower layers and the logistic classifiers were adjusted (fine-tuned) on the labeled data
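
The building blocks described in the notes above (L2 pooling over 5×5 neighborhoods within a single map, subtractive and divisive local contrast normalization, and an objective that trades off reconstruction against sparsity) can be sketched in NumPy. This is a minimal illustration and not the paper's implementation: the function names, the uniform normalization window, the constants `c`, `lam`, and `eps`, and the dense weight matrices (which ignore local receptive fields) are all assumptions made for the example.

```python
import numpy as np

def l2_pool(feature_map, size=5, stride=1):
    """L2 pooling on one map: sqrt of the sum of squares in each window."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            window = feature_map[r:r + size, c:c + size]
            out[i, j] = np.sqrt(np.sum(window ** 2))
    return out

def local_contrast_normalize(feature_map, size=5, c=0.01):
    """Subtract the local mean, then divide by the local standard deviation.

    Assumption: uniform (not Gaussian) weighting over a size x size window,
    with reflect padding at the borders and a floor `c` on the divisor.
    """
    pad = size // 2
    h, w = feature_map.shape
    padded = np.pad(feature_map, pad, mode="reflect")
    centered = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            centered[i, j] = feature_map[i, j] - padded[i:i + size, j:j + size].mean()
    padded_c = np.pad(centered, pad, mode="reflect")
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            local_std = np.sqrt(np.mean(padded_c[i:i + size, j:j + size] ** 2))
            out[i, j] = centered[i, j] / max(c, local_std)
    return out

def sparse_autoencoder_loss(x, W1, W2, H, lam=0.1, eps=1e-6):
    """Reconstruction error plus lam times a sparsity penalty on pooled units.

    x:  input vector, shape (d,)
    W1: encoding weights, shape (d, k)  (learned)
    W2: decoding weights, shape (d, k)  (learned)
    H:  pooling weights, shape (p, k)   (fixed, non-negative)
    """
    z = W1.T @ x                          # linear filter responses
    reconstruction = np.sum((W2 @ z - x) ** 2)
    pooled = np.sqrt(eps + H @ (z ** 2))  # L2 pooling of the responses
    return reconstruction + lam * np.sum(pooled)
```

For example, `l2_pool(np.ones((18, 18)))` returns a 14×14 map because overlapping 5×5 windows are taken with stride 1. Raising `lam` in `sparse_autoencoder_loss` pushes the pooled responses toward zero at the cost of a worse reconstruction, which is the tradeoff mentioned in the notes.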
