Deep Visual-Semantic Alignments for Generating Image Descriptions. Karpathy, Fei-Fei. Tech Report 2014


  1. Model generates textual description of natural images
  2. Trained on a corpus of images paired with textual descriptions
  3. “Our approach is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding.  We then describe a Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions.”
  4. Previous work in the area revolves around fairly constrained types of descriptions
    1. Discusses related work at length
  5. Use verbal descriptions of images as “weak labels in which contiguous segments of words correspond to some particular, but unknown location in the image.  Our approach is to infer these alignments and use them to learn a generative model of descriptions.”
  6.  “…our work takes advantage of pretrained word vectors […] to obtain low-dimensional representations of words.  Finally, Recurrent Neural Networks have been previously used in language modeling […], but we additionally condition these models on images.”
  7. Training happens on images coupled with text descriptions.
    1. They have a model that aligns sentence segments to image segments through a “multimodal embedding”
    2. Then these correspondences are fed into their multimodal RNN which learns to generate descriptions
  8. Use bidirectional RNN to compute a representation for each word in the sentence, which means dependency trees aren’t needed while still “allowing unbounded interactions of words and their context in the sentence.” (Sketched below.)
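A minimal NumPy sketch of the bidirectional idea: run a forward and a backward recurrent pass over the word vectors and merge them, so each word’s representation sees the whole sentence. The weight names, ReLU activation, and dimensions are illustrative assumptions, not the paper’s exact formulation.

```python
import numpy as np

def brnn_word_vectors(xs, p):
    """xs: (T, d) input word vectors -> (T, h) context-aware word vectors."""
    T, h = len(xs), p["Wf"].shape[0]
    hf, hb = np.zeros((T, h)), np.zeros((T, h))
    f = np.zeros(h)
    for t in range(T):                    # forward pass: left-to-right context
        f = np.maximum(0, p["Wf"] @ f + p["Uf"] @ xs[t])
        hf[t] = f
    b = np.zeros(h)
    for t in reversed(range(T)):          # backward pass: right-to-left context
        b = np.maximum(0, p["Wb"] @ b + p["Ub"] @ xs[t])
        hb[t] = b
    return hf + hb                        # merge the two directions

# Toy usage: 5 words, 50-dim inputs, 20-dim hidden states, random weights.
rng = np.random.default_rng(0)
p = {k: rng.normal(scale=0.1, size=s) for k, s in
     [("Wf", (20, 20)), ("Uf", (20, 50)), ("Wb", (20, 20)), ("Ub", (20, 50))]}
words = brnn_word_vectors(rng.normal(size=(5, 50)), p)
```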
  9. Use a pretrained Region Convolutional Neural Network (RCNN) to pull out both the what and the where of the image
  10. They embed words in the same-dimensional space as the image regions; each word, together with a window of surrounding words, is transformed into a vector of the same size (see the sketch below)
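With region vectors (from the RCNN of the previous point) and word vectors in one space, an image-sentence score falls out of their dot products. A minimal sketch, assuming the simplified form in which each word is scored against its best-matching region and the scores are summed; the paper trains such a score with a ranking objective over matching vs. mismatched image-sentence pairs.

```python
import numpy as np

def image_sentence_score(regions, words):
    """regions: (R, h) region embeddings; words: (T, h) word embeddings,
    both already projected into the same h-dimensional multimodal space."""
    sims = words @ regions.T        # (T, R) similarity of every word-region pair
    return sims.max(axis=1).sum()   # each word counts its best-matching region
```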
  11. [Figure omitted: screenshot of a diagram from the paper]
  12. Hidden representation they use is on the order of hundreds of dimensions
  13. Use a Markov Random Field to encourage adjacent words to align to the same area of the image (a decoding sketch follows)
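A minimal dynamic-programming sketch of decoding such a smoothed alignment, assuming a chain MRF whose pairwise term charges a constant penalty beta whenever adjacent words switch regions; beta and all names here are illustrative, not the paper’s exact energy.

```python
import numpy as np

def smooth_alignments(sims, beta=1.0):
    """sims: (T, R) word-region similarities -> one region index per word.
    Minimizes sum_t -sims[t, a_t] + beta * #{t : a_t != a_t+1} via Viterbi."""
    T, R = sims.shape
    dp = -sims[0]                       # best energy ending at each region
    back = np.zeros((T, R), dtype=int)
    for t in range(1, T):
        switch = dp.min() + beta        # cheapest arrival that changes region
        back[t] = np.where(dp <= switch, np.arange(R), dp.argmin())
        dp = -sims[t] + np.minimum(dp, switch)
    a = np.empty(T, dtype=int)          # backtrack the minimum-energy path
    a[-1] = dp.argmin()
    for t in range(T - 1, 0, -1):
        a[t - 1] = back[t, a[t]]
    return a
```

Raising beta trades per-word fit for longer runs of words assigned to the same region.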
  14. RNN is trained to combine the following to predict the next word (see the sketch after this list):
    1. Word (initialized to “the”)
    2. Previous hidden state (initialized to 0)
    3. Image information
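A minimal NumPy sketch of one step of such a generator, matching the three inputs above: the image term enters only at the first step, and the hidden state plus current word yield a distribution over the next word. Parameter names, shapes, and the ReLU nonlinearity are illustrative assumptions rather than the paper’s exact equations.

```python
import numpy as np

def mrnn_step(x, h_prev, v, p, first_step=False):
    """x: current word vector; h_prev: previous hidden state (zeros at t=0);
    v: CNN image vector. Returns new hidden state and softmax over the vocab."""
    img = p["Whv"] @ v if first_step else 0.0   # condition on the image once
    h = np.maximum(0.0, p["Wxh"] @ x + p["Whh"] @ h_prev + img + p["bh"])
    logits = p["Who"] @ h + p["bo"]
    e = np.exp(logits - logits.max())           # numerically stable softmax
    return h, e / e.sum()
```

Generation then loops: start with a zero hidden state and the initial word (“the”, per the note above), pick the next word from the returned distribution, and feed it back in until an end token is produced.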
  15. Optimized with SGD (minibatch size 100), momentum, and dropout; a sketch of the momentum update is below
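For reference, a minimal sketch of the SGD-with-momentum update they mention; the learning rate and momentum coefficient are illustrative defaults, not the paper’s settings. It works elementwise on scalars or NumPy arrays.

```python
def sgd_momentum(w, grad, vel, lr=0.01, mu=0.9):
    """One update: vel keeps an exponentially decaying sum of past gradients,
    damping oscillation across the 100-example minibatches."""
    vel = mu * vel - lr * grad
    return w + vel, vel
```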
  16. RNN is difficult to optimize, partly because of the skewed rates of occurrence of rare vs. common words
  17. Beat the competitor approaches on all datasets tested (not by enormous margins, but it wins)
  18. They can accurately deal with making sentences that involve even rare items such as “accordion”, which most other models would miss
  19. <The results they show are pretty amazing, although the forms of the sentences are pretty uniform and simplistic subject-verb-object>
  20. “going directly from an image-sentence dataset to region-level annotations as part of a single model that is trained end-to-end with a single objective remains an open problem.”