- Model generates textual descriptions of natural images
- Trained on a corpus of images paired with textual descriptions
- “Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions.”
- Previous work in the area revolves around fairly constrained forms of description (e.g., fixed sentence templates or retrieval of existing captions)
- Discusses related work at length
- Use verbal descriptions of images as “weak labels in which contiguous segments of words correspond to some particular, but unknown location in the image. Our approach is to infer these alignments and use them to learn a generative model of descriptions.”
- “…our work takes advantage of pretrained word vectors […] to obtain low-dimensional representations of words. Finally, Recurrent Neural Networks have been previously used in language modeling […], but we additionally condition these models on images.”
- Training happens on images with coupled text descriptions
- They have a model that aligns sentence segments to image segments through a “multimodal embedding”
- Then these correspondences are fed into their multimodal RNN which learns to generate descriptions
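A minimal NumPy sketch of the alignment stage, following the paper's image-sentence score (each word votes for its best-matching region) and the max-margin ranking objective; variable names and shapes here are my assumptions, not the authors' code:

```python
import numpy as np

def image_sentence_score(V, S):
    """Score an image-sentence pair: every word aligns to its best region.
    V: (num_regions, h) embedded regions; S: (num_words, h) embedded words."""
    dots = S @ V.T                    # (num_words, num_regions) inner products
    return dots.max(axis=1).sum()     # sum over words of the best-region score

def ranking_loss(region_sets, word_sets, margin=1.0):
    """Max-margin objective: true image-sentence pairs (same index k)
    should outscore all mismatched pairs by at least the margin."""
    n = len(region_sets)
    scores = np.array([[image_sentence_score(region_sets[k], word_sets[l])
                        for l in range(n)] for k in range(n)])
    loss = 0.0
    for k in range(n):
        for l in range(n):
            if l != k:
                loss += max(0.0, scores[k, l] - scores[k, k] + margin)  # rank sentences given image k
                loss += max(0.0, scores[l, k] - scores[k, k] + margin)  # rank images given sentence k
    return loss
```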
- Use a bidirectional RNN to compute the representation of each word in a sentence; this dispenses with dependency trees while “allowing unbounded interactions of words and their context in the sentence.”
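A rough sketch of how such a BRNN can compute a context-aware vector for every word; the weight names are hypothetical and biases are omitted for brevity:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def brnn_word_vectors(E, We, Wf, Wb, Wd):
    """E: (T, h) input word embeddings for one sentence; each W*: (h, h).
    Returns s: (T, h), one vector per word that sees the whole sentence."""
    T, h = E.shape
    e = relu(E @ We.T)                          # per-word activation
    hf = np.zeros((T, h))                       # forward hidden states
    hb = np.zeros((T, h))                       # backward hidden states
    for t in range(T):
        prev = hf[t - 1] if t > 0 else np.zeros(h)
        hf[t] = relu(e[t] + Wf @ prev)          # left-to-right context
    for t in reversed(range(T)):
        nxt = hb[t + 1] if t < T - 1 else np.zeros(h)
        hb[t] = relu(e[t] + Wb @ nxt)           # right-to-left context
    return relu((hf + hb) @ Wd.T)               # fuse both directions
```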
- Use a pretrained Region Convolutional Neural Network (RCNN) to pull out both the “what” and the “where” of the image
- They embed words into the same h-dimensional space that image regions occupy; each word, together with a window of surrounding context, is transformed into a vector of the same size
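On the image side, the paper projects each region's 4096-dimensional CNN activations into that shared h-dimensional space; a one-line sketch of the projection (the 4096-d feature size follows the paper's RCNN setup, the variable names are assumed):

```python
import numpy as np

def embed_regions(cnn_features, Wm, bm):
    """Project RCNN region features into the shared embedding space.
    cnn_features: (num_regions, 4096); Wm: (h, 4096); bm: (h,)."""
    return cnn_features @ Wm.T + bm   # v_i = Wm * CNN(region_i) + bm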
- Hidden representation they use is on the order of hundreds of dimensions
- Use a Markov Random Field to encourage adjacent words to align to the same region of the image
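A possible Viterbi-style dynamic-programming decoder for that chain MRF, where a pairwise bonus beta rewards consecutive words that keep the same region; this is an illustrative reconstruction under those assumptions, not the authors' implementation:

```python
import numpy as np

def align_words_to_regions(dots, beta=1.0):
    """dots: (T, M) word-region compatibility scores (inner products).
    Returns one region index per word, encouraging contiguous runs of
    words to share a region via the pairwise bonus beta."""
    T, M = dots.shape
    score = np.zeros((T, M))
    back = np.zeros((T, M), dtype=int)
    score[0] = dots[0]
    for t in range(1, T):
        trans = score[t - 1][:, None] + beta * np.eye(M)  # (prev, curr)
        back[t] = trans.argmax(axis=0)       # best previous region per choice
        score[t] = dots[t] + trans.max(axis=0)
    a = np.empty(T, dtype=int)
    a[-1] = score[-1].argmax()
    for t in range(T - 1, 0, -1):            # trace the best path backwards
        a[t - 1] = back[t, a[t]]
    return a
```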
- RNN is trained to combine the following to predict the next word (see the sketch after this list):
- Current word (initialized to a special START vector)
- Previous hidden state (initialized to 0)
- Image information (the context vector, which the paper feeds in only at the first time step)
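A minimal sketch of one generator step under those conventions; the weight names are hypothetical and biases are omitted:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                    # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_step(x, h_prev, bv, Whx, Whh, Woh, first_step=False):
    """x: current word vector; h_prev: previous hidden state;
    bv: image context, added to the hidden state only on the first step."""
    h = np.maximum(0.0, Whx @ x + Whh @ h_prev + (bv if first_step else 0.0))
    y = softmax(Woh @ h)               # distribution over the next word
    return h, y
```

Generation then proceeds by feeding the predicted word back in as the next x until an END token is produced.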
- Optimized with SGD: minibatches of 100 image-sentence pairs, momentum, and dropout regularization
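In PyTorch terms, that training setup might look roughly like this; the layer sizes, learning rate, momentum, and dropout values are illustrative stand-ins, not numbers from the paper:

```python
import torch

model = torch.nn.Sequential(          # stand-in for the multimodal RNN
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.5),          # dropout regularization
    torch.nn.Linear(512, 8000),       # vocabulary-sized output layer
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# each optimization step consumes a minibatch of 100 image-sentence pairs
```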
- RNN is difficult to optimize, in part because of the skewed frequencies of rare vs. common words
- Beat the competitor approaches on all datasets tested (not by enormous margins, but it wins)
- They can accurately generate sentences involving even rare items such as “accordion”, which most other models would miss
- <The results they show are pretty amazing, although the forms of the sentences are pretty uniform and simplistic subject-verb-object>
- “going directly from an image-sentence dataset to region-level annotations as part of a single model that is trained end-to-end with a single objective remains an open problem.”