
Deep Visual-Semantic Alignments for Generating Image Descriptions. Karpathy, Fei-Fei. Tech Report 2014

  1. Model generates textual description of natural images
  2. Trained from a corpus of images with included textual descriptions
  3. “Our approach is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding.  We then describe a Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions.”
  4. Previous work in the area revolves around fairly constrained types of descriptions
    1. Discusses related work at length
  5. Use verbal descriptions of images as “weak labels in which contiguous segments of words correspond to some particular, but unknown location in the image.  Our approach is to infer these alignments and use them to learn a generative model of descriptions.”
  6.  “…our work takes advantage of pretrained word vectors […] to obtain low-dimensional representations of words.  Finally, Recurrent Neural Networks have been previously used in language modeling […], but we additionally condition these models on images.”
  7. Training happens on images with coupled text description.
    1. They have a model that aligns sentence segments to image segments through a “multimodal embedding”
    2. Then these correspondences are fed into their multimodal RNN which learns to generate descriptions
  8. Use bidirectional RNN to compute sentence word representation, which means dependency trees aren’t needed “and allowing unbounded interactions of words and their context in the sentence.”
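A toy sketch of a bidirectional RNN producing per-word representations (my own illustration with random, untrained weights; the paper's exact formulation differs):

```python
import numpy as np

rng = np.random.default_rng(0)

def brnn_word_representations(word_vectors, d_hidden=4):
    """Toy bidirectional RNN: each word's representation concatenates a
    forward pass (left context) and a backward pass (right context).
    All weights here are random placeholders, not trained values."""
    d_in = word_vectors.shape[1]
    Wf = rng.standard_normal((d_hidden, d_in)) * 0.1      # forward input weights
    Wb = rng.standard_normal((d_hidden, d_in)) * 0.1      # backward input weights
    Uf = rng.standard_normal((d_hidden, d_hidden)) * 0.1  # forward recurrence
    Ub = rng.standard_normal((d_hidden, d_hidden)) * 0.1  # backward recurrence

    T = len(word_vectors)
    hf = np.zeros((T, d_hidden))
    hb = np.zeros((T, d_hidden))
    for t in range(T):                       # left-to-right pass
        prev = hf[t - 1] if t > 0 else np.zeros(d_hidden)
        hf[t] = np.tanh(Wf @ word_vectors[t] + Uf @ prev)
    for t in reversed(range(T)):             # right-to-left pass
        nxt = hb[t + 1] if t < T - 1 else np.zeros(d_hidden)
        hb[t] = np.tanh(Wb @ word_vectors[t] + Ub @ nxt)
    return np.concatenate([hf, hb], axis=1)  # each word sees both contexts

words = rng.standard_normal((5, 3))          # 5 words, 3-dim embeddings
reps = brnn_word_representations(words)
print(reps.shape)                            # (5, 8)
```

Because every word's vector depends on both passes, no dependency tree is needed and context can flow unbounded in either direction.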
  9. Use a pretrained Region Convolutional Neural Network to pull out both what and where of the image
  10. They embed words in the same dimensional space that image regions occupy; they do this by taking each word together with a surrounding window of words and transforming that into a vector of equal size
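A toy sketch of that windowed embedding (dimensions and the projection matrix are made up; the paper uses much larger sizes and learned weights):

```python
import numpy as np

rng = np.random.default_rng(1)

d_word, d_embed, window = 3, 6, 1   # toy sizes, not the paper's

# Hypothetical projection: concatenate each word with its +/-1 neighbors,
# then map into the same d_embed space the image regions live in.
W = rng.standard_normal((d_embed, d_word * (2 * window + 1))) * 0.1

def embed_words(word_vectors):
    T = len(word_vectors)
    padded = np.vstack([np.zeros((window, d_word)),
                        word_vectors,
                        np.zeros((window, d_word))])
    out = np.empty((T, d_embed))
    for t in range(T):
        ctx = padded[t:t + 2 * window + 1].ravel()  # word + surrounding window
        out[t] = W @ ctx
    return out

sentence = rng.standard_normal((4, d_word))
region_space_words = embed_words(sentence)
print(region_space_words.shape)   # (4, 6): same dimensionality as the regions
```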
  11. [Screenshot of a figure from the paper omitted]
  12. Hidden representation they use is on the order of hundreds of dimensions
  13. Use Markov Random Fields to enforce that adjacent words are found to correspond to similar areas in the image
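The paper's actual MRF objective is more involved; here is a Viterbi-style toy of just the smoothness idea — pick one region per word, rewarding word-region affinity but penalizing adjacent words that jump to a different region. All numbers are invented:

```python
import numpy as np

def align_words_to_regions(scores, switch_penalty=1.0):
    """Chain-MRF-flavored toy: choose a region for each word to maximize
    total affinity, paying `switch_penalty` whenever consecutive words
    land in different regions.  scores[t, r] = affinity of word t, region r."""
    T, R = scores.shape
    dp = scores[0].copy()                    # best score ending at each region
    back = np.zeros((T, R), dtype=int)       # backpointers for recovery
    same = np.arange(R)[:, None] != np.arange(R)[None, :]
    for t in range(1, T):
        trans = dp[:, None] - switch_penalty * same   # cost of prev -> current
        back[t] = trans.argmax(axis=0)
        dp = trans.max(axis=0) + scores[t]
    path = [int(dp.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Word 2's best solo region is 1, but the smoothness penalty keeps it at 0.
affinity = np.array([[2.0, 0.0],
                     [2.0, 0.0],
                     [1.8, 2.0],
                     [2.0, 0.0]])
print(align_words_to_regions(affinity, switch_penalty=1.0))  # [0, 0, 0, 0]
```

With the penalty set to 0, word 2 would flip to region 1 — the smoothness term is what keeps contiguous word segments on the same region.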
  14. RNN is trained to combine the following to predict the next word:
    1. Word (initialized to “the”)
    2. Previous hidden state (initialized to 0)
    3. Image information
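A toy sketch of that recurrence (my illustration with random untrained weights and a tiny vocabulary — not the paper's trained model): each step combines the current word, the previous hidden state, and the image features to score the next word.

```python
import numpy as np

rng = np.random.default_rng(2)

V, d_h, d_img = 6, 5, 4          # toy vocab size, hidden size, image-feature size
Wxh = rng.standard_normal((d_h, V)) * 0.1      # word -> hidden
Whh = rng.standard_normal((d_h, d_h)) * 0.1    # hidden -> hidden
Wih = rng.standard_normal((d_h, d_img)) * 0.1  # image -> hidden
Why = rng.standard_normal((V, d_h)) * 0.1      # hidden -> next-word scores

def step(word_id, h_prev, img):
    """One step of a toy image-conditioned RNN."""
    x = np.zeros(V); x[word_id] = 1.0              # one-hot current word
    h = np.tanh(Wxh @ x + Whh @ h_prev + Wih @ img)
    logits = Why @ h
    probs = np.exp(logits - logits.max())          # stable softmax
    return h, probs / probs.sum()

img = rng.standard_normal(d_img)
h = np.zeros(d_h)                 # previous hidden state initialized to 0
word = 0                          # word input initialized to "the" by convention
for _ in range(3):                # greedily emit three words
    h, p = step(word, h, img)
    word = int(p.argmax())
    print(word, round(float(p[word]), 3))
```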
  15. Optimized with SGD, 100-size minibatch, momentum, dropout
  16. The RNN is difficult to optimize because of the disparity in the rates of occurrence of rare vs. common words
  17. Beat the competitor approaches on all datasets tested (not by enormous margins, but it wins)
  18. They can accurately deal with making sentences that involve even rare items such as “accordion”, which most other models would miss
  19. <The results they show are pretty amazing, although the forms of the sentences are pretty uniform and simplistic subject-verb-object>
  20. “going directly from an image-sentence dataset to region-level annotations as part of a single model that is trained end-to-end with a single objective remains an open problem.”

A Review of Unsupervised Feature Learning and Deep Learning for Time-Series Modeling. Langkvist, Karlsson, Loutfi. Pattern Recognition Letters 2014.

  1. Consider continuous, high dimensional, noisy time series data
  2. Also assume it may not be the case that there is enough information in the data to do very accurate prediction (high Bayes error)
  3. Assume nonstationary: x(t) predicts y(t), but at a different t, the same value for x may produce a different prediction
    1. To resolve this, some form of context is required.  With the correct context the process is stationary
    2. Amount of time needed for context may be unknown, and may be very large
  4. May be nonstationary so that summary statistics change over time.  In some cases, change in frequency is so relevant that it’s better to work in the frequency domain than the time domain
    1. <Been meaning to learn about that>
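A quick illustration of why the frequency domain can be the right representation (toy signal, NumPy's FFT): a signal whose dominant frequency changes over time looks messy as raw samples, but a windowed FFT exposes the change directly.

```python
import numpy as np

fs = 100                                    # sample rate (Hz)
t = np.arange(0, 2.0, 1 / fs)
# 5 Hz sine for the first second, 20 Hz sine for the second.
sig = np.where(t < 1.0, np.sin(2 * np.pi * 5 * t), np.sin(2 * np.pi * 20 * t))

def dominant_freq(window):
    """Frequency bin with the largest magnitude in the window's spectrum."""
    spectrum = np.abs(np.fft.rfft(window))
    freqs = np.fft.rfftfreq(len(window), 1 / fs)
    return freqs[spectrum.argmax()]

first, second = sig[:100], sig[100:]
print(dominant_freq(first), dominant_freq(second))   # 5.0 20.0
```

The summary statistic that matters (dominant frequency) is constant within each window but invisible in the raw time-domain samples — the nonstationarity the paper describes.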
  5. In vision, often things like translation and scale invariance are desired.  In time series analysis, we desire invariance to translations in time
  6. So yeah, it’s a gnarly problem.  Picking the right representation is key
    1. Here they consider finding a representation
  7. Discusses hidden (and hidden and gated) Boltzmann machines, although I won’t take notes because it probably isn’t what they really will use anyway
  8. Auto-encoders
  9. Basic linear auto-encoder is same as PCA
  10. Terms in the cost function include one for sparsity and one to keep weights close to 0 <listed as 2 different things, but how are they distinct?>
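As I understand it the two terms are distinct: weight decay penalizes the parameters themselves, while the sparsity term penalizes the hidden activations. A toy sketch of such an objective (made-up sizes and random weights, not any specific paper's formulation):

```python
import numpy as np

rng = np.random.default_rng(3)

def autoencoder_loss(X, W1, W2, sparsity_w=0.1, decay_w=0.01):
    """Toy autoencoder objective with both penalties: weight decay shrinks
    the parameters, the sparsity term pushes hidden *activations* toward 0."""
    H = np.tanh(X @ W1)                 # hidden activations
    Xhat = H @ W2                       # linear reconstruction
    recon = np.mean((X - Xhat) ** 2)
    sparsity = sparsity_w * np.mean(np.abs(H))              # L1 on activations
    decay = decay_w * (np.sum(W1 ** 2) + np.sum(W2 ** 2))   # L2 on weights
    return recon + sparsity + decay

X = rng.standard_normal((10, 4))        # 10 samples, 4 features
W1 = rng.standard_normal((4, 3)) * 0.1  # encoder
W2 = rng.standard_normal((3, 4)) * 0.1  # decoder
print(autoencoder_loss(X, W1, W2))
```

You can have tiny weights that still produce dense activations, or large weights whose activations happen to be sparse, so the two penalties constrain different things.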
  11. Recurrent neural network
  12. Regularization terms “… prevents the trivial learning of a 1-to-1 mapping of the input to the hidden units.”
  13. RBMs don’t need regularization because stochastic binary hidden unit acts as a regularizer, although it is possible to add regularization on top anyway
  14. Recurrent neural network
  15. Trained by backprop-through-time
  16. “RNNs can be seen as very deep networks with shared parameters at each layer when unfolded in time.”
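That equivalence is easy to see in code — the same toy recurrence written as a loop over time and as an explicit stack of "layers" reusing one weight pair gives identical output:

```python
import numpy as np

rng = np.random.default_rng(4)

d = 3
W = rng.standard_normal((d, d)) * 0.5   # input weights, shared across steps
U = rng.standard_normal((d, d)) * 0.5   # recurrent weights, shared across steps

def rnn_forward(xs):
    """Plain RNN forward pass: apply the same (W, U) at every time step."""
    h = np.zeros(d)
    for x in xs:
        h = np.tanh(U @ h + W @ x)
    return h

xs = rng.standard_normal((6, d))
h_rnn = rnn_forward(xs)

# Same computation as an explicit 6-"layer" feedforward net with tied weights:
h = np.zeros(d)
for layer in range(6):
    h = np.tanh(U @ h + W @ xs[layer])

print(np.allclose(h_rnn, h))   # True
```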
  17. Deep learning
  18. Convolution, pooling
  19. Other methods for dealing with time data, aside from simple recurrent networks, include penalizing changes in the hidden layer from one time step to the next
  20. Also mention slow feature analysis
  21. “Temporal coherence is related to invariant feature representations since both methods want to achieve small changes in the feature representation for small changes in the input data.”
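A minimal sketch of what such a temporal-coherence penalty might look like (my own toy formulation): small changes in the feature representation between consecutive steps are cheap, rapid changes are expensive.

```python
import numpy as np

def temporal_coherence_penalty(H, weight=0.1):
    """Penalize the feature representation for changing quickly between
    consecutive time steps.  H has shape (T, d): one feature row per step."""
    diffs = H[1:] - H[:-1]
    return weight * np.mean(np.sum(diffs ** 2, axis=1))

slow = np.cumsum(np.full((5, 2), 0.01), axis=0)    # features drift slowly
fast = np.random.default_rng(5).standard_normal((5, 2))  # features jump around
print(temporal_coherence_penalty(slow) < temporal_coherence_penalty(fast))  # True
```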
  22. Hidden Markov Models
    1. But require discrete states
    2. Limited representational capacity
    3. Not set up well to track history
  23. “The use of Long-short term memory (…) or hessian-free optimizer (…) can produce recurrent networks that has a memory of over 100 time steps.”
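A toy LSTM step to show the mechanism behind that long memory (random placeholder weights, not a trained model): the forget gate lets the cell state pass through many steps nearly unchanged, unlike a plain tanh RNN whose state is overwritten every step.

```python
import numpy as np

rng = np.random.default_rng(6)

d_in, d_h = 2, 3
# One weight matrix per gate (input, forget, output) plus the cell candidate.
Wi, Wf, Wo, Wc = (rng.standard_normal((d_h, d_in + d_h)) * 0.5 for _ in range(4))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c):
    """One LSTM step: gates decide what to keep, add, and expose."""
    z = np.concatenate([x, h])
    i, f, o = sigmoid(Wi @ z), sigmoid(Wf @ z), sigmoid(Wo @ z)
    c = f * c + i * np.tanh(Wc @ z)   # gated, mostly-additive cell update
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
for t in range(120):                   # run well past 100 steps
    h, c = lstm_step(rng.standard_normal(d_in), h, c)
print(h.shape, c.shape)                # (3,) (3,)
```

When the forget gate f is near 1, the gradient through c is close to identity across steps, which is the usual explanation for the 100+ step memory the quote mentions.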
  24. Some models are generative, others are discriminative.  Auto-encoder isn’t generative, but “a probabilistic interpretation can be made using auto-encoder scoring (…)”
  25. <According to the table in the paper, recurrent neural networks are most appropriate for what we are considering>
  26. Discusses video data
    1. Lots of relevant work to investigate here
  27. “The use of deep learning, feature learning, and convolution with pooling has propelled the advances in video processing.”  Deep learning is natural because it is state of the art on still images, but extensions are needed to deal with the temporal aspect
  28. “The early attempts at extending deep learning algorithms to video data was done by modelling the transition between two frames.  The use of temporal pooling extends the time-dependencies a model can learn beyond a single frame transition.  However, the time-dependency that has been modeled is still just a few frames.  A possible future direction for video processing is to look at models that can learn longer time-dependencies.”
  29. Other examples <given a fair amount of space, which I’m skipping> are stock prices and music recognition
  30. Motion Capture Data
  31. Previous applications were Temporal Restricted Boltzmann Machines (TRBM), and conditional RBM (Hinton in both papers), then recurrent TRBM
  32. Mirowski and LeCun used dynamic factored graphs to fill in missing mocap data
  33. “A motivation for using deep learning algorithms for motion capture data is that it has been suggested that human motion is composed of elementary building blocks (motion templates) and any complex motion is constructed from a library of these previously learned motion primitives (Flash and Hochner, 2005).  Deep networks can, in an unsupervised manner, learn these motion templates from raw data and use them to form complex human motions.”
  34. Section on machine olfaction, physiological eeg, meg, ecg
  35. “In order to capture long-term dependencies, the input size has to be increased, which can be impractical for multivariate signals or if the data has very long-term dependencies.  The solution is to use a model that incorporates temporal coherence, performs temporal pooling, or models sequences of hidden unit activations.”
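A toy sketch of the temporal-pooling option (my illustration): max-pool feature vectors over non-overlapping windows so one output row summarizes several time steps, letting a model see longer dependencies without a longer input window.

```python
import numpy as np

def temporal_max_pool(features, pool=4):
    """Max-pool (T, d) features over non-overlapping windows of `pool` steps."""
    T, d = features.shape
    T_trim = (T // pool) * pool            # drop any incomplete trailing window
    return features[:T_trim].reshape(T // pool, pool, d).max(axis=1)

x = np.arange(24, dtype=float).reshape(12, 2)   # 12 time steps, 2 features
pooled = temporal_max_pool(x, pool=4)
print(pooled.shape)     # (3, 2): each row now covers 4 original steps
print(pooled[0])        # [6. 7.]  (max over steps 0..3)
```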