DRAW: A Recurrent Neural Network For Image Generation. Gregor, Danihelka, Graves, Rezende, Wierstra. ICML 2015


  1. “…introduces the Deep Recurrent Attentive Writer (DRAW)…  DRAW networks combine a novel spatial attention mechanism that mimics the foveation of the human eye, with a sequential variational auto-encoding framework that allows for the iterative construction of complex images.”
  2. Can generate house numbers from the Google Street View House Numbers (SVHN) dataset that are indistinguishable from real images with the naked eye
  3. Instead of generating images all at once, this approach tries the equivalent of sketching an image first and then refining it
  4. “The core of the DRAW architecture is a pair of recurrent neural networks: an encoder network that compresses the real images presented during training, and a decoder that reconstitutes images after receiving codes. The combined system is trained end-to-end with stochastic gradient descent, where the loss function is a variational upper bound on the log-likelihood of the data. It therefore belongs to the family of variational auto-encoders…”
  5. Unlike earlier attention models that are trained with reinforcement learning, DRAW’s attention mechanism is fully differentiable and is trained with standard backpropagation: “In this sense it resembles the selective read and write operations developed for the Neural Turing Machine (Graves et al., 2014).”
  6. “…distribution over images. However there are three key differences. Firstly, both the encoder and decoder are recurrent networks in DRAW, so that a sequence of code samples is exchanged between them; moreover the encoder is privy to the decoder’s previous outputs, allowing it to tailor the codes it sends according to the decoder’s behaviour so far. Secondly, the decoder’s outputs are successively added to the distribution that will ultimately generate the data, as opposed to emitting this distribution in a single step. And thirdly, a dynamically updated attention mechanism is used to restrict both the input region observed by the encoder, and the output region modified by the decoder. In simple terms, the network decides at each time-step “where to read” and “where to write” as well as “what to write”.”
  7. The output of the encoder network is a hidden vector
  8. They use LSTM for their recurrent network
  9. The output of the encoder is used to parameterize a distribution over a latent vector (a diagonal Gaussian).  They use a diagonal Gaussian instead of the more common Bernoulli distribution because gradients of samples with respect to the distribution parameters are easy to obtain (the reparameterization trick)
  10. A sample from the latent distribution is then passed as input to the decoder
    1. The output of the decoder is added cumulatively to a canvas matrix, which becomes the image.  The number of steps used to write to the canvas is a parameter of the algorithm (a code sketch of this read/encode/sample/decode/write loop follows the list)
  11. “The total loss is therefore equivalent to the expected compression of the data by the decoder and prior.” (The second snippet after this list shows the corresponding reconstruction-plus-KL objective.)
  12. “…, we consider an explicitly two-dimensional form of attention, where an array of 2D Gaussian filters is applied to the image, yielding an image ‘patch’ of smoothly varying location and zoom.” (A sketch of this filterbank read also follows the list.)
  13. <skipping a bunch>
  14. Generated images of MNIST digits and Street View House Numbers look good, but generated natural images are very blurry and not identifiable as anything in particular, although there is clear structure in what is generated.
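
To make points 4 and 7–10 concrete, here is a minimal sketch of the DRAW loop without spatial attention (read the whole image, write the whole canvas). This is my own illustrative PyTorch code, not the authors’ implementation; the class and layer names (DrawSketch, enc_rnn, dec_rnn, write) and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class DrawSketch(nn.Module):
    def __init__(self, x_dim=784, h_dim=256, z_dim=100, T=10):
        super().__init__()
        self.T = T                                            # number of canvas-writing steps (point 10.1)
        self.enc_rnn = nn.LSTMCell(2 * x_dim + h_dim, h_dim)  # sees image, error image, and previous decoder state
        self.dec_rnn = nn.LSTMCell(z_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)                     # parameters of the diagonal Gaussian over z_t
        self.logvar = nn.Linear(h_dim, z_dim)
        self.write = nn.Linear(h_dim, x_dim)                  # "write" head; with attention this would emit a small patch

    def forward(self, x):                                     # x: (batch, x_dim), values in [0, 1]
        B = x.size(0)
        h_enc = c_enc = h_dec = c_dec = x.new_zeros(B, self.enc_rnn.hidden_size)
        canvas = torch.zeros_like(x)
        kl = x.new_zeros(B)
        for _ in range(self.T):
            x_hat = x - torch.sigmoid(canvas)                 # "error image": what the canvas still gets wrong
            r = torch.cat([x, x_hat], dim=1)                  # full-image read (no attention)
            h_enc, c_enc = self.enc_rnn(torch.cat([r, h_dec], dim=1), (h_enc, c_enc))
            mu, logvar = self.mu(h_enc), self.logvar(h_enc)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterised sample (point 10)
            h_dec, c_dec = self.dec_rnn(z, (h_dec, c_dec))
            canvas = canvas + self.write(h_dec)               # cumulative canvas update (point 10.1)
            kl = kl + 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1)
        return torch.sigmoid(canvas), kl.mean()               # reconstruction and latent KL term
```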
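For the objective quoted in point 11, the total loss is the reconstruction term plus the latent KL term. The snippet below is a hedged usage example: binary cross-entropy is one common choice of reconstruction loss for binarised images, and the batch shape is purely illustrative.

```python
import torch
import torch.nn.functional as F

model = DrawSketch()
x = torch.rand(32, 784)                  # hypothetical batch of flattened 28x28 images in [0, 1]
x_recon, kl = model(x)
recon = F.binary_cross_entropy(x_recon, x, reduction='sum') / x.size(0)   # L^x: expected "compression" by the decoder
loss = recon + kl                        # total loss L = L^x + L^z, the variational bound being minimised
loss.backward()
```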
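Finally, a sketch of the 2D Gaussian-filterbank read from point 12: an N x N grid of Gaussian filters, positioned by a grid centre (gx, gy), stride delta, and width sigma, extracts a patch as F_Y · image · F_X^T. The function names are mine and the intensity term gamma is omitted; this is a simplification, not the paper’s full read/write machinery.

```python
import torch

def gaussian_filterbank(g, delta, sigma, N, size):
    # Centres of the N filters along one axis: mu_i = g + (i - N/2 + 0.5) * delta (0-based i)
    i = torch.arange(N, dtype=torch.float32)
    mu = g + (i - N / 2 + 0.5) * delta                        # (N,)
    a = torch.arange(size, dtype=torch.float32)               # pixel coordinates along this axis
    F = torch.exp(-((a[None, :] - mu[:, None]) ** 2) / (2 * sigma ** 2))   # (N, size)
    return F / (F.sum(dim=1, keepdim=True) + 1e-8)            # normalise each filter to sum to 1

def read_patch(image, gx, gy, delta, sigma, N=12):
    # image: (H, W) -> (N, N) patch of smoothly varying location (gx, gy) and zoom (delta, sigma)
    H, W = image.shape
    Fx = gaussian_filterbank(gx, delta, sigma, N, W)
    Fy = gaussian_filterbank(gy, delta, sigma, N, H)
    return Fy @ image @ Fx.t()                                # intensity term gamma omitted for brevity

patch = read_patch(torch.rand(28, 28), gx=14.0, gy=14.0, delta=1.0, sigma=1.0)
```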