Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Ioffe, Szegedy. arXiv 2015

  1. A problem with training ANNs is that as training proceeds, the distribution of inputs to each layer changes because the parameters of the preceding layers change (the paper calls this internal covariate shift).  Here they normalize the inputs to each layer over each mini-batch (a simplified, per-feature form of whitening).
  2. Trains faster, trains to better results, is itself a form of regularization so removes need for dropout in some cases
  3. Saturation occurs frequently as a result of covariate shift.  If we can avoid that then it may make it easier to train with larger learning rates
  4. “Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values. This allows us to use much higher learning rates without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for Dropout (Srivastava et al., 2014). Finally, Batch Normalization makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes.”
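  5. The normalization must be computed inside the gradient descent step, with its statistics taken over the mini-batch (so in batch mode, and not online); otherwise the gradient ignores the effect that a parameter update has on the normalization.  This can be shown both theoretically and in practice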
  6. Normalization is done for each scalar feature independently, rather than by whitening all features jointly (full whitening is computationally costly and is not differentiable everywhere, whereas the transform needs to be differentiable so gradients can pass through it)
    1. Confirmed later on:  “Thus, BN transform is a differentiable transformation that introduces normalized activations into the network.”
  7. In order to make sure the normalization doesn’t ruin the expressiveness of the layer, a constraint is that “the transformation inserted in the network can represent the identity transform”
  8. “In traditional deep networks, too-high learning rate may result in the gradients that explode or vanish, as well as getting stuck in poor local minima.”
  9. Naturally it also helps deal with scaling issues in the inputs
  10. “Moreover, larger weights lead to smaller gradients, and Batch Normalization will stabilize the parameter growth.”
  11. Because it acts as regularization, it can remove the need for dropout and reduce the need for other forms of regularization (such as L2 weight regularization).  It also makes saturating nonlinearities usable in place of ReLUs, and allows the learning rate to be decayed faster
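  12. Gets state-of-the-art results on ImageNet classification: an ensemble of batch-normalized networks reaches 4.9% top-5 validation error, which the paper reports as exceeding the accuracy of human raters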
  13. “Batch Normalization adds only two extra parameters per activation, and in doing so preserves the representation ability of the network.”
  14. The authors state that this may also help with the training problems that affect RNNs
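
To make items 6, 7, and 13 concrete, here is a minimal sketch of the training-time transform for a single scalar feature over one mini-batch. The function name `batch_norm`, the example batch, and the value of `eps` are assumptions made for illustration; `gamma` and `beta` stand for the two learned parameters per activation mentioned in the quote above. This is plain Python for readability, not how a framework would implement it.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one scalar feature over a mini-batch, then scale and shift."""
    m = len(xs)
    mean = sum(xs) / m
    # Biased (divide-by-m) variance of the mini-batch.
    var = sum((x - mean) ** 2 for x in xs) / m
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in xs]
    # gamma and beta are the two extra parameters per activation.
    y = [gamma * xh + beta for xh in x_hat]
    return y, mean, var

# Hypothetical mini-batch of activations for one feature.
batch = [1.0, 2.0, 4.0, 9.0]
eps = 1e-5

# With gamma=1, beta=0 the output has roughly zero mean and unit variance.
y, mean, var = batch_norm(batch, eps=eps)

# Choosing gamma = sqrt(var + eps) and beta = mean undoes the normalization,
# which is how the transform can represent the identity.
identity, _, _ = batch_norm(batch, gamma=math.sqrt(var + eps), beta=mean, eps=eps)
```

Because the layer can learn `gamma` and `beta`, it can recover the un-normalized activations if that turns out to be the best thing to do, so inserting the transform does not reduce what the layer can represent.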

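The claims in items 4, 9, and 10 about scale rest on a property the paper states as BN(Wu) = BN((aW)u) for a scalar a: rescaling the weights does not change the normalized output. The sketch below checks this numerically for a single scalar weight. The names `normalize`, `u`, `w`, and `a` and all the values are hypothetical, and `eps` is set very small so that the equality holds up to floating-point error.

```python
import math

def normalize(xs, eps=1e-12):
    """Subtract the mini-batch mean and divide by the mini-batch std."""
    m = len(xs)
    mean = sum(xs) / m
    var = sum((x - mean) ** 2 for x in xs) / m
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

u = [0.5, -1.0, 2.0, 3.5]  # hypothetical layer inputs, one per example
w = 0.8                    # hypothetical scalar weight
a = 100.0                  # factor by which the weight is rescaled

out = normalize([w * x for x in u])
out_scaled = normalize([a * w * x for x in u])

# Largest difference between the two normalized outputs.
max_diff = max(abs(p - q) for p, q in zip(out, out_scaled))
```

Here `max_diff` comes out negligibly small, so a 100x larger weight produces the same normalized activations. This is one way to see why the layer output, and hence the gradient flowing back through it, depends less on the scale of the parameters.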