- Backprop can be slow for multilayer nets whose error surfaces are non-convex, high-dimensional, very bumpy, or full of plateaus. It may not even converge.
- Instead of computing the gradient over the whole dataset (batch), following the gradient of individual samples (stochastic/online learning) introduces noise that can be helpful:
- Tends to be faster (especially in cases where there is redundancy in the data)
- Often converges to better solutions (the noise in the updates can bump the weights out of poor local minima)
- Can be used for tracking <nonstationary data>
- Better suited to large datasets

- Advantages for batch:
- Mathematical properties of convergence better understood (stochasticity from individual samples has less of an impact when all the data is considered together, so convergence is easier to analyze)
- Some methods (such as conj gradient) only work in batch

- Annealing schedule for learning rate helps convergence, but is an additional parameter that has to be controlled
- Can try and get the best of both worlds by doing mini batches
- Also, trying to eliminate the impact of noise on the final setting of the weights probably isn’t that important; overtraining becomes more of a concern before then
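
A minimal sketch of the mini-batch compromise with an annealed learning rate, on a toy 1-D linear regression. The function name `minibatch_sgd`, the `lr0 / (1 + decay * step)` schedule, and all default values are illustrative assumptions, not something prescribed by the paper:

```python
import random

def minibatch_sgd(xs, ys, batch_size=8, epochs=50, lr0=0.1, decay=0.01, seed=0):
    """Fit y ~ w*x + b with mini-batch SGD and a decaying learning rate."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    step = 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # reshuffle each epoch so batches differ
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of half the mean squared error over this mini-batch
            grad_w = grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                grad_w += err * xs[i]
                grad_b += err
            lr = lr0 / (1.0 + decay * step)  # assumed annealing schedule
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
            step += 1
    return w, b
```

Setting `batch_size=1` gives pure stochastic updates, and `batch_size=len(xs)` gives full batch, so the same loop covers both ends of the trade-off. The schedule adds the extra knobs (`lr0`, `decay`) mentioned above.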

- Methods of bagging or shuffling the data help when not running in batch mode. It is important to keep successive samples spread across the range of inputs, so the network doesn't overfit a small region of the input space when samples from that region are concentrated in one part of the corpus
- Proposes an informal boosting-like method (present high-error samples more often), although this can lead to overfitting the hard-to-fit samples

- Inputs should be normalized: the average of each input variable over the training set should be close to zero, and the variables should be decorrelated and scaled to have about the same covariance (whitening)
- The exception: if some inputs are known to be less important, their scale can be reduced to downweight them
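
One way to sketch the normalization steps above (center, decorrelate, equalize variance) is PCA whitening with NumPy. The function name `whiten` and the `eps` stabilizer are assumptions for illustration:

```python
import numpy as np

def whiten(X, eps=1e-8):
    """PCA-whiten X of shape (n_samples, n_features)."""
    X_centered = X - X.mean(axis=0)            # zero mean per input variable
    cov = np.cov(X_centered, rowvar=False)     # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # cov is symmetric
    # Rotate onto the principal axes (decorrelate), then rescale each
    # axis to unit variance; eps guards against tiny eigenvalues.
    return (X_centered @ eigvecs) / np.sqrt(eigvals + eps)
```

The mean and the eigenvectors should be estimated on the training set and then reused unchanged for validation and test data.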

- Sigmoids are good nonlinearities; symmetric ones (e.g. tanh) help keep activations normalized as they propagate through the network, because their outputs are more likely to average close to 0
- Tanh better than logistic (as logistic isn’t zero-centered), although now simple ReLUs are preferred
- Gives recommended constants to be used with the tanh s.t. the outputs are also likely to have variance close to 1 as the signal goes through the network (given normalized inputs)
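
The recommended form in the paper is f(x) = 1.7159 tanh(2x/3), chosen so that f(±1) = ±1. A small sketch (the function names are my own):

```python
import math

A = 1.7159       # output scale recommended in the paper
B = 2.0 / 3.0    # input scale recommended in the paper

def scaled_tanh(x):
    """Symmetric sigmoid f(x) = A * tanh(B * x)."""
    return A * math.tanh(B * x)

def scaled_tanh_grad(x):
    """Derivative f'(x) = A * B * (1 - tanh(B * x)^2)."""
    t = math.tanh(B * x)
    return A * B * (1.0 - t * t)
```

With these constants the unit maps inputs of roughly unit scale to outputs of roughly unit scale, which is what lets the normalization carry through successive layers.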

- One problem with symmetric sigmoids is that error surface can be very flat near the origin. Therefore it isn’t good to initialize with very small weights. This is also true far from the origin – adding a small linear term can help get rid of the plateaus
- For classification, it isn’t good to use binary target values (+/- 1) as it leads to instabilities.
- Because the sigmoid only reaches these values asymptotically, using them as targets drives the weights larger and larger, into the region where the sigmoid's derivative is close to zero. The weight updates then become tiny and the weights can get stuck
- Also, when the outputs end up producing values close to +/-1, there is no measure of confidence
- Target values should be at the max of the second derivative of the sigmoid

- Intermediate weights should be used for initialization, such that the sigmoid is activated in its linear region
- If weights are too small the gradients will also be too small

- Getting everything right requires that the training set be normalized, the sigmoid be chosen properly, and that weights be set correctly
- An equation is given for the standard deviation the weights should be set to: σ_w = m^(-1/2), where m is the fan-in (number of inputs to the unit)
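
A minimal sketch of that initialization rule, drawing weights with standard deviation fan_in^(-1/2). The function name, the Gaussian choice, and the `(fan_in, fan_out)` layout are assumptions for illustration; the rule itself presumes normalized inputs and the scaled tanh above:

```python
import numpy as np

def init_weights(fan_in, fan_out, seed=0):
    """Zero-mean weights with standard deviation fan_in ** -0.5."""
    rng = np.random.default_rng(seed)
    std = fan_in ** -0.5   # sigma_w = m^(-1/2), m = fan-in
    return rng.normal(0.0, std, size=(fan_in, fan_out))
```

With unit-variance, uncorrelated inputs, each unit's weighted sum then also has variance close to 1, which keeps the sigmoid in its linear region at the start of training.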

- Another issue is choosing the learning rates
- Many adaptive methods for setting the learning rates only work in batch mode, because in the online case things jump around constantly
- This is discussed more later in the paper but one idea is to use independent update values for each weight so that they converge more or less at the same rate (one way to do this is by looking at second derivatives)
- Learning rates at lower layers should be larger than those at higher layers because the second derivative of the cost function wrt weights in lower layers is usually smaller than those in the higher layers
- For conv. nets, learning rate should be proportional to sqrt(# connections sharing weight)

- Mentions one possible rule for adaptive learning rates, but the idea is that it is large away from the optimum and becomes small near it
- Some theory about learning rates – it can be computed exactly if the shape can be approximated by a quadratic. If it isn’t exactly quadratic you can use the rules as an approximation but then need to iterate on it
- The hessian is a measure of curvature of the error surface.
- The update can be written in terms of the learning rate η and the Hessian H: near a quadratic minimum the error vector is multiplied by (I − ηH) at every step, and if that matrix always shrinks a vector (all of its eigenvalues have magnitude < 1) the update will converge
- Goal then is to have different learning rates across different eigendirections, each based on that direction's eigenvalue
- If weights are coupled, H must first be rotated s.t. it is diagonal (making the coord axes line up w/ eigendirections)
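
To make the convergence condition concrete: on a quadratic E(w) = ½ wᵀHw, gradient descent multiplies w by (I − ηH) each step, so it converges only when η < 2/λ_max. The sketch below uses a made-up diagonal Hessian; `run_gd` and the specific numbers are illustrative assumptions:

```python
import numpy as np

def run_gd(H, w0, lr, steps=200):
    """Gradient descent on E(w) = 0.5 * w^T H w; returns the final |w|.

    lr may be a scalar, or a per-weight array when H is diagonal.
    """
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w = w - lr * (H @ w)   # the gradient of E is H w
    return np.linalg.norm(w)

H = np.diag([1.0, 10.0])                 # toy Hessian, already diagonal
lam_max = np.linalg.eigvalsh(H).max()

dist_stable = run_gd(H, [1.0, 1.0], lr=0.9 * 2.0 / lam_max)    # below 2/lam_max
dist_unstable = run_gd(H, [1.0, 1.0], lr=1.1 * 2.0 / lam_max)  # above 2/lam_max
# Per-direction rates 1/lambda_i reach the minimum in a single step
dist_per_weight = run_gd(H, [1.0, 1.0], lr=1.0 / np.diag(H), steps=1)
```

The per-weight trick only works directly here because H is diagonal; with coupled weights the coordinates would first have to be rotated onto the eigendirections.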

- Now going back to justifications for some tricks
- Subtract means from input vars because a nonzero mean makes a very large eigenvalue, which makes convergence very slow. Likewise, data that isn’t normalized will slow learning as well
- Decorrelating the input variables makes the method of different learning rates per weight optimal
- Whitening the signal can help make the energy surface at the optima spherical which helps convergence
- Talk about newton updates <but I think this isnt used in practice? so im not taking notes>
- Newton and gradient descent are different, but if you whiten they are the same <?>
- Conj Gradient:
- O(N)
- Doesn’t use explicit Hessian
- Tries “…to find descent directions that try to minimally spoil the result achieved in the previous iterations”
- Uses line search <?>. Ah, given a descent direction (the gradient), just minimize along this
- Only batch
- conj directions are orthogonal in space of identity hessian matrix
- Good line search method is critical
- Can be viewed as a clever way of choosing the momentum term
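
A sketch of conjugate gradient on a purely quadratic objective, where the line search has a closed form. This is the linear special case with the Fletcher-Reeves coefficient, not the nonlinear variant a real network would need; the function name and tolerance are assumptions:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Minimize 0.5 * x^T A x - b^T x for symmetric positive-definite A."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x              # negative gradient at x
    d = r.copy()               # first direction is steepest descent
    for _ in range(len(b)):    # at most N steps in exact arithmetic
        rr = r @ r
        if np.sqrt(rr) < tol:
            break
        Ad = A @ d             # A is only ever used through products
        alpha = rr / (d @ Ad)  # exact line search along d
        x = x + alpha * d
        r = r - alpha * Ad
        beta = (r @ r) / rr    # Fletcher-Reeves coefficient
        d = r + beta * d       # new direction, conjugate to the earlier ones
    return x
```

The `beta * d` term is where the momentum interpretation comes from: each new direction is the current negative gradient plus a weighted copy of the previous direction. For a non-quadratic error surface the closed-form `alpha` has to be replaced by a numerical line search, which is why the quality of that search matters so much.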

- Quasi-Newton (BFGS)
- Iteratively computes estimate of inverse Hessian
- O(N^2) – memory as well, so only applicable to small networks <this is what Hessian free fixes right?>
- Reqs line search
- Batch

- <I think much of the rest of the paper is less relevant now than when it was written back then so skimming>
- Tricks for computing the Hessian
- Large eigenvalues in the Hessian cause problems during training. Their causes:
- Non-zero mean inputs
- Wide variation of 2nd derivatives
- Correlation between state vars

- Also between layers, the Hessian at first layer is pretty flat but becomes pretty steep by the last layer
- “From our experience we know that a carefully tuned stochastic gradient descent is hard to beat on large classification problems.”
- There are methods for estimating the principal eigenvalues/vectors of the Hessian w/o actually computing the Hessian
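
One such method is power iteration where the Hessian-vector product is approximated by a finite difference of gradients, Hv ≈ (∇E(w + αv) − ∇E(w)) / α. A rough sketch, assuming a user-supplied `grad_fn` and a Hessian whose dominant eigenvalue is positive; the names and defaults are illustrative:

```python
import numpy as np

def top_hessian_eigenvalue(grad_fn, w, iters=100, alpha=1e-4, seed=0):
    """Estimate the largest Hessian eigenvalue at w without forming H."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=w.shape)
    v /= np.linalg.norm(v)
    g0 = grad_fn(w)
    eig = 0.0
    for _ in range(iters):
        # Finite-difference Hessian-vector product: two gradient evaluations
        Hv = (grad_fn(w + alpha * v) - g0) / alpha
        eig = np.linalg.norm(Hv)   # converges to the dominant eigenvalue
        v = Hv / eig               # renormalize for the next iteration
    return eig, v
```

For example, with `grad_fn = lambda w: H @ w` for a fixed symmetric matrix `H`, the estimate should approach the largest eigenvalue of `H`. The result gives a handle on the largest usable learning rate, since the step size has to stay below 2 divided by that eigenvalue.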
