## Neural Networks for Machine Learning. Hinton. Coursera Lectures 2012.

Lecture 4

Learning to Predict the Next Word

• Using backprop to learn a feature representation of the meaning of a word
• Turns relational information into a feature vector
• Gives example of learning a family tree, by describing relationships verbally (james has-wife victoria)
• Furthermore, given: (colin has-father james) and (colin has-mother victoria), then it should also learn (james has-wife victoria)
• Likewise, further inference can be done from this inferred property that (james has-wife victoria)
• So the relational learning task is figuring out the regularities
• Traditionally this is done with formal logical rules, but searching over all possible sets of rules is intractable
• An alternate approach is to use ANNs to find them
• There is a particular 5-layer architecture that was hand-designed for this task
• The input layer has 1 bit for each person and 1 bit for each type of relationship, so each input has 2 bits on, one in each subvector. The output is a bit vector that encodes the person who answers the relation queried in the input
• The second layer has only 6 units for the person, so the 24-bit one-hot person vector is projected into a smaller, less sparse representation
• The discussion of this is unclear, but the claim is that one of these units represents whether the person is from the English or the Italian group (the two groups share no relationships with each other but have identical relationship structure)
• A second unit’s weights represent the generation of the person
• A third unit encodes which half of the tree the person lives in
• All this encoding is done automatically and implicitly
• All this encoding is only useful if the later levels in the network leverage the encoding in a reasonable way
• In a 4-hold-out training task, it gets 2 or 3 right on average, which is way above chance (which is 1/24)
• That was all stuff from the 80s
• Another option is to have the input be 2 people and their relationship and then predict whether that was correct or not
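
The input encoding described above can be sketched in a few lines. This is only an illustration: the short lists of people and relations, the random weights, and the helper names (`one_hot`, `encode_query`, `project`) are assumptions made for the example, not the lecture's actual network (which has 24 people and learns its weights with backprop).

```python
import random

# Hypothetical subset of people and relations; the lecture's task has 24 people.
PEOPLE = ["james", "victoria", "colin", "charlotte"]
RELATIONS = ["has-wife", "has-husband", "has-father", "has-mother"]

def one_hot(item, vocabulary):
    """Return a list with a single 1 at the position of `item`."""
    return [1 if v == item else 0 for v in vocabulary]

def encode_query(person, relation):
    """Concatenate the person and relation subvectors: exactly 2 bits are on."""
    return one_hot(person, PEOPLE) + one_hot(relation, RELATIONS)

def project(one_hot_vec, weights):
    """Multiply a one-hot vector by a weight matrix (one row per input unit).

    Because only one input is on, the result is simply the matching row,
    i.e. a small dense code for that person.
    """
    n_hidden = len(weights[0])
    return [sum(x * row[k] for x, row in zip(one_hot_vec, weights))
            for k in range(n_hidden)]

rng = random.Random(0)
N_HIDDEN = 6  # size of the distributed person code
# Random stand-in weights; in the real network these are learned.
person_weights = [[rng.uniform(-1, 1) for _ in range(N_HIDDEN)] for _ in PEOPLE]

query = encode_query("colin", "has-father")
person_code = project(one_hot("colin", PEOPLE), person_weights)
target = one_hot("james", PEOPLE)  # desired output: the person who answers the query
```

Since the input is one-hot, the "projection" is really a table lookup of a learned 6-number code per person, which is why the hidden units can end up encoding things like nationality or generation.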

A Brief Diversion into Cognitive Science (worth skipping)

• Do we represent concepts as sets of features (feature theory), or by their relationships to other things (structuralist)? <isn’t the former a form of the latter?>
• <Oh yes> Goes on to say both are wrong because ANNs can use vectors of semantic features to make a relational graph
• The way to do this is a distributed representation (as opposed to trying to use the topology of the ANN to recreate the topology of the relationships concretely)

Another Diversion: The Softmax Output Function

• How to normalize the output so it corresponds to a distribution over outputs
• (Digresses into why squared error isn’t good: basically it can leave backprop with almost no gradient. Squared error is also a problem when the answers need to sum to 1)
• Instead, use softmax: y_i = e^{z_i} / \sum_j e^{z_j}, i.e. each unit’s exponentiated input z_i divided by the sum over all units in its softmax group. Because of the normalization, the outputs sum to 1 and each lies between 0 and 1.
• It also has a nice derivative which is important
• So if the squared error is the wrong thing to use when trying to represent a distribution over options, the cross-entropy is the right cost function (negative log probability of the right answer): C = -\sum_j t_j \log y_j
• Maximize log-probability of getting the answer right
• Has a big gradient when the right answer is 1 and the output is close to 0 (this is the opposite of what happens with standard squared error)
• Applying the chain rule to the cross-entropy through the softmax, the derivative with respect to the input z_i of output unit i is just \partial C / \partial z_i = y_i - t_i
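
A minimal sketch of softmax with cross-entropy, plus a finite-difference check that the gradient with respect to each logit really is y_i - t_i. The function names and the example logits are assumptions chosen for illustration; the max-subtraction inside `softmax` is a standard numerical-stability trick and not something the notes mention.

```python
import math

def softmax(logits):
    """y_i = exp(z_i) / sum_j exp(z_j), shifted by max(z) for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, targets):
    """C = -sum_j t_j * log(y_j), with y = softmax(logits)."""
    y = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(targets, y) if t > 0)

def analytic_grad(logits, targets):
    """dC/dz_i = y_i - t_i (valid when the targets sum to 1)."""
    return [p - t for p, t in zip(softmax(logits), targets)]

def numeric_grad(logits, targets, eps=1e-6):
    """Central finite differences, used only to check the analytic form."""
    grads = []
    for i in range(len(logits)):
        up = logits[:i] + [logits[i] + eps] + logits[i + 1:]
        down = logits[:i] + [logits[i] - eps] + logits[i + 1:]
        grads.append((cross_entropy(up, targets) - cross_entropy(down, targets)) / (2 * eps))
    return grads

z = [2.0, -1.0, 0.5]   # illustrative logits
t = [0.0, 1.0, 0.0]    # one-hot target: class 1 is the right answer
y = softmax(z)
grad = analytic_grad(z, t)
```

Here the right answer gets a small output, so its gradient component is close to -1: the steep slope that squared error on a saturated unit would not provide.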

Neuro-probabilistic Language Models

• Identifying individual phonemes is difficult in natural speech
• Because the task itself is fundamentally hard, a good way to improve performance is to use statistics about which sounds are likely to precede/follow, which can mean guessing which words are likely to come next
• The standard method of prediction is trigrams: count how often each triple a-b-c occurs, then, given a,b, compare the probability of c versus d. Going to 4-grams leads to data that is just too sparse
• The trigram model, however, fails to generalize past the exact data it has seen. For example, a sentence that has the word ‘cat’ in it would probably still make sense if that word was replaced with ‘dog’, even though the sentence with ‘dog’ in it was never seen
• If words are replaced with a vector of semantic/syntactic features, this sort of replacement becomes easy
• This sort of feature representation also lets us keep track of a longer history than the 3-word window of trigrams (for example, 10 words is not hard)
• This is Bengio’s approach
• The structure of the ANN used in this practical example is identical to the one used earlier in the family tree example (inputs are long bit vectors where only 1 bit is high corresponding to a particular word, same as where the bit corresponded to a person in the family tree)
• The output is now a distribution over next possible words, through a giant softmax corresponding to every word considered
• The second layer (compressed dense vector) is also connected directly to the output
• That approach originally worked slightly worse than trigrams but now works much better after later improvements
• A problem with this is the giant softmax is working over a huge number of possible options, so we need tons of weights and there is a risk of overfitting, unless data is vast
• This particular issue is addressed next
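
To make the trigram limitation concrete, here is a small count-based sketch. The two-sentence corpus and the function names are made up for the example, and no smoothing or back-off is applied, so it is deliberately the most naive version of the idea.

```python
from collections import Counter, defaultdict

def train_trigrams(sentences):
    """Count how often each word follows each pair of preceding words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for a, b, c in zip(words, words[1:], words[2:]):
            counts[(a, b)][c] += 1
    return counts

def next_word_prob(counts, a, b, c):
    """Relative-frequency estimate of P(c | a, b); 0.0 if (a, b) was never seen."""
    following = counts.get((a, b))
    if not following:
        return 0.0
    return following[c] / sum(following.values())

# Toy corpus, invented for illustration.
corpus = [
    "the cat sat on the mat",
    "the cat got stuck in the tree",
]
counts = train_trigrams(corpus)

p_seen = next_word_prob(counts, "the", "cat", "sat")    # context seen twice, "sat" once
p_unseen = next_word_prob(counts, "the", "dog", "sat")  # "the dog" never occurred
```

The model knows nothing about "the dog" even though "dog" should behave much like "cat"; a feature-vector representation of words is what lets the neural model share statistics between the two.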

Ways to Deal with the Large Number of Possible Outputs

• Instead of having a ton of output weights (one for each word), add an extra vector in the inputs for possible following words, and then the output should be the probability of that word being next
• So that gives logits for particular words, then combine all those in a softmax outside the ANN after all the reasonable possibilities have been attempted
• Again, this gives us the cross-entropy error derivatives
• <Moves onto another option, probably can skip>
• Alternately, can do word prediction by predicting a path through a tree
• The nice thing is that you naturally only consider log(N) things instead of N
• “Unfortunately, it is still slow at test time.” Because you still have to run many different candidate words through to get a distribution as opposed to a single prediction
• Another option is to look at both past and future words – the goal is just to build a good representation, but use the old setting, not the tree-based one (again, more bit vectors are just added into the input)
• This seems to give good results, captures contextual information well
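
The tree-based option can be sketched as follows, assuming a complete binary tree with words at the leaves and a logistic "go right" decision at each internal node. The heap-style node numbering and the fixed `node_scores` are assumptions for the example; in a real model each node's score would be computed from the context's feature vector and a learned vector stored at that node.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def path_to_leaf(word_index, depth):
    """Return [(node_id, go_right), ...] for a complete binary tree.

    Nodes are numbered heap-style: the root is 1, children of n are 2n and 2n+1.
    The bits of word_index (most significant first) choose left (0) or right (1).
    """
    node, path = 1, []
    for level in reversed(range(depth)):
        go_right = (word_index >> level) & 1
        path.append((node, go_right))
        node = 2 * node + go_right
    return path

def word_probability(word_index, depth, node_scores):
    """Multiply the branch probabilities along the path to one word."""
    prob = 1.0
    for node, go_right in path_to_leaf(word_index, depth):
        p_right = sigmoid(node_scores[node])
        prob *= p_right if go_right else 1.0 - p_right
    return prob

DEPTH = 3  # 2**3 = 8 words in this toy vocabulary
# Hypothetical per-node scores standing in for learned quantities.
node_scores = {n: 0.3 * n - 1.0 for n in range(1, 2 ** DEPTH)}

probs = [word_probability(w, DEPTH, node_scores) for w in range(2 ** DEPTH)]
path_for_word_5 = path_to_leaf(5, DEPTH)
```

Scoring one known word touches only `DEPTH` nodes, which is the log(N) saving during training. Building the full list `probs` still visits every word, which matches the remark that getting a whole distribution at test time stays slow.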

Lecture 5

Why Object Recognition is Difficult

• We are so good at it, it’s hard to understand why it’s hard
• Definitions of objects are hard and require lots of knowledge; there are also variations in perspective, lighting (pixel intensity depends as much on lighting as on the object), and scale
• Just doing segmentation is hard – we have 2 eyes and motion to do stereopsis to help
• Occlusion makes job hard
• Same object can look very different (for example, written numeral 2 or 4)
• Many objects are defined more by what they are used for than by what they look like
• You sit in a chair, but modern and classic chairs can look wildly different (e.g. a kneeling chair); you have to know that the thing is meant to be sat on
• Viewpoint changes cause “dimension hopping”
• Gives the example of age and weight randomly changing locations in a medical database

Ways to Achieve Viewpoint Invariance

• A few common approaches:
• Use redundant invariant features
• Box objects and normalize pixels
• Replicated features with pooling (convolutional neural nets – lecture 5c)
• Hierarchy of parts that have explicit poses relative to camera (lecture 5e)
• The invariant feature approach says to extract a large, redundant set of features that are invariant under transformations
• With enough invariant features, there’s only one way to put them together into an image (relationships between features are automatically captured by other features due to multiple overlaps)
• Need to avoid features that are parts of objects
• “Judicious normalization approach” or boxing/normalizing objects
• Solves “dimension hopping” if the box is always done correctly
• Can even correct for shear, stretch
• Boxing, however, is difficult, for example because of occlusion and segmentation errors
• You need to know what the shape is in order to box it right, which is the problem we were looking to solve already
• “Brute force normalization [boxing]”
• Use very clean data for training, so boxing can be done accurately and cleanly
• At test time the data is noisier and less clean, so try many candidate boxes over a range of positions and scales
• Important that the network can tolerate some sloppiness in the boxing so more coarse/less accurate boxing can be done at test time
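
A rough sketch of what "boxing and normalizing" means, under strong simplifying assumptions: a binary image stored as nested lists, the box taken as the tight bounding box of the non-zero pixels, and nearest-neighbour resampling to a fixed size. All names and the tiny test shape are invented for the example, and it assumes the image contains at least one non-zero pixel.

```python
def bounding_box(image):
    """Return (top, left, bottom, right) of the non-zero pixels, inclusive."""
    rows = [r for r, row in enumerate(image) if any(row)]
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    return rows[0], cols[0], rows[-1], cols[-1]

def normalize(image, size):
    """Crop to the bounding box and resample to size x size (nearest neighbour)."""
    top, left, bottom, right = bounding_box(image)
    height, width = bottom - top + 1, right - left + 1
    return [[image[top + (r * height) // size][left + (c * width) // size]
             for c in range(size)]
            for r in range(size)]

def shift(image, dr, dc):
    """Translate the image content by (dr, dc), padding with zeros."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if image[r][c] and 0 <= r + dr < h and 0 <= c + dc < w:
                out[r + dr][c + dc] = image[r][c]
    return out

# A tiny "L" shape on an 8x8 canvas (made-up data for illustration).
canvas = [[0] * 8 for _ in range(8)]
for r in range(1, 5):
    canvas[r][1] = 1
canvas[4][2] = 1
canvas[4][3] = 1

moved = shift(canvas, 2, 3)
a = normalize(canvas, 4)
b = normalize(moved, 4)
```

The raw pixel grids `canvas` and `moved` differ, but the normalized versions `a` and `b` are identical, which is the sense in which correct boxing removes "dimension hopping". The hard part that the sketch sidesteps is finding the box in a cluttered or occluded image.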

Convolutional Nets for Digit Recognition

• One of the first big successes of ANNs in the 80s – LeCun
• Convnets are based on replicated features (useful because there is no reason a feature detector should be stuck observing only part of the image)
• On the other hand, replicating across scale and orientation is much more difficult and expensive
• Replication across position drastically reduces the hypothesis space, makes learning easier/faster
• Backprop works nicely in this setting
• More specifically, if weights start out satisfying a linear constraint (for example, two replicated weights being equal), that constraint can be maintained throughout training
• Basically, you apply the same combined update (the sum or average of the recommended changes) to all of the coupled weights
• What do replicated features achieve?
• *Not* translation invariance
• “Equivariance” not “Invariance” – that is, a change in the image (translation) causes a corresponding translation in the representation as well
• The knowledge, however, is invariant
• Equivariance in activities and invariance in weights
• You can get some translational invariance by averaging (or max-ing) neighboring replicated features
• Reduces inputs to next layer, so that smaller input set can be analyzed by more features
• This, however, means that precise information about where things are is ultimately lost, which may or may not be a problem (you can probably recognize that there is a face in the image, but not whose face it is, since that depends on where the eyes are relative to the nose)
• LeNet
• Many hidden layers
• Many maps of replicated units at each layer
• Pooling of outputs of nearby replicated units
• Input was wide, and could deal with several characters at once (instead of just one), even if there was overlap
• Did a whole training system, not just an algorithm for the ANN
• I remember learning that they got more mileage out of their data by translating, rotating, and adding noise to the images and training on those as well
• Did maximum-margin training before it was a thing
• At one point read 10% of checks in N. America – I think it was also used by the post office
• http://yann.lecun.com
• Can leverage prior knowledge about the task in a number of ways (this was done in LeNet5)
• Connectivity, weight constraints, neuron activation functions
• Less heavy-handed (so probably more robust) than manually creating features, but still reduces the hypothesis space so learning can be done more quickly
• Our best idea of how to do object recognition is to have feature detectors that are pooled and gradually cover more and more of the image – certain topologies make that happen
• Can use various forms of simulation/bagging (“synthetic data”) in order to get more training done. This makes more sense as computational power increases, and the algorithm may be able to do better without some of the topological constraints that were designed into the system to speed up training
• Synthetic data prevents overfitting even on very large nets because corpus size is increased
• Various tricks w/ synthetic data got the error count on the classic handwriting dataset down from about 80 errors to about 40; boosting/bagging got it down to 25
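
The difference between equivariance and invariance can be seen in a tiny 1-D example. This is a sketch under stated assumptions: the two-weight `kernel`, the 9-pixel "images", and the helper names are made up for illustration, and real convnets of course work on 2-D images with learned weights.

```python
def replicated_feature(signal, kernel):
    """Apply the same weights (kernel) at every position of a 1-D input."""
    k = len(kernel)
    return [sum(w * signal[i + j] for j, w in enumerate(kernel))
            for i in range(len(signal) - k + 1)]

def max_pool(activities, width):
    """Take the max over non-overlapping groups of neighbouring units."""
    return [max(activities[i:i + width])
            for i in range(0, len(activities) - width + 1, width)]

# Hypothetical detector: responds with +1 where a 1 is followed by a 0.
kernel = [1, -1]

image         = [0, 0, 1, 0, 0, 0, 0, 0, 0]
image_shifted = [0, 0, 0, 1, 0, 0, 0, 0, 0]  # same pattern, one pixel to the right

acts = replicated_feature(image, kernel)
acts_shifted = replicated_feature(image_shifted, kernel)

pooled = max_pool(acts, 2)
pooled_shifted = max_pool(acts_shifted, 2)
```

The activities are equivariant: `acts_shifted` is `acts` moved one place to the right, while the weights in `kernel` are unchanged. After pooling over pairs, the two outputs coincide for this one-pixel shift, giving a small amount of translation invariance; a shift that crosses a pooling boundary would still change the pooled output.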

Convolutional Nets for Object Recognition

• Could the tricks for handwritten digits be used for robust naturalistic image detection?
• Basic issues are:
• Many more classes of objects
• Much larger color images
• 2D projection of 3D scene, so much information missing
• Occlusion
• Multiple different objects in the same scene
• Differences in lighting
• Problem is so big, reducing hypothesis space by engineering topology is useful
• ILSVRC-2012 Imagenet competition:
• 1.2 million high-res images
• Guess category (of 1000 possible) – allowed to list top 5 guesses
• Or just to accurately put a bounding box on the object
• More classical approaches tried on the problem don’t learn all the way through <in particular, the features themselves are usually hand-constructed (like SIFT), whereas in ANNs they are learned>. Classical methods only learn at the last of a number of processing steps
• Error rates of the neural nets are half of what the other classical methods got – everything else is hovering between 25% and 30% wrong, the ANN got 16% wrong
• Topology of winning network:
• A very deep (7 hidden, with some additional pooling layers) convolutional neural net
• Early layers were convolutional (shared weights)
• Last two layers are globally connected (what’s that?) <I guess this means fully connected, rather than locally connected like the convolutional layers>
• Hidden layers use “rectified linear units” as activation functions; these train much faster and are more expressive than logistic units <why is that – I thought the opposite?>
• Rectified linear units are now considered the way to go
• Competitive normalization – suppresses hidden activities when nearby units have stronger activity – helps with variations in intensity
• You don’t want something to influence results much if it is weak and there are much stronger things nearby
• Other tricks
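• Train on random, slightly smaller patches (i.e. slight translations) of the training images, plus their mirror reflections
• At test time, combine the predictions from 10 patches of the same size as the training patches: the four corners, the center, and the mirror reflections of those five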
• Uses a method called “dropout” to regularize weights globally and prevent overfitting
• Is sort of a means of bagging – for each training example half of the hidden units are removed at random.
• Makes it more robust, each node can rely less on idiosyncrasies of other nodes
• Implemented on GPUs (just a pair, about 1000 cores total) – led to a 30x speedup
• Can compute activations between layers extremely quickly by matrix multiplication
• Fast memory access (GPU boards use GDDR5 while CPUs are on DDR3)
• On-Die cache for normal processor isn’t so helpful because data used frequently is larger than cache size
• Training took a week
• <From separate sources I know the cat classifier at Google worked better on standard PC cores (16,000 cores, 1.7 billion connections).  Hinton thinks the future of ANNs is on GPUs though>
• “As cores get cheaper and datasets get bigger, big neural nets will improve faster than old-fashioned (i.e. pre Oct 2012) computer vision systems.”
• Can rely less on hand engineering more on data, less likely to overfit, can consider larger hypothesis space
• He is convinced that deep nets have closed the book on other methods
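
A minimal sketch of the dropout idea mentioned above: during training, each hidden unit is dropped with probability 0.5 for each example. The test-time rule used here (keep every unit but scale its activity by the keep probability) is one common convention and is an assumption of this sketch, as are the function names and the made-up activities; the notes do not say how the winning network handled test time.

```python
import random

def dropout_train(activities, rng, drop_prob=0.5):
    """Zero each hidden activity independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else a for a in activities]

def dropout_test(activities, drop_prob=0.5):
    """Use every unit at test time, scaled by the probability it was kept.

    The scaling keeps the expected input to the next layer the same as
    it was during training (an assumed convention, see the lead-in).
    """
    keep = 1.0 - drop_prob
    return [a * keep for a in activities]

rng = random.Random(0)
hidden = [1.0] * 1000  # made-up hidden activities for illustration

trained_view = dropout_train(hidden, rng)
kept_fraction = sum(1 for a in trained_view if a != 0.0) / len(hidden)
test_view = dropout_test(hidden)
```

Each training example sees a different random half of the hidden units, so no unit can rely on a particular partner being present, which is the "sort of bagging" intuition in the notes.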