- Deals with one-shot learning
- “…a Hierarchical Bayesian model based on compositionality and causality that can learn a wide range of natural (although simple) visual concepts, generalizing in human-like ways from just one image.”
- In 1-shot learning it did about as well as people, and better than deep learning methods
- People can learn a new concept from a tiny amount of data – they can learn a class from one image (which is a very high-dimensional piece of data)
- Even MNIST, which is an old and small dataset, still has 6k samples/class, but people often only need 1 example
- “Additionally, while classification has received most of the attention in machine learning, people can generalize in a variety of other ways after learning a new concept. Equipped with the concept “Segway” or a new handwritten character (Figure 1c), people can produce new examples, parse an object into its critical parts, and fill in a missing part of an image. While this flexibility highlights the richness of people’s concepts, suggesting they are much more than discriminative features or rules, there are reasons to suspect that such sophisticated concepts would be difficult if not impossible to learn from very sparse data. ”
- Looks like people both have a rich hypothesis space (because they can do all the above with a small amount of data), but also don’t overfit, which is the theoretical downside to having a large hypothesis class. How do they do it?
- Here, focus is on handwritten characters
- Idea is to use something more natural and complex than simple synthetic stimuli, but something less complex than natural images
- Uses the new Omniglot dataset, which has over 1,600 characters with 20 examples each
- Also has time data so the strokes are recorded as well
- “… this paper also introduces Hierarchical Bayesian Program Learning (HBPL), a model that exploits the principles of compositionality and causality to learn a wide range of simple visual concepts from just a single example.”
- Also use the method to generate new examples of a class, and then do a Turing test with it by asking other humans which was human generated and which was machine generated
- The HBPL “…is compositional because characters are represented as stochastic motor programs where primitive structure is shared and re-used across characters at multiple levels, including strokes and sub-strokes.”
- The model attempts to find a “structural description” that explains the image by breaking the character down into parts
- A character is made of (see the code sketch after this list):
- A set of strokes
- Each stroke is made of simple sub-strokes; each sub-stroke is modeled by a “uniform cubic b-spline” and drawn from a library of primitive motor elements whose sequence follows a 1st-order Markov process
- Set of spatial relationships between strokes, can be:
- Independent: A stroke that has a location independent of other strokes
- Start/end: A stroke that starts at beginning/end of another stroke
- Along: A stroke that starts somewhere along a previous stroke
- Token-level variables: the per-image realization of the character type
- “Each trajectory … is a deterministic function of a starting location … token-level control points … and token-level scale …. The control points and scale are noisy versions of their type-level counterparts…”
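- A minimal sketch of that type-level / token-level structure (the names, priors, and the primitive-library format here are my own placeholders, not the paper's actual parameterization):

```python
import numpy as np

# Sketch only: each primitive in primitive_library is assumed to be a dict
# with "ctrl" (spline control points, an ndarray) and "scale" (a float).
RELATION_TYPES = ["independent", "start", "end", "along"]

def sample_character_type(primitive_library, transition_matrix, rng):
    """Type level: sample strokes (sequences of sub-stroke primitives) and relations."""
    n_strokes = int(rng.integers(1, 4))          # placeholder prior on stroke count
    strokes, relations = [], []
    for i in range(n_strokes):
        n_sub = int(rng.integers(1, 4))          # placeholder prior on sub-strokes per stroke
        sub_ids = [int(rng.integers(len(primitive_library)))]
        for _ in range(n_sub - 1):               # first-order Markov chain over primitives
            probs = transition_matrix[sub_ids[-1]]
            sub_ids.append(int(rng.choice(len(primitive_library), p=probs)))
        strokes.append([primitive_library[j] for j in sub_ids])
        # each stroke after the first attaches to earlier strokes via one relation type
        relations.append("independent" if i == 0 else str(rng.choice(RELATION_TYPES)))
    return strokes, relations

def sample_token(strokes, rng, noise=0.02):
    """Token level: noisy copies of the type-level spline control points and scales."""
    token = []
    for stroke in strokes:
        token.append([{"ctrl": prim["ctrl"] + noise * rng.standard_normal(prim["ctrl"].shape),
                       "scale": prim["scale"] * float(np.exp(noise * rng.standard_normal()))}
                      for prim in stroke])
    return token
```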
- Used 30 most common alphabets for training, and another 20 for evaluation. The training set was used to learn hyperparameters, a set of 1000 primitive motor elements, and stroke placement. They attempted to do cross-validation within the training set
- The full set of possible ways a stroke could be created is enormous, so they have a bottom-up way of finding a set of the K most likely parses. They approximate the posterior with this finite, size-K sample, weighting each parse by its relative likelihood
- They then actually use Metropolis-Hastings to draw a number of samples around each parse, each with a little variance, to get a better estimate of the likelihoods
- “Given an approximate posterior for a particular image, the model can evaluate the posterior predictive score of a new image by re-fitting the token-level variables…”
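- A rough sketch of how those K parses are turned into an approximate posterior and used to score a new image (the softmax weighting and the `refit_fn` interface are my assumptions):

```python
import numpy as np

def parse_weights(log_joint_scores):
    """Approximate posterior over the K discovered parses: a softmax
    over each parse's joint score with the observed image."""
    s = np.asarray(log_joint_scores, dtype=float)
    w = np.exp(s - s.max())
    return w / w.sum()

def posterior_predictive_score(parses, weights, new_image, refit_fn):
    """Posterior predictive log-score of a new image: re-fit the token-level
    variables of each parse to the new image (refit_fn returns a log-likelihood)
    and combine the results, weighted by the approximate posterior."""
    scores = np.array([refit_fn(parse, new_image) for parse in parses])
    m = scores.max()
    return float(np.log(np.sum(weights * np.exp(scores - m))) + m)
```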
- Results
- For the 1-shot tasks, a letter from an alphabet was presented with 20 other letters from the same alphabet. Each person did this 10 times, but each time was with a totally new alphabet, so no character was ever seen twice
- Get K=5 parses of each character presented (along with MCMC), and then run K gradient searches to reoptimize the token-level variables to fit the query image.
- They can also, however, attempt to reoptimize the query image to fit the 20 options presented
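- A sketch of the resulting 20-way classification rule with that two-way re-fitting (the exact way the two directions are combined is my assumption):

```python
import numpy as np

def one_shot_classify(test_image, training_images, score_fn):
    """20-way one-shot classification: pick the training example whose motor
    program best explains the test image. score_fn(a, b) is assumed to return
    the posterior predictive log-score of image b under the parses of image a."""
    scores = []
    for train_image in training_images:
        forward = score_fn(train_image, test_image)   # fit test image to parses of the example
        backward = score_fn(test_image, train_image)  # re-fit the example to parses of the test image
        scores.append(forward + backward)             # symmetric combination (my assumption)
    return int(np.argmax(scores))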
- Compare against:
- Affine model
- Deep Boltzmann Machines
- Hierarchical Deep Model
- Simple Strokes (a simplified HBPL)
- NN
- Humans and HBPL ~4.5% error rate, affine model next at 18.2%
- Then they did one-shot Turing test where people and algorithms had to copy a single query character
- <For what it's worth, I think the affine model's output looks better than the results from both people and HBPL>
- In the “Turing test” there was feedback after each 10 trials, for a total of 50 trials
- <Note that this test doesn’t ask which character looks best, it is which is most confusable with human writing (which is pretty sloppy from the images they show). I’m curious if the affine model could be made more human just by adding noise to its output>
- <Playing devil’s advocate, the images of characters were collected on mTurk, and look like they were probably drawn with a mouse — that is to say I feel they don’t look completely like natural handwriting. I wonder how much of this program is picking up on those artifacts? At least in terms of reproduction, the affine method looks best>
Science 2015
- “Concepts are represented as simple probabilistic programs—that is, probabilistic generative models expressed as structured procedures in an abstract description language (…). Our framework brings together three key ideas—compositionality, causality, and learning to learn—that have been separately influential in cognitive science and machine learning over the past several decades (…). As programs, rich concepts can be built “compositionally” from simpler primitives…”
- “In short, BPL can construct new programs by reusing the pieces of existing ones, capturing the causal and compositional properties of real-world generative processes operating on multiple scales.”
- <Looks like exactly the same paper, just more brief. The accuracies of both BPL and other methods seems improved here, though. Convnets get 13.5% error; BPL gets 3.3%; people get 4.5%. “A deep Siamese convolutional network optimized for this one-shot learning task achieved 8.0% errors”>
- “BPL’s advantage points to the benefits of modeling the underlying causal process in learning concepts, a strategy different from the particular deep learning approaches examined here.”
- <Or equivalently you can just say BPL does better because it has a small and highly engineered hypothesis class>
- They also ran BPL with various “lesions” and got error rates in the teens; the lesioned models also did more poorly on the “Turing test” part
- Instead of training on 30 background alphabets, they also tried just 5; there BPL's error rate is about 4%, while convnets got about 20% error on the same set
Supplementary Material
- <I assumed that they would ask individuals who actually learned how to write the languages to do the recordings. Instead, they just took pictures of characters and had people write them. This seems like a problem to me because of inconsistencies in the way people would actually do the strokes of a letter in an alphabet they do not know.>
- <Indeed, they were also drawn by mouse in a box on a screen, which is a very unnatural way to do things>
- <From what I can tell the characters are also recorded in pretty low resolution (looks like 105×105), which can cause artifacts>
- <This basically has the details that were included in the main part of the NIPS paper>
- Some extra tricks in the image model, like convolving with a Gaussian filter and randomly flipping bits
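- A toy version of that image noise model, i.e. blur the rendered ink with a Gaussian filter and allow random pixel flips (the parameter values and function names are placeholders):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def render_noisy_image(ink_image, blur_sigma=0.5, flip_prob=1e-4, rng=None):
    """ink_image: binary array of rendered trajectory pixels.
    Blur it, then sample each pixel as an independent Bernoulli that can flip."""
    rng = rng or np.random.default_rng()
    prob_on = gaussian_filter(ink_image.astype(float), sigma=blur_sigma)
    prob_on = np.clip(prob_on, flip_prob, 1.0 - flip_prob)  # floor/ceiling acts as flip noise
    return (rng.random(prob_on.shape) < prob_on).astype(np.uint8), prob_on

def image_log_likelihood(binary_image, prob_on):
    """Bernoulli log-likelihood of an observed binary image under the blurred ink map."""
    return float(np.sum(binary_image * np.log(prob_on) +
                        (1 - binary_image) * np.log(1 - prob_on)))
```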
- Primitives are scale-selective
- “For each image, the center of mass and range of the inked pixels was computed. Second, images were grouped by character, and a transformation (scaling and translation) was computed for each image so that its mean and range matched the group average.”
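- A sketch of that normalization, matching each image's ink center and range to the group average (per-axis handling is my assumption; this returns the transforms rather than applying any interpolation):

```python
import numpy as np

def ink_stats(img):
    """Center of mass and range (bounding-box extent) of the inked pixels."""
    ys, xs = np.nonzero(img)
    center = np.array([ys.mean(), xs.mean()])
    extent = np.array([np.ptp(ys) + 1, np.ptp(xs) + 1])
    return center, extent

def normalize_group(images):
    """Compute a per-image (scale, translation) so each image's ink center
    and range match the average over all images of the same character."""
    stats = [ink_stats(img) for img in images]
    mean_center = np.mean([c for c, _ in stats], axis=0)
    mean_extent = np.mean([e for _, e in stats], axis=0)
    transforms = []
    for center, extent in stats:
        scale = mean_extent / extent            # per-axis scaling
        shift = mean_center - scale * center    # translation applied after scaling
        transforms.append((scale, shift))
    return transforms
```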
- ” In principle, generic MCMC algorithms such as the one explored in (66) can be used, but we have found this approach to be slow, prone to local minima, and poor at switching between different parses. Instead, inspired by the speed of human perception and approaches for faster inference in probabilistic programs (67), we explored bottom-up methods to compute a fast structural analysis and propose values of the latent variables in BPL. This produces a large set of possible motor programs – each approximately fit to the image of interest. The most promising motor programs are chosen and refined with continuous optimization and MCMC.”
- “A candidate parse is generated by taking a random walk on the character skeleton with a “pen,” visiting nodes until each edge has been traversed at least once. Since the parse space grows exponentially in the number of edges, biased random walks are necessary to explore the most interesting parts of the space for large characters. The random walker stochastically prefers actions A that minimize the local angle of the stroke trajectory around the decision point…”
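- A sketch of such a biased random walk over the skeleton graph; the exp(-beta * angle) weighting is my stand-in for the paper's bias toward small turning angles:

```python
import numpy as np

def biased_random_walk(nodes, edges, start, beta=2.0, rng=None):
    """Walk a 'pen' over a character-skeleton graph until every edge is covered.
    nodes: {id: (y, x)}; edges: list of (u, v) node-id pairs (no self-loops assumed).
    Candidate moves are weighted by exp(-beta * turning_angle), so the walk
    stochastically prefers continuing in roughly the same direction."""
    rng = rng or np.random.default_rng()
    remaining = {tuple(sorted(e)) for e in edges}
    path, current, prev_dir = [start], start, None

    def unit(vec):
        vec = np.asarray(vec, dtype=float)
        return vec / (np.linalg.norm(vec) + 1e-9)

    while remaining:
        incident = [e for e in remaining if current in e]
        if not incident:                      # pen lift: jump to an untraversed edge
            current = next(iter(remaining))[0]
            path.append(current)
            prev_dir = None
            continue
        dirs = [unit(np.subtract(nodes[e[0] if e[1] == current else e[1]], nodes[current]))
                for e in incident]
        if prev_dir is None:
            probs = np.full(len(incident), 1.0 / len(incident))
        else:
            angles = [np.arccos(np.clip(np.dot(prev_dir, d), -1.0, 1.0)) for d in dirs]
            weights = np.exp(-beta * np.asarray(angles))
            probs = weights / weights.sum()
        idx = int(rng.choice(len(incident), p=probs))
        edge, prev_dir = incident[idx], dirs[idx]
        current = edge[0] if edge[1] == current else edge[1]
        remaining.discard(edge)
        path.append(current)
    return path
```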
- For the convnet baseline they used Caffe, and took a network that works well on MNIST
- <But it seems like this baseline doesn't have any of the special, stroke-specific engineering that went into HBPL; it only sees whole images>
- “The raw data was resized to 28 x 28 pixels and each image was centered based on its center of mass as in MNIST. We tried seven different architectures varying in depth and layer size, and we reported the model that performed best on the one-shot learning task.”
- <This may make the task easier, but MNIST deals with a small number of characters, many of which are much less complex than some of the characters used here. It might be the case that some of the more complex characters can’t be accurately reduced to such a small size, so this may be hobbling performance>
- Also the network is not very deep – only 2 conv layers and a max-pooling layer
- “One-shot classification was performed by computing image similarity through the feature representation in the 3000 unit hidden layer and using cosine similarity.”
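- A sketch of that feature-space one-shot classifier; `embed` is a placeholder for whatever returns the hidden-layer activations (e.g. the 3000-unit layer mentioned above):

```python
import numpy as np

def cosine_one_shot_classify(test_image, support_images, embed):
    """Nearest-neighbor one-shot classification by cosine similarity
    between feature vectors produced by `embed`."""
    q = embed(test_image)
    q = q / (np.linalg.norm(q) + 1e-12)
    sims = []
    for img in support_images:
        v = embed(img)
        sims.append(float(np.dot(q, v / (np.linalg.norm(v) + 1e-12))))
    return int(np.argmax(sims))
```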
- They used a smaller net for the 1-shot classification with less data, <so that was nice of them>
- The full “Siamese network” did work on the 105×105 image, had 4 conv layers and 1 standard hidden layer. Hyperparameters were tuned with Bayesian optimization
- “The Hierarchical Deep model is more “compositional” than the deep convnet, since learning-to-learn endows it with a library of high-level object parts (29). However, the model lacks abstract causal knowledge of strokes, and its internal representation is quite different than an explicit motor program.”
- For data collection “The raw mouse trajectories contain jitter and discretization artifacts, and thus spline smoothing was applied.”
- <Ok, skipping the rest>