Category Archives: Bayes

One-shot learning by inverting a compositional causal process. Lake, Salakhutdinov, Tenenbaum. NIPS 2013

  1. Deals with one-shot learning
  2. “…a Hierarchical Bayesian model based on compositionality and causality that can learn a wide range of natural (although simple) visual concepts, generalizing in human-like ways from just one image.”
  3. In 1-shot learning did about as well as people, and better than deep learning methods
  4. People can learn a new concept from a tiny amount of data – can learn a class from one image (which is a very high dimensional piece of data)
  5. Even MNIST, which is an old and small dataset, still has 6k samples/class, but people often only need 1 example
  6. “Additionally, while classification has received most of the attention in machine learning, people can generalize in a variety of other ways after learning a new concept. Equipped with the concept “Segway” or a new handwritten character (Figure 1c), people can produce new examples, parse an object into its critical parts, and fill in a missing part of an image. While this flexibility highlights the richness of people’s concepts, suggesting they are much more than discriminative features or rules, there are reasons to suspect that such sophisticated concepts would be difficult if not impossible to learn from very sparse data. ”
    1. Looks like people both have a rich hypothesis space (because they can do all the above with a small amount of data), but also don’t overfit, which is the theoretical downside to having a large hypothesis class.  How do they do it?
  7. Here, focus is on handwritten characters
    1. Idea is to use something more natural and complex than simple synthetic stimuli, but something less complex than natural images
  8. Use the new Omniglot dataset, which has ~1.6k characters (across 50 alphabets) with 20 examples each
    1. Also has time data so the strokes are recorded as well
  9. “… this paper also introduces Hierarchical Bayesian Program Learning (HBPL), a model that exploits the principles of compositionality and causality to learn a wide range of simple visual concepts from just a single example.”
  10. Also use the method to generate new examples of a class, and then do a Turing test with it by asking other humans which was human generated and which was machine generated
  11. The HBPL “…is compositional because characters are represented as stochastic motor programs where primitive structure is shared and re-used across characters at multiple levels, including strokes and sub-strokes.”
  12. The model attempts to find a “structural description” that explains the image by breaking the character down into parts
  13. A Character is made of:
    1. A set of strokes
      1. Each stroke is made of simple sub-strokes modeled by a “uniform cubic b-spline” and is built of primitive motor elements that are defined by a 1st order Markov Process
    2. Set of spatial relationships between strokes, can be:
      1. Independent: A stroke that has a location independent of other strokes
      2. Start/end: A stroke that starts at beginning/end of another stroke
      3. Along: A stroke that starts somewhere along a previous stroke
  14. ” Each trajectory … is a deterministic function of a starting location … token-level control points … and token-level scale …. The control points and scale are noisy versions of their type-level counterparts…”
  15. Used 30 most common alphabets for training, and another 20 for evaluation.  The training set was used to learn hyperparameters, a set of 1000 primitive motor elements, and stroke placement.  They attempted to do cross-validation within the training set
  16. The full set of possible ways a stroke could be created is enormous, so they have a bottom-up way of finding a set of the K most likely parses.  They approximate the posterior based on this finite, size-K sample, weighted by relative likelihood
    1. They then use Metropolis-Hastings to draw a number of samples around each parse, each with a little variance, to get a better estimate of the likelihoods
  17. “Given an approximate posterior for a particular image, the model can evaluate the posterior predictive score of a new image by re-fitting the token-level variables…”
  18. Results
  19. For the 1-shot tasks, a letter from an alphabet was presented with 20 other letters from the same alphabet.  Each person did this 10 times, but each time was with a totally new alphabet, so no character was ever seen twice
  20. Get K=5 parses of each character presented (along with MCMC), and then run K gradient searches to reoptimize the token-level variables to fit the query image.
  21. They can also, however, attempt to reoptimize the query image to fit the 20 options presented
  22. Compare against:
    1. Affine model
    2. Deep Boltzmann Machines
    3. Hierarchical Deep Model
    4. Simple Strokes (a simplified HBPL)
    5. NN
  23. Humans and HBPL ~4.5% error rate, affine model next at 18.2%
  24. Then they did one-shot Turing test where people and algorithms had to copy a single query character
    1. <For what it’s worth, I think Affine looks better than both the results from people and HBPL>
  25. In the “Turing test” there was feedback after each 10 trials, for a total of 50 trials
    1. <Note that this test doesn’t ask which character looks best, it is which is most confusable with human writing (which is pretty sloppy from the images they show).  I’m curious if the affine model could be made more human just by adding noise to its output>
  26. <Playing devil’s advocate, the images of characters were collected on mTurk, and look like they were probably drawn with a mouse — that is to say I feel they don’t look completely like natural handwriting.  I wonder how much of this program is picking up on those artifacts?  At least in terms of reproduction, the affine method looks best>
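The generative story in items 13–14 can be sketched as a toy program. This is my own illustrative reconstruction, not the authors’ code: the primitive library, transition matrix, spline evaluation, and noise levels are random made-up stand-ins for the pieces the paper actually learns from the background alphabets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical primitive library: each primitive is 4 control points of one
# uniform cubic B-spline segment. The real model learns ~1000 such primitives
# from the background alphabets; these are random stand-ins.
N_PRIM = 8
primitives = rng.normal(0.0, 1.0, size=(N_PRIM, 4, 2))
# 1st-order Markov transitions between primitive ids (rows sum to 1).
trans = rng.dirichlet(np.ones(N_PRIM), size=N_PRIM)

def bspline_segment(ctrl, n=20):
    """Evaluate one uniform cubic B-spline segment at n parameter values."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    basis = np.hstack([(1 - t) ** 3,
                       3 * t ** 3 - 6 * t ** 2 + 4,
                       -3 * t ** 3 + 3 * t ** 2 + 3 * t + 1,
                       t ** 3]) / 6.0
    return basis @ ctrl                      # (n, 2) trajectory

def sample_stroke(start, n_sub=2, noise=0.05):
    """Type level: a Markov chain of primitive ids; token level: noisy control
    points, so every rendering of the same type comes out slightly different."""
    pid = rng.integers(N_PRIM)
    pos = np.asarray(start, dtype=float)
    pieces = []
    for _ in range(n_sub):
        ctrl = primitives[pid] + rng.normal(0.0, noise, size=(4, 2))
        seg = bspline_segment(ctrl)
        traj = pos + seg - seg[0]            # attach sub-stroke at the pen position
        pieces.append(traj)
        pos = traj[-1]
        pid = rng.choice(N_PRIM, p=trans[pid])
    return np.vstack(pieces)

# Relations (simplified): 'independent' places a stroke freely; the others
# attach a stroke at the start/end of, or along, a previous stroke.
s1 = sample_stroke(start=(0.0, 0.0))         # independent
s2 = sample_stroke(start=s1[-1])             # starts at the end of s1
s3 = sample_stroke(start=s1[len(s1) // 2])   # starts somewhere along s1
character = [s1, s2, s3]
print(len(character), character[0].shape)
```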


Human-level concept learning through probabilistic program induction. Lake, Salakhutdinov, Tenenbaum. Science 2015

  1. “Concepts are represented as simple probabilistic programs—that is, probabilistic generative models expressed as structured procedures in an abstract description language (…). Our framework brings together three key ideas—compositionality, causality, and learning to learn—that have been separately influential in cognitive science and machine learning over the past several decades (…). As programs, rich concepts can be built “compositionally” from simpler primitives
  2. “In short, BPL can construct new programs by reusing the pieces of existing ones, capturing the causal and compositional properties of real-world generative processes operating on multiple scales.”
  3. <Looks like exactly the same paper, just more brief.  The accuracies of both BPL and other methods seems improved here, though.  Convnets get 13.5% error; BPL gets 3.3%; people get 4.5%.  “A deep Siamese convolutional network optimized for this one-shot learning task achieved 8.0% errors”>
  4. “BPL’s advantage points to the benefits of modeling the underlying causal process in learning concepts, a strategy different from the particular deep learning approaches examined here.”
    1. <Or equivalently you can just say BPL does better because it has a small and highly engineered hypothesis class>
  5. Also run BPL with various “lesions” and gets error rates in the teens.  Also did more poorly in the “Turing test” part
  6. Instead of training on 30 background alphabets, they also did with just 5, and there the error rates are about 4%; on the same set convnets did about 20% error
  7. Supplementary Material

  8. <I assumed that they would ask individuals who actually learned how to write the languages to do the recordings.  Instead, they just took pictures of characters and had people write them.  This seems like a problem to me because of inconsistencies in the way people would actually do the strokes of a letter in an alphabet they do not know.>
  9. <Indeed, they were also drawn by mouse in a box on a screen, which is a very unnatural way to do things>
  10. <From what I can tell the characters are recorded in pretty low resolution as well which looks like it can cause artifacts, looks like 105×105>
  11. <This basically has the details that were included in the main part of the NIPS paper>
  12. Some extra tricks like convolving with Gaussian filter, randomly flipping bits
  13. Primitives are scale-selective
  14. “For each image, the center of mass and range of the inked pixels was computed. Second, images were grouped by character, and a transformation (scaling and translation) was computed for each image so that its mean and range matched the group average.”
  15. ” In principle, generic MCMC algorithms such as the one explored in (66) can be used, but we have found this approach to be slow, prone to local minima, and poor at switching between different parses. Instead, inspired by the speed of human perception and approaches for faster inference in probabilistic programs (67), we explored bottom-up methods to compute a fast structural analysis and propose values of the latent variables in BPL. This produces a large set of possible motor programs – each approximately fit to the image of interest. The most promising motor programs are chosen and refined with continuous optimization and MCMC.”
  16. “A candidate parse is generated by taking a random walk on the character skeleton with a “pen,” visiting nodes until each edge has been traversed at least once. Since the parse space grows exponentially in the number of edges, biased random walks are necessary to explore the most interesting parts of the space for large characters. The random walker stochastically prefers actions A that minimize the local angle of the stroke trajectory around the decision point…”
  17. For the ANN they used Caffe, and took a network that works well on MNIST
    1. <But it seems like this system doesn’t have any of the special engineering that went into this that deals specifically with strokes as opposed to whole images>
    2. “The raw data was resized to 28 x 28 pixels and each image was centered based on its center of mass as in MNIST. We tried seven different architectures varying in depth and layer size, and we reported the model that performed best on the one-shot learning task.”
    3. <This may make the task easier, but MNIST deals with a small number of characters, many of which are much less complex than some of the characters used here.   It might be the case that some of the more complex characters can’t be accurately reduced to such a small size, so this may be hobbling performance>
    4. Also the network is not very deep – only 2 conv layers and a max-pooling
    5. “One-shot classification was performed by computing image similarity through the feature representation in the 3000 unit hidden layer and using cosine similarity.”
    6. They used a smaller net for the 1-shot classification with less data, <so that was nice of them>
  18. The full “Siamese network” did work on the 105×105 image, had 4 conv layers and 1 standard hidden layer.  Parameters were optimized with Bayesian method
  19. “The Hierarchical Deep model is more “compositional” than the deep convnet, since learning-to-learn endows it with a library of high-level object parts (29). However, the model lacks a abstract causal knowledge of strokes, and its internal representation is quite different than an explicit motor program. “
  20. For data collection “The raw mouse trajectories contain jitter and discretization artifacts, and thus spline smoothing was applied.”
  21. <Ok, skipping the rest>
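The biased random walk in item 16 is concrete enough to sketch. Everything below is a simplified assumption on my part (a toy skeleton graph, an exp(-angle) bias, no pen lifts between strokes); only the loop structure comes from the quoted description.

```python
import math
import random

random.seed(1)

# Hypothetical character skeleton: nodes with 2-D coordinates, undirected
# edges stored as sorted tuples.
coords = {0: (0, 0), 1: (1, 0), 2: (1, 1), 3: (0, 1)}
edges = {(0, 1), (1, 2), (2, 3), (0, 2)}

def neighbors(n):
    return [b if a == n else a for a, b in edges if n in (a, b)]

def turn_angle(prev, cur, nxt):
    """Local angle of the pen trajectory at cur, coming from prev toward nxt."""
    v1 = (coords[cur][0] - coords[prev][0], coords[cur][1] - coords[prev][1])
    v2 = (coords[nxt][0] - coords[cur][0], coords[nxt][1] - coords[cur][1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def propose_parse(start=0):
    """Random walk with a 'pen' until every edge is traversed at least once,
    stochastically preferring straight continuations (small turn angles)."""
    remaining = set(edges)
    prev, cur = None, start
    path = [cur]
    while remaining:
        options = neighbors(cur)
        if prev is None:
            weights = [1.0] * len(options)
        else:
            weights = [math.exp(-2.0 * turn_angle(prev, cur, n)) for n in options]
        nxt = random.choices(options, weights=weights)[0]
        remaining.discard(tuple(sorted((cur, nxt))))
        prev, cur = cur, nxt
        path.append(cur)
    return path

parse = propose_parse()
print(parse)
```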

Probabilistic machine learning and artificial intelligence. Zoubin Ghahramani. Nature 2015

  1. “discusses some of the state-of-the-art advances in the field, namely, probabilistic programming, Bayesian optimization, data compression and automatic model discovery.”
  2. Given some observed data, there can be many (often infinite) models consistent with the data.  Uncertainty comes up in terms of making the model, and then having the model produce predictions.  “Probability theory provides a framework for modelling uncertainty”
  3. “…the scope of machine-learning tasks is even broader than these pattern classification or mapping tasks, and can include optimization and decision making, compressing data and automatically extracting interpretable models from data.”
  4. “Since any sensible model will be uncertain when predicting unobserved data, uncertainty plays a fundamental part in modelling.”
  5. “There are many forms of uncertainty in modelling. At the lowest level, model uncertainty is introduced from measurement noise, for example, pixel noise or blur in images. At higher levels, a model may have many parameters, such as the coefficients of a linear regression, and there is uncertainty about which values of these parameters will be good at predicting new data. Finally, at the highest levels, there is often uncertainty about even the general structure of the model: is linear regression or a neural network appropriate, if the latter, how many layers should it have, and so on. The probabilistic approach to modelling uses probability theory to express all forms of uncertainty9. Probability theory is the mathematical language for representing and manipulating uncertainty10, in much the same way as calculus is the language for representing and manipulating rates of change.”
  6. Learning occurs by taking a prior, adding data, and producing a posterior
  7. Probability theory is composed of the product rule and sum rule
  8. “The dominant paradigm in machine learning over the past two decades for representing such compositional probabilistic models has been graphical models11, with variants including directed graphs (also known as Bayesian networks and belief networks), undirected graphs (also known as Markov networks and random fields), and mixed graphs with both directed and undirected edges (Fig. 1). As discussed later, probabilistic programming offers an elegant way of generalizing graphical models, allowing a much richer representation of models. The compositionality of probabilistic models means that the behaviour of these building blocks in the context of the larger model is often much easier to understand than, say, what will happen if one couples a non-linear dynamical system (for example, a recurrent neural network) to another.”
  9. Computationally, a problem in many models is the integration required to sum out variables not of interest.  In many cases, there is no poly-time algorithm.
    1. Approximation is possible, however, by MCMC and sequential monte-carlo
  10. “It is worth noting that computational techniques are one area in which Bayesian machine learning differs from much of the rest of machine learning: for Bayesian researchers the main computational problem is integration, whereas for much of the rest of the community the focus is on optimization of model parameters. However, this dichotomy is not as stark as it appears: many gradient-based optimization methods can be turned into integration methods through the use of Langevin and Hamiltonian Monte Carlo methods27, 28, while integration problems can be turned into optimization problems through the use of variational approximations24.”
  11. To build flexible models, you can allow for many parameters (such as what is used in large-scale neural networks), or you can use models that are non-parametric
    1. The former has fixed complexity; the latter has complexity that grows with data size (“either by considering a nested sequence of parametric models with increasing numbers of parameters or by starting out with a model with infinitely many parameters.”)
  12. “Many non-parametric models can be derived starting from a parametric model and considering what happens as the model grows to the limit of infinitely many parameters.”
  13. Models with infinitely many parameters would usually overfit, but Bayesian methods don’t do this because they average over instead of fitting parameters
  14. Quick discussion of some Bayesian non-parametrics
    1. Gaussian Processes (cites “GaussianFace”, a state-of-the-art application to face recognition that beats humans and deep learning)
    2. Dirichlet Processes (can be used for time-series)
    3. “The IBP [Indian Buffet Process] can be thought of as a way of endowing Bayesian non-parametric models with ‘distributed representations’, as popularized in the neural network literature.”
  15. “An interesting link between Bayesian non-parametrics and neural networks is that, under fairly general conditions, a neural network with infinitely many hidden units is equivalent to a Gaussian process.”
  16. “Note that the above non-parametric components should be thought of again as building blocks, which can be composed into more complex models as described earlier. The next section describes an even more powerful way of composing models — through probabilistic programming.”
  17. Talks about probabilistic programming (like CHURCH)
    1. Very flexible <but computationally very expensive, often built on MCMC>
  18. Bayesian Optimization (like GP-UCB)
  19. Compression
  20. Compression and probabilistic modelling are really the same thing (Shannon)
    1. Better model allows more compression
  21. Best compression algorithms are equivalent to Bayesian nonparametric methods
  22. Bayesian methods to make an “automatic statistician” (scientific model discovery)
  23. Central challenge in the field is addressing the computational complexity, although “Modern inference methods have made it possible to scale to millions of data points, making probabilistic methods computationally competitive with conventional methods”
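Items 6–7 can be made concrete with a toy discrete model (the numbers are made up): the product rule builds the joint, the sum rule marginalizes it, and the two together give Bayes’ rule.

```python
# Prior over two hypotheses (e.g. "coin is fair" vs "coin is biased to heads").
prior = {"fair": 0.5, "biased": 0.5}
# Likelihood of observing heads under each hypothesis.
lik_heads = {"fair": 0.5, "biased": 0.9}

# Product rule: p(h, x) = p(x | h) p(h)
joint = {h: lik_heads[h] * prior[h] for h in prior}

# Sum rule: p(x) = sum_h p(h, x)   (the marginal likelihood / evidence)
evidence = sum(joint.values())

# Bayes' rule: p(h | x) = p(h, x) / p(x) -- prior plus data yields posterior
posterior = {h: joint[h] / evidence for h in prior}
print(posterior)  # seeing heads shifts belief toward "biased"
```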

Learning probability distributions over partially-ordered human everyday activities. Tenorth, De la Torre, Beetz. ICRA 2013

  1. Attempt “… to learn the partially ordered structure inherent in human everyday activities from observations by exploiting variability in the data.”
  2. Learns full joint probability over actions that make up a task, their partial ordering, and their parameters
  3. Can be used for classification, but also figuring out what actions are relevant to a task, what objects are used, if it was done correctly, or what is typical for an individual
  4. Use synthetic data as well as TUM Kitchen and CMU MMAC
  5. Use Bayesian Logic Nets (another paper has author overlap and uses the same approach)
  6. Common alternate approaches are HMMs, conditional random fields (CRFs), or suffix trees
    1. But these are most effective when the ordering of the subtasks is pretty fixed
    2. Also Markov assumption of HMM doesn’t really hold in the way data is often represented and may require all history information
  7. Also some other approaches for representing partial ordering
    1. <Whatever this means> “All these approaches focus only on the ordering among atomic action entities, while our system learns a distribution over the order as well as the action parameters.”
  8. Literature on partially ordered plans require lots of prior information, and have been applied to synthetic problems
  9. Working off the TUM dataset, they needed to resort to bagging and data noisification techniques in order to get enough samples
  10. Needs annotation / data labeling
  11. Learns, for example, that after making brownies (the CMU MMAC dataset) some people like to put the frying pan in the sink, and others on the stove
  12. Performance of this approach is much better than that of CRFs, and is more robust to variability in how people undertake tasks
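As a crude stand-in for item 2 (this is not their Bayesian Logic Network model, just a counting illustration with invented action labels), pairwise precedence probabilities can be estimated directly from observed action sequences:

```python
from collections import Counter
from itertools import combinations

# Hypothetical observations of one task, each a sequence of action labels.
episodes = [
    ["take_bowl", "crack_egg", "stir", "pour"],
    ["take_bowl", "stir", "crack_egg", "pour"],
    ["take_bowl", "crack_egg", "stir", "pour"],
]

before = Counter()   # (a, b) -> episodes in which a occurs before b
pairs = Counter()    # (a, b) -> episodes containing both a and b

for ep in episodes:
    pos = {a: i for i, a in enumerate(ep)}
    for a, b in combinations(sorted(pos), 2):
        pairs[(a, b)] += 1
        if pos[a] < pos[b]:
            before[(a, b)] += 1

# P(a before b): values near 0 or 1 suggest a hard ordering constraint,
# values near 0.5 suggest the two actions are genuinely unordered.
p_before = {k: before[k] / pairs[k] for k in pairs}
print(p_before[("crack_egg", "stir")])  # weakly ordered
print(p_before[("crack_egg", "pour")])  # hard constraint
```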

Towards Automated Models of Activities of Daily Life. Beetz, Tenorth, Jain, Bandouch. Technology and Disability 2010.

  1. Proposes automated probabilistic models of everyday activities (AM-EvA)
    1. Works at differing levels of abstraction, starting at joint poses and trajectories
    2. Integrates different kinds of action models into one framework with a priori knowledge about actions
  2. <Seems like their motivation is a bit different as> they are interested to see if activities degrade or change (forget what you were doing with dementia)
  3. They also have “kitchen” activities recorded, but these involve setting a table, including getting stuff from cabinets; there is also a skeleton model
    1. No head-cam, it seems
    2. But data is very nicely annotated
  4. Data from multi-cam setup alone, tracks 51 DOF (from Bayesian particle filters)
  5. Mentions Gaussian Process Dynamical Models for embedding human motion data into latent space <need to check this out>
    1. This can “…serve as a starting point for the (unsupervised) segmentation of trajectories into meaningful fragments, whose sequential ordering in turn provides the input for a further interpretation of the overall action sequence in a (discrete) time-series model.  Such an unsupervised approach will group motions mainly with respect to their kinematic or dynamic properties.”
  6. Models here are generative, so can be used to predict, and even assign probabilities to motion sequences
  7. Discuss action segmentation, and then that combined with hierarchical segmentation
  8. Mentions that external descriptions of a behavior can also be incorporated <although it’s not clear how>.  Took data from ehow, parsed it, and then ran it through their segmented hierarchical data to try to see what the person was doing, although it seems primarily speculative at this point as they write “We will explore this direction of research in the near future.”
  9. Although doing the same activity multiple times will result in variability, some of the behavior will remain the same; they seek to find those regularities.
  10. Their AM-EvA framework supports two types of statistical relational models:
    1. Bayesian logic networks (BLN)
    2. Markov logic networks
    3. Both of these allow for representation of “meta-model of probability distributions, i.e. a template for the construction of a concrete probability distribution that can be represented as a graphical model… [either 1 or 2 immediately above]”
    4. Bayesian logic networks are a little more restrictive, but likewise they are easier to use, so it’s what the authors go with
  11.  BLN is a Bayesian network that does not allow constructs that conflict with provided logic
  12. <This stuff is pretty slick.  Right now I don’t think we are interested in discretizing behavior into segments, but if we do it in the end, this is a neat option to check out>
  13. Because things are segmented and hierarchically labeled, can query the corpus for certain behaviors (such as episodes where something was carried with both hands), and it can pull them out
    1. Can also be queried in other ways for “action-related concepts”, such as where was someone standing when doing a certain action.  This can be used, among other things, for user identification
    2. Can give probability estimates of what higher-order behavior is going on, given some lower order behavior
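A toy version of the queries in item 13 (simple counting over a hypothetical annotated corpus, nothing like the actual BLN machinery): estimate P(high-level activity | observed low-level action).

```python
from collections import Counter

# Hypothetical annotated segments: (high-level activity, low-level action).
corpus = [
    ("set_table", "open_cabinet"), ("set_table", "carry_plate"),
    ("set_table", "carry_plate"), ("make_coffee", "open_cabinet"),
    ("make_coffee", "fill_kettle"), ("set_table", "open_drawer"),
]

def query(action):
    """P(activity | observed low-level action), by simple counting."""
    hits = Counter(act for act, low in corpus if low == action)
    total = sum(hits.values())
    return {act: n / total for act, n in hits.items()}

print(query("open_cabinet"))  # ambiguous action: belief split across activities
print(query("carry_plate"))   # diagnostic action: belief concentrates
```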

Sharing Features among Dynamical Systems with Beta Processes. Fox, Sudderth, Jordan, Willsky. NIPS 2009

  1. A Bayesian nonparametric approach for modeling time series
    1. Beta process prior “approach is based on the discovery of a set of latent dynamical behaviors that are shared among multiple time series.  The size of the set and the sharing pattern are both inferred from data.”
  2. Develop an efficient MCMC method based on the Indian buffet process
  3. “We specifically focus on time series where behaviors can be individually modeled via temporally independent or linear dynamical systems, and where transitions between behaviors are approximately Markovian.”
    1. Examples are HMM, switching vector autoregressive process, linear dynamical systems
  4. “Our approach envisions a large library of behaviors, and each time series or object exhibits a subset of these behaviors.  We then seek a framework for discovering the set of dynamic behaviors that each object exhibits.”
  5. The behaviors an object can exhibit are described in a feature list; N objects with K features can be described by an N×K binary matrix
    1. The beta process (via the Indian buffet process) is used to infer the number of features
  6. “Given a feature set sampled from the IBP, our model reduces to a collection of Bayesian HMMs (or SLDS) with partially shared parameters.”
  7. Also mention:
    1. HDP-HMM: “does not select a subset of behaviors for a given time series, but assumes that all time series share the same set of behaviors and switch among them in exactly the same manner.”
    2. Infinite factorial HMM: “models a single time-series with emissions dependent on a potentially infinite dimensional feature that evolves with independent Markov dynamics.”
  8. MCMC method is “efficient and exact”
  9. <This is a little heavy for my faculties at the moment so skimming>
  10. MCMC interleaves Metropolis-Hastings with Gibbs “We leverage the fact that fixed feature assignments instantiate a set of finite AR-HMMs, for which dynamic programming can be used to efficiently compute marginal likelihoods. Our novel approach to resampling the potentially infinite set of object-specific features employs incremental “birth” and “death” proposals…”
  11. [screenshot of a figure from the paper omitted]
  12. Mocap experiments
  13. Discusses other methods that have been good for “describing simple human motion…”, “However, there has been little effort in jointly segmenting and identifying common dynamic behaviors amongst a set of multiple motion capture (MoCap) recordings of people performing various tasks.”  BP-AR-HMM does this
  14. Looked at 6 CMU exercise routines.  Original data was 62D, they manually selected 12 dimensions from that set, subsample in time as well
  15. Does a pretty nice job clustering motions
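The N×K feature matrix of item 5 gets its prior from the Indian buffet process. A sampler for the textbook IBP generative process is easy to sketch (my own sketch; the paper’s actual BP-AR-HMM inference is far beyond this):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ibp(n_objects, alpha=2.0):
    """Draw a binary object-by-feature matrix Z from the Indian buffet process.

    Object i takes existing feature k with probability m_k / i (m_k = number of
    earlier objects with that feature), then invents Poisson(alpha / i) new
    features, so the number of columns K is inferred rather than fixed.
    """
    rows, counts = [], []          # counts[k] = objects having feature k so far
    for i in range(1, n_objects + 1):
        feats = {k for k, m in enumerate(counts) if rng.random() < m / i}
        for k in feats:
            counts[k] += 1
        for _ in range(rng.poisson(alpha / i)):   # brand-new features
            counts.append(1)
            feats.add(len(counts) - 1)
        rows.append(feats)
    Z = np.zeros((n_objects, len(counts)), dtype=int)
    for i, feats in enumerate(rows):
        Z[i, list(feats)] = 1
    return Z

Z = sample_ibp(10)
print(Z.shape)   # (10, K) with K random -- the size of the set is inferred
```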

Adaptive Submodularity: A New Approach to Active Learning and Stochastic Optimization. Golovin, Krause, Ray. Talk (NIPS 2010)


  1. Basic idea is to consider submodular functions that are stochastic and unfold over time.  So instead of making all decisions in one shot (before execution begins), it is better to wait to see how the first decision unfolds before making the second, etc.
    1. Gives the example of influence in social networks – if you want to get people to plug a product, you give it to a few people for free, and see how influence spreads from the first person before deciding on the second
  2. The setting described here is a generalization of standard submodularity
  3. The analysis here carries many results from standard submodular optimization over to the adaptive setting
  4. Instead of maximizing margin, maximizes expected margin based on current state, but does this only for the first selection, and the results of the first selection on state then become conditions for the next choice
  5. Gets standard bounds of 1-1/e
  6. Because adaptive, can now do things like “take the minimal number of actions needed to achieve X”, which you couldn’t do with standard submodularity (there, you can just say “maximize the benefit of N choices”)
    1. This is a log-approximation ln(n)+1 for stochastic set cover
  7. Another application optimal decision trees
    1. “Generalized binary search” which is equivalent to max info gain
    2. This problem is adaptive submodular <I thought things like XOR in decision trees aren’t submodular, but maybe because we are talking w.r.t. information gain it is?>
    3. Give bounds for this approximation
    4. “Results require that tests are exact (no noise)!”
  8. If trying to do decision trees in noisy setting, can do a Bayesian version
    1. But generalized binary search, maxing info gain, and maxing value of info aren’t adaptive submodular in noisy setting.  In noisy settings they require exponentially more tests than optimum <Should really read this paper>
  9. Trick is to transform noisy problem into normal problem by making outcome part of the hypothesis
    1. Only need to distinguish between noisy hypotheses that lead to different decisions
  10. Gives bounds for this Bayesian active learning for noisy domains
  11. Give an example of testing different psychological/economic models of how people perform decision making
    1. For this setting, turns out many common approaches actually do much worse than random
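The adaptive loop of items 4 and 6 can be sketched for stochastic set cover. All the numbers and the coverage structure below are invented for illustration; the point is the loop: pick the item with the best *expected* marginal gain given outcomes observed so far, observe, then pick again.

```python
# Each item covers different elements depending on a binary realized state
# (e.g. a seeded user who may or may not spread influence). Toy problem.
coverage = {
    "a": {0: {0}, 1: {0, 1}},   # item -> realized state -> elements covered
    "b": {0: {2}, 1: {1, 2}},
    "c": {0: {3}, 1: {0, 3}},
}
p_state = {"a": 0.5, "b": 0.8, "c": 0.5}   # P(state = 1), independent items

def covered(items, states):
    out = set()
    for it in items:
        out |= coverage[it][states[it]]
    return out

def adaptive_greedy(true_states, budget=2):
    chosen, observed = [], {}
    for _ in range(budget):
        best, best_gain = None, -1.0
        for item in coverage:
            if item in chosen:
                continue
            # Expected marginal gain, conditioned on outcomes seen so far.
            gain = 0.0
            for s in (0, 1):
                p = p_state[item] if s == 1 else 1 - p_state[item]
                new = covered(chosen + [item], {**observed, item: s}) \
                      - covered(chosen, observed)
                gain += p * len(new)
            if gain > best_gain:
                best, best_gain = item, gain
        chosen.append(best)
        observed[best] = true_states[best]   # outcome revealed only after acting
    return chosen, covered(chosen, observed)

chosen, got = adaptive_greedy({"a": 1, "b": 0, "c": 1})
print(chosen, got)   # greedy tries "b" first (highest expected gain)
```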

Bayesian Inference in MCTS. Tesauro, Rajan, Segal. UAI 2010.

  1. Is inspired by distribution-free (frequentist) approaches like UCT but uses Bayesian Inference to perform estimates of values
  2. Bayesian approach allows alg designers to leverage evaluation functions differently
  3. When assumptions hold, it is Bayes-optimal 
  4. Stochastic trial results at leaf nodes are combined with the prior to create posterior estimates of value in the nodes that are then propagated upwards, which occurs recursively
  5. For MDPs, distributions are based on a distributional max operator (there is a corresponding min for game trees)
    1. Likewise, there are distribution based upper-confidence bounds
  6. Bayesian estimates allow for faster convergence with limited data compared to UCT
  7. A paper is cited that says average value estimates are (sample) inefficient and often underestimate the true value
  8. Computation costs are within an order of magnitude of UCT
    1. Because the approach is more sample efficient, it is advantageous in cases where sampling is expensive (such as Go)
  9. Bayesian MCTS is robust to sampling policies.  Convergence of UCT depends on focusing majority of trials on the optimal path
    1. Bayesian MCTS should better leverage parallelization / off policy sampling
  10. Max/min values are based on extremum distributions.  What’s that?
    1. It is a set of random variables
    2. Min, max values are then computed based on this set.  Can be done by integration or Gaussian approximation
  11. The modifications to UCT proposed are very simple
  12. Gaussian approximations are appealing because they are simple and computationally cheap.  Also, some simple calculation can take into account correlation of sibling leaf nodes
  13. Looks like worst-case scenarios that are possible when using Gaussian approximation don’t come up in practice
  14. The empirical domains are very simple, but reasonable
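The extremum distributions of item 10 are easy to illustrate by sampling (a Monte Carlo sketch, not the paper’s exact backup operator; the node values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Posterior value estimates for three child nodes as independent Gaussians.
# A max backup propagates the distribution of max_i V_i, which is *not* a
# Gaussian centered at max_i E[V_i].
means = np.array([0.2, 0.0, -0.1])
stds = np.array([0.3, 0.5, 0.4])

samples = rng.normal(means, stds, size=(200_000, 3)).max(axis=1)
mu_max, sd_max = float(samples.mean()), float(samples.std())
print(mu_max > means.max())   # True: E[max V_i] exceeds max_i E[V_i]

# Sanity check against a closed form: for two independent N(0,1) variables,
# E[max] = 1 / sqrt(pi) ~ 0.564.
pair = rng.normal(0.0, 1.0, size=(200_000, 2)).max(axis=1)
print(round(float(pair.mean()), 2))
```

A Gaussian approximation of the extremum would just moment-match `mu_max` and `sd_max`, which is the kind of cheap shortcut item 12 refers to.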

Ideas from: Memory-based Stochastic Optimization; Moore, Schneider (NIPS 1995)

  1. Basic setting is the stochastic (multidimensional) continuous bandit problem.
  2. The other competing algorithms for the domain at the time this paper was written don’t seem very sophisticated and are variants of gradient ascent.
  3. Uses a Bayesian method of locally weighted polynomial model, which gives a distribution over predictions for a query point.
  4. Noise is assumed, but the parameters are initially unknown (then learned).
  5. Claim the Bayesian approach allows for more reasonable behavior when the density of sampled points is low.
  6. The optimal answer is intractable, but four proposed approximation algorithms are listed, a comparison of those 4 and a few others are shown empirically in 2 different domains.
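A sketch of item 3’s idea: locally weighted (linear, for brevity) regression that returns a predictive mean and variance at a query point. The ridge-style prior and the variance formula are my simplifications, not the paper’s exact Bayesian model, and the data are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy observations of an unknown 1-D reward surface.
X = np.linspace(-2.0, 2.0, 40)
y = np.sin(X) + rng.normal(0.0, 0.1, size=X.size)

def lwr_predict(xq, X, y, bandwidth=0.5, tau2=100.0):
    """Locally weighted linear fit around query point xq.

    A Gaussian kernel weights nearby points; a weak ridge prior (precision
    1/tau2) keeps the weighted least squares well behaved when few points
    fall near xq. Returns a predictive mean and a crude predictive variance.
    """
    w = np.exp(-0.5 * ((X - xq) / bandwidth) ** 2)      # kernel weights
    A = np.stack([np.ones_like(X), X - xq], axis=1)     # local linear design
    S = A.T @ (w[:, None] * A) + np.eye(2) / tau2       # regularized normal eqns
    m = np.linalg.solve(S, A.T @ (w * y))
    resid = y - A @ m
    noise = float((w * resid ** 2).sum() / w.sum())     # local noise level
    var = noise * (1.0 + np.linalg.inv(S)[0, 0])        # crude predictive variance
    return float(m[0]), var                             # value at xq = intercept

mu, var = lwr_predict(0.5, X, y)
print(round(mu, 2), var > 0)   # mean should land near sin(0.5) ~ 0.48
```

The distribution over predictions is what lets the method trade off exploring poorly sampled regions against exploiting the current optimum, which is the point item 5 makes about low sample density.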