- “discusses some of the state-of-the-art advances in the field, namely, probabilistic programming, Bayesian optimization, data compression and automatic model discovery.”
- Given some observed data, there can be many (often infinitely many) models consistent with the data. Uncertainty arises both in choosing the model and in the predictions the model then makes. “Probability theory provides a framework for modelling uncertainty”
- “…the scope of machine-learning tasks is even broader than these pattern classification or mapping tasks, and can include optimization and decision making, compressing data and automatically extracting interpretable models from data.”
- “Since any sensible model will be uncertain when predicting unobserved data, uncertainty plays a fundamental part in modelling.”
- “There are many forms of uncertainty in modelling. At the lowest level, model uncertainty is introduced from measurement noise, for example, pixel noise or blur in images. At higher levels, a model may have many parameters, such as the coefficients of a linear regression, and there is uncertainty about which values of these parameters will be good at predicting new data. Finally, at the highest levels, there is often uncertainty about even the general structure of the model: is linear regression or a neural network appropriate, if the latter, how many layers should it have, and so on. The probabilistic approach to modelling uses probability theory to express all forms of uncertainty^{9}. Probability theory is the mathematical language for representing and manipulating uncertainty^{10}, in much the same way as calculus is the language for representing and manipulating rates of change.”
- Learning occurs by taking a prior, adding data, and producing a posterior
- Probability theory reduces to just two rules: the sum rule and the product rule
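The two rules, plus Bayes' rule (which follows from them), can be shown concretely on a toy discrete distribution. All numbers below are hypothetical, chosen only for illustration:

```python
# Toy prior over two models and likelihood of one observed datum under each.
# Values are made up for illustration.
p_model = {"A": 0.7, "B": 0.3}             # prior p(m)
p_data_given_model = {"A": 0.2, "B": 0.9}  # likelihood p(d | m)

# Product rule: p(m, d) = p(d | m) * p(m)
p_joint = {m: p_data_given_model[m] * p_model[m] for m in p_model}

# Sum rule: p(d) = sum over m of p(m, d)  (marginalizing out the model)
p_data = sum(p_joint.values())

# Bayes' rule combines both: posterior p(m | d) = p(m, d) / p(d)
posterior = {m: p_joint[m] / p_data for m in p_model}
```

Note how the datum shifts belief from the prior (0.7 on A) toward B, because B assigns the observation higher likelihood.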
- “The dominant paradigm in machine learning over the past two decades for representing such compositional probabilistic models has been graphical models^{11}, with variants including directed graphs (also known as Bayesian networks and belief networks), undirected graphs (also known as Markov networks and random fields), and mixed graphs with both directed and undirected edges (Fig. 1). As discussed later, probabilistic programming offers an elegant way of generalizing graphical models, allowing a much richer representation of models. The compositionality of probabilistic models means that the behaviour of these building blocks in the context of the larger model is often much easier to understand than, say, what will happen if one couples a non-linear dynamical system (for example, a recurrent neural network) to another.”
- Computationally, the main difficulty in many models is the integration required to sum out variables that are not of interest; in many cases no polynomial-time algorithm exists.
- Approximation is possible, however, via Markov chain Monte Carlo (MCMC) and sequential Monte Carlo methods
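A minimal sketch of the MCMC idea: a random-walk Metropolis sampler (a special case of Metropolis–Hastings) that draws samples from a density known only up to a normalizing constant. The target and step size here are arbitrary choices for illustration:

```python
import math
import random

def metropolis(log_p, x0, n_steps, step=1.0, seed=0):
    """Random-walk Metropolis: sample from p(x) known only up to a constant."""
    rng = random.Random(seed)
    x, lp = x0, log_p(x0)
    samples = []
    for _ in range(n_steps):
        x_prop = x + rng.gauss(0.0, step)   # propose a local move
        lp_prop = log_p(x_prop)
        # Accept with probability min(1, p(x_prop) / p(x))
        if math.log(rng.random()) < lp_prop - lp:
            x, lp = x_prop, lp_prop
        samples.append(x)
    return samples

# Target: standard normal, log p(x) = -x^2 / 2 (the normalizer is not needed)
samples = metropolis(lambda x: -0.5 * x * x, 0.0, 20000)
```

Only ratios of the target density appear in the accept step, which is why the intractable normalizing integral never has to be computed.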

- “It is worth noting that computational techniques are one area in which Bayesian machine learning differs from much of the rest of machine learning: for Bayesian researchers the main computational problem is integration, whereas for much of the rest of the community the focus is on optimization of model parameters. However, this dichotomy is not as stark as it appears: many gradient-based optimization methods can be turned into integration methods through the use of Langevin and Hamiltonian Monte Carlo methods^{27, 28}, while integration problems can be turned into optimization problems through the use of variational approximations^{24}.”
- To build flexible models, you can allow for many parameters (as in large-scale neural networks), or you can use models that are non-parametric
- The former has a fixed complexity; the latter's complexity grows with data size (“either by considering a nested sequence of parametric models with increasing numbers of parameters or by starting out with a model with infinitely many parameters.”)

- “Many non-parametric models can be derived starting from a parametric model and considering what happens as the model grows to the limit of infinitely many parameters.”
- Models with infinitely many parameters would usually overfit, but Bayesian methods avoid this by averaging over parameters rather than fitting them
- Quick discussion of some Bayesian non-parametrics
- Gaussian processes (cites “GaussianFace”, a state-of-the-art application to face recognition that beats both humans and deep-learning methods)
- Dirichlet Processes (can be used for time-series)
- “The IBP [Indian Buffet Process] can be thought of as a way of endowing Bayesian non-parametric models with ‘distributed representations’, as popularized in the neural network literature.”
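Of the building blocks above, the Gaussian process is the easiest to show in a few lines: conditioning a GP on observed points gives a closed-form posterior mean and variance at any test input. A minimal regression sketch with a squared-exponential kernel (the kernel hyperparameters and data are arbitrary choices for illustration):

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """Squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 ell^2))."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def gp_posterior(X, y, Xs, noise=1e-2):
    """Posterior mean and pointwise variance of a zero-mean GP at test inputs Xs."""
    K = rbf(X, X) + noise * np.eye(len(X))  # kernel matrix + observation noise
    Ks = rbf(X, Xs)
    mean = Ks.T @ np.linalg.solve(K, y)
    # prior variance (1 for this kernel) minus the part explained by the data
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mean, var

X = np.array([-1.0, 0.0, 1.0])   # training inputs
y = np.sin(X)                    # training targets
Xs = np.linspace(-2.0, 2.0, 5)   # test inputs
mean, var = gp_posterior(X, y, Xs)
```

The posterior variance grows away from the data, which is exactly the calibrated uncertainty the probabilistic approach promises.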

- “An interesting link between Bayesian non-parametrics and neural networks is that, under fairly general conditions, a neural network with infinitely many hidden units is equivalent to a Gaussian process.”
- “Note that the above non-parametric components should be thought of again as building blocks, which can be composed into more complex models as described earlier. The next section describes an even more powerful way of composing models — through probabilistic programming.”
- Discusses probabilistic programming languages (such as Church)
- Very flexible, but computationally very expensive; inference is often built on MCMC
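To make that trade-off concrete: a probabilistic program is just ordinary code with random choices, and a generic inference engine conditions its outputs on data. A minimal sketch (not Church itself — a made-up two-component Gaussian mixture, with inference by likelihood weighting, one of the simplest generic methods):

```python
import math
import random

# The "program": a latent coin picks one of two Gaussian components.
def sample_latent(rng):
    return rng.random() < 0.3            # prior P(component = True) = 0.3

def log_likelihood(component, x):
    mu = 5.0 if component else 0.0       # component means (unit variance)
    return -0.5 * (x - mu) ** 2

# Generic inference by likelihood weighting: run the program many times and
# weight each run by how well its latent choices explain the observation.
def infer_component(observed_x, n=50000, seed=0):
    rng = random.Random(seed)
    w_true = w_total = 0.0
    for _ in range(n):
        component = sample_latent(rng)
        w = math.exp(log_likelihood(component, observed_x))
        w_total += w
        if component:
            w_true += w
    return w_true / w_total              # posterior P(component | observed_x)

p = infer_component(4.0)
```

Note the 50,000 program executions to estimate a single one-bit posterior: a small illustration of why generic inference over arbitrary programs is expensive.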

- Bayesian Optimization (like GP-UCB)
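A minimal sketch of the GP-UCB idea: fit a GP to the evaluations so far, then query next wherever the upper confidence bound mu + beta * sigma is highest. The kernel, beta, and the toy objective below are all arbitrary choices for illustration:

```python
import numpy as np

def rbf(a, b, ell=0.5):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def gp_ucb_next(X, y, grid, beta=2.0, noise=1e-4):
    """Pick the next query point: argmax of GP posterior mean + beta * std."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, grid)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    ucb = mu + beta * np.sqrt(np.maximum(var, 0.0))
    return grid[np.argmax(ucb)]

# Toy "expensive" black-box objective with its maximum at x = 0.7.
f = lambda x: -(x - 0.7) ** 2
grid = np.linspace(0.0, 1.0, 201)
X = np.array([0.0, 1.0])             # two initial evaluations
y = f(X)
for _ in range(15):                  # each loop iteration is one function call
    x_next = gp_ucb_next(X, y, grid)
    X = np.append(X, x_next)
    y = np.append(y, f(x_next))
best_x = X[np.argmax(y)]
```

The acquisition rule trades exploration (high sigma) against exploitation (high mu), which is why so few evaluations of the objective are needed.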
- Compression
- Compression and probabilistic modelling are really the same thing (shannon)
- Better model allows more compression
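Shannon's source-coding theorem makes the two notes above precise: a symbol with model probability p can be coded in -log2 p bits, so a string's ideal code length is its negative log-probability under the model, and a better model yields a shorter code. A small sketch (the string and the two models are arbitrary):

```python
import math
from collections import Counter

def ideal_code_length_bits(text, probs):
    """Shannon ideal code length: -sum over symbols of log2 p(symbol)."""
    return -sum(math.log2(probs[c]) for c in text)

text = "abracadabra"
# Model 1: uniform over the 5 distinct symbols in the string.
uniform = {c: 1.0 / 5 for c in set(text)}
# Model 2: empirical symbol frequencies — a better model of this text.
empirical = {c: n / len(text) for c, n in Counter(text).items()}

bits_uniform = ideal_code_length_bits(text, uniform)      # ~ 25.5 bits
bits_empirical = ideal_code_length_bits(text, empirical)  # ~ 22.4 bits
```

The gap between the two code lengths is exactly the model-quality gap, measured in bits.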

- Best compression algorithms are equivalent to Bayesian nonparametric methods
- Bayesian methods to make an “automatic statistician” (scientific model discovery)
- The central challenge in the field is computational complexity, although “Modern inference methods have made it possible to scale to millions of data points, making probabilistic methods computationally competitive with conventional methods”