- First algorithm to beat human performance on the Labeled Faces in the Wild (LFW) dataset
- This has traditionally been a difficult problem for a few reasons:
- Often algorithms try to use different face datasets to help training, but the faces in different datasets come from different distributions
- On the other hand, relying on only one dataset can lead to overfitting
- So it is necessary to be able to learn from multiple datasets with different distributions and generalize appropriately

- Most algorithms for face recognition fall into two categories:
- Extracting low-level features (through manually designed approaches, such as SIFT)
- Classification models (such as NNs)

- “Since most existing methods require some assumptions to be made about the structures of the data, they cannot work well when the assumptions are not valid. Moreover, due to the existence of the assumptions, it is hard to capture the intrinsic structures of data using these methods.”
- GaussianFace is based on the *Discriminative Gaussian Process Latent Variable Model* (DGPLVM)
- The algorithm is extended to work with multiple data sources
- From the perspective of information theory, this constraint aims to maximize the mutual information between the distributions of target-domain data and multiple source-domains data.

- Because GPs are in the class of Bayesian nonparametrics, they require less tuning
- The paper introduces optimizations that let GPs scale to large datasets
- Model functions both as:
- Binary classifier
- Feature extractor
- “In the former mode, given a pair of face images, we can directly compute the posterior likelihood for each class to make a prediction. In the latter mode, our model can automatically extract high-dimensional features for each pair of face images, and then feed them to a classifier to make the final decision.”

- Earlier work on this dataset used the Fisher vector, which is derived from a Gaussian Mixture Model
- <I wonder if its possible to use multi-task learning to work on both the video and kinematic data? Multi-task learning with GPs existed before this paper>
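The Fisher vector mentioned above can be sketched as gradients of a GMM's log-likelihood. A minimal version under my own simplifications (means-only part of the encoding, random stand-in descriptors; the real pipeline uses dense local descriptors such as SIFT and also includes weight and variance gradients):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # stand-in for local descriptors (e.g. dense SIFT)

gmm = GaussianMixture(n_components=3, covariance_type="diag", random_state=0).fit(X)

def fisher_vector_means(desc, gmm):
    """Means-only Fisher vector: normalized gradient of the GMM
    log-likelihood w.r.t. each component mean."""
    gamma = gmm.predict_proba(desc)                       # N x K responsibilities
    diff = (desc[:, None, :] - gmm.means_[None]) / np.sqrt(gmm.covariances_[None])
    fv = (gamma[..., None] * diff).sum(0)                 # K x D
    fv /= desc.shape[0] * np.sqrt(gmm.weights_)[:, None]  # Fisher normalization
    return fv.ravel()

fv = fisher_vector_means(X, gmm)
print(fv.shape)  # one D-dim gradient per component: (K*D,) = (6,)
```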
- Other work used conv nets to map faces seen from different perspectives and under different lighting to a canonical representation; another approach explicitly models the face in 3D and also uses NNs, but these require careful engineering to get right
- “hyper-parameters [of GPs] can be learned from data automatically without using model selection methods such as cross validation, thereby avoiding the high computational cost.”
- GPs are also robust to overfitting
- “The principle of GP clustering is based on the key observation that the variances of predictive values are smaller in dense areas and larger in sparse areas. The variances can be employed as a good estimate of the support of a probability density function, where each separate support domain can be considered as a cluster…Another good property of Equation (7) is that it does not depend on the labels, which means it can be applied to the unlabeled data.”
- <I would say this is more of a heuristic than an observation, but I could see how it is a useful assumption to work from>
- Basically it just works from the density of the samples in the domain
- <Oh I guess I knew this already>
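The variance observation is easy to see in a toy example (my setup, using sklearn with a fixed RBF kernel, not the paper's model): the GP's predictive standard deviation is small inside a dense cluster of training points and large in a sparse region.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
# A dense cluster of training inputs near x=0, plus two sparse points at 4 and 6
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 1)), [[4.0], [6.0]]])
y = np.sin(X).ravel()

# Fixed kernel (optimizer=None) to keep this a pure illustration
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                              alpha=1e-6, optimizer=None).fit(X, y)

# Predictive std at a point in the dense cluster vs. one in the sparse region
_, std_dense = gp.predict(np.array([[0.0]]), return_std=True)
_, std_sparse = gp.predict(np.array([[5.0]]), return_std=True)
print(std_dense[0], std_sparse[0])  # smaller std where the data is dense
```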

- “The Gaussian Process Latent Variable Model (GPLVM) can be interpreted as a Gaussian process mapping from a low dimensional latent space to a high dimensional data set, where the locale of the points in latent space is determined by maximizing the Gaussian process likelihood with respect to Z [the datapoints in their latent space].”
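A toy GPLVM sketch under my own simplifications (RBF kernel with fixed hyperparameters, generic L-BFGS rather than the optimizers used in practice): the latent positions Z are found by maximizing the GP marginal likelihood of the observed high-dimensional data X.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, D, q = 30, 5, 2          # N points, D observed dims, q latent dims
X = rng.normal(size=(N, D))

def kern(Z):
    # RBF kernel over latent positions, with jitter for stability
    d2 = ((Z[:, None] - Z[None]) ** 2).sum(-1)
    return np.exp(-0.5 * d2) + 1e-6 * np.eye(len(Z))

def neg_log_lik(z_flat):
    # GP marginal likelihood with D independent output dimensions:
    # -log p(X|Z) = D/2 log|K| + 1/2 tr(K^{-1} X X^T) + const
    Z = z_flat.reshape(N, q)
    K = kern(Z)
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * D * logdet + 0.5 * np.trace(np.linalg.solve(K, X @ X.T))

z0 = rng.normal(size=N * q)
res = minimize(neg_log_lik, z0, method="L-BFGS-B", options={"maxiter": 50})
Z_latent = res.x.reshape(N, q)  # optimized latent positions
```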
- “The DGPLVM is an extension of GPLVM, where the discriminative prior is placed over the latent positions, rather than a simple spherical Gaussian prior. The DGPLVM uses the discriminative prior to encourage latent positions of the same class to be close and those of different classes to be far”
- “In this paper, however, we focus on the covariance function rather than the latent positions.”
- “The covariance matrix obtained by DGPLVM is discriminative and more flexible than the one used in conventional GPs for classification (GPC), since they are learned based on a discriminative criterion, and more degrees of freedom are estimated than conventional kernel hyper-parameters”
- “From an asymmetric multi-task learning perspective, the tasks should be allowed to share common hyper-parameters of the covariance function. Moreover, from an information theory perspective, the information cost between target task and multiple source tasks should be minimized. A natural way to quantify the information cost is to use the mutual entropy, because it is the measure of the mutual dependence of two distributions”
- There is a weighting parameter that controls how much the other datasets contribute
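For the quoted mutual-entropy idea, a toy computation (my illustration with a small discrete joint distribution, not the paper's actual objective between target and source domains): mutual information is zero iff the two variables are independent, and grows with their dependence.

```python
import numpy as np

# Toy 2x2 joint distribution with correlated variables (made-up numbers)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of x
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of y

# I(X;Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) )
mi = (p_xy * np.log(p_xy / (p_x * p_y))).sum()
print(round(mi, 3))  # positive, since the variables are dependent
```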
- Optimize with scaled conjugate gradient
- Use anchor graphs to avoid directly inverting the large covariance matrix
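Anchor-graph details aside, the generic trick is a low-rank (Nyström-style) approximation of the big kernel matrix through a small set of anchor points, so the inverse only needs an m×m solve via the Woodbury identity. A sketch under my own assumptions (grid-chosen anchors, plain Nyström; the paper's construction differs):

```python
import numpy as np

def rbf(A, B, ls=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 2))

# Anchor points on a coarse grid (a simplification of the anchor-graph idea)
g = np.linspace(-2.0, 2.0, 6)
anchors = np.array([[a, b] for a in g for b in g])  # 36 anchors
m = len(anchors)

Knm = rbf(X, anchors)                                # n x m cross-covariance
Kmm = rbf(anchors, anchors) + 1e-8 * np.eye(m)       # jitter for stability
sigma2 = 0.1

# Low-rank approximation K ~= Knm Kmm^{-1} Knm^T, then invert (K + sigma2*I)
# with the Woodbury identity: only an m x m solve, O(n m^2) instead of O(n^3)
inner = Kmm + (Knm.T @ Knm) / sigma2
woodbury = np.eye(n) / sigma2 - Knm @ np.linalg.solve(inner, Knm.T) / sigma2 ** 2

# Sanity check against the direct n x n inverse
direct = np.linalg.inv(Knm @ np.linalg.solve(Kmm, Knm.T) + sigma2 * np.eye(n))
print(np.abs(direct - woodbury).max())  # agreement up to numerical error
```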
- “For classification, our model can be regarded as an approach to learn a covariance function for GPC”
- <Not following the explanation for how it is used as a feature generator, I think it has to do with how close a point is to cluster centers>
- Other traditional methods work well here (like SVM, boosting, linear regression), but not as well as GP <Is this vanilla versions or on the GP features?>
- Works better as a feature extractor than other methods like k-means, tree, GMM
- DeepFace was the next-best method
- It is only half-fair to say this beats human performance, because human performance is better in the non-cropped scenario, and this is in the cropped scenario.
- <My guess is that in the non-cropped scenario, machine performance conversely degrades even though human performance increases>

- Performance could be further increased but memory is an issue, so better forms of sparsification for the large covariance matrix would be a win