First algorithm to beat human performance on the Labeled Faces in the Wild dataset
This has traditionally been a difficult problem for a few reasons:
Often algorithms try to use different face datasets to help training, but the faces in different datasets come from different distributions
On the other hand, relying on only one dataset can lead to overfitting
So it is necessary to be able to learn from multiple datasets with different distributions and generalize appropriately
Most algorithms for face recognition fall into two categories:
Extracting low-level features (through manually designed approaches, such as SIFT)
Classification models (such as NNs)
“Since most existing methods require some assumptions to be made about the structures of the data, they cannot work well when the assumptions are not valid. Moreover, due to the existence of the assumptions, it is hard to capture the intrinsic structures of data using these methods.”
GaussianFace is based on the Discriminative Gaussian Process Latent Variable Model (DGPLVM)
The algorithm is extended to work from multiple data sources
From the perspective of information theory, this constraint aims to maximize the mutual information between the distributions of target-domain data and multiple source-domains data.
Because GPs are in the class of Bayesian nonparametrics, they require less tuning
There are optimizations made to allow GPs to scale up for large data sets
The model functions in two modes: as a binary classifier and as a feature extractor:
“In the former mode, given a pair of face images, we can directly compute the posterior likelihood for each class to make a prediction. In the latter mode, our model can automatically extract high-dimensional features for each pair of face images, and then feed them to a classifier to make the final decision.”
Earlier work on this dataset used the Fisher vector, which is derived from a Gaussian Mixture Model
<I wonder if it’s possible to use multi-task learning to work on both the video and kinematic data? Multi-task learning with GPs existed before this paper>
Other work used conv nets to map faces seen from different perspectives and lightings to a canonical representation; another approach explicitly models the face in 3D and also uses NNs, but these require engineering to get right
“hyper-parameters [of GPs] can be learned from data automatically without using model selection methods such as cross validation, thereby avoiding the high computational cost.”
GPs are also robust to overfitting
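As a concrete illustration of that point (my own sketch using sklearn, not anything from the paper): the kernel hyper-parameters are fit by maximizing the log marginal likelihood, with no cross-validation loop.

```python
# Sketch (mine, with sklearn): GP hyper-parameters are fit by maximizing
# the log marginal likelihood -- no cross-validation needed.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(50, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(50)

# The initial length scale is just a guess; the optimizer tunes it to the data.
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gp = GaussianProcessRegressor(kernel=kernel, alpha=0.01,
                              n_restarts_optimizer=5, random_state=0)
gp.fit(X, y)
print(gp.kernel_)                          # fitted hyper-parameters
print(gp.log_marginal_likelihood_value_)   # the objective that was maximized
```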
“The principle of GP clustering is based on the key observation that the variances of predictive values are smaller in dense areas and larger in sparse areas. The variances can be employed as a good estimate of the support of a probability density function, where each separate support domain can be considered as a cluster…Another good property of Equation (7) is that it does not depend on the labels, which means it can be applied to the unlabeled data.”
<I would say this is more of a heuristic than an observation, but I could see how it is a useful assumption to work from>
Basically it just works from the density of the samples in the domain
<Oh I guess I knew this already>
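A quick sketch of that density observation (my own code, not the paper’s): with a fixed kernel, the predictive variance is low wherever training inputs are dense and high where they are sparse, and the labels never enter the variance at all.

```python
# Sketch (mine, not the paper's): with a fixed kernel, predictive variance
# is small where training inputs are dense, large where they are sparse,
# regardless of the labels.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.RandomState(0)
dense = rng.normal(0.0, 0.3, size=(80, 1))    # dense cluster near x = 0
sparse = rng.normal(5.0, 0.3, size=(5, 1))    # sparse cluster near x = 5
X = np.vstack([dense, sparse])
y = rng.randn(len(X))                         # arbitrary labels

gp = GaussianProcessRegressor(kernel=RBF(1.0), optimizer=None, alpha=0.1)
gp.fit(X, y)
_, std_dense = gp.predict(np.array([[0.0]]), return_std=True)
_, std_sparse = gp.predict(np.array([[5.0]]), return_std=True)
print(std_dense[0] < std_sparse[0])   # -> True: less variance in the dense area
```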
“The Gaussian Process Latent Variable Model (GPLVM) can be interpreted as a Gaussian process mapping from a low dimensional latent space to a high dimensional data set, where the locale of the points in latent space is determined by maximizing the Gaussian process likelihood with respect to Z [the datapoints in their latent space].”
“The DGPLVM is an extension of GPLVM, where the discriminative prior is placed over the latent positions, rather than a simple spherical Gaussian prior. The DGPLVM uses the discriminative prior to encourage latent positions of the same class to be close and those of different classes to be far”
“In this paper, however, we focus on the covariance function rather than the latent positions.”
“The covariance matrix obtained by DGPLVM is discriminative and more flexible than the one used in conventional GPs for classification (GPC), since they are learned based on a discriminative criterion, and more degrees of freedom are estimated than conventional kernel hyper-parameters”
“From an asymmetric multi-task learning perspective, the tasks should be allowed to share common hyper-parameters of the covariance function. Moreover, from an information theory perspective, the information cost between target task and multiple source tasks should be minimized. A natural way to quantify the information cost is to use the mutual entropy, because it is the measure of the mutual dependence of two distributions”
There is a weighting parameter that controls how much the other datasets contribute
Optimize with scaled conjugate gradient
Anchor graphs are used to avoid directly inverting the large covariance matrix
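I don’t have the paper’s exact anchor-graph construction, but the generic version of the trick is a Nyström-style low-rank approximation plus the Woodbury identity, so the only explicit inverse is over the small anchor set:

```python
# Generic version of the trick (Nystrom low-rank + Woodbury), NOT the paper's
# exact anchor-graph construction: approximate K with C W^-1 C^T built from
# m anchor points, so applying (s I + K)^-1 needs only an m x m solve.
import numpy as np

def rbf(A, B, ls=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

rng = np.random.RandomState(0)
X = rng.randn(1000, 3)                              # n = 1000 inputs
anchors = X[rng.choice(1000, 50, replace=False)]    # m = 50 anchors
C = rbf(X, anchors)                                 # n x m
W = rbf(anchors, anchors) + 1e-6 * np.eye(50)       # m x m (jittered)
s = 0.1                                             # noise term

# Woodbury: (s I + C W^-1 C^T)^-1 v = (v - C (s W + C^T C)^-1 C^T v) / s
v = rng.randn(1000)
u = (v - C @ np.linalg.solve(s * W + C.T @ C, C.T @ v)) / s

# Verify against the explicit n x n matrix
K_approx = C @ np.linalg.solve(W, C.T)
ok = np.allclose((s * np.eye(1000) + K_approx) @ u, v, atol=1e-5)
print(ok)
```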
“For classification, our model can be regarded as an approach to learn a covariance function for GPC”
<Not following the explanation for how it is used as a feature generator, I think it has to do with how close a point is to cluster centers>
Other traditional methods work well here (like SVM, boosting, linear regression), but not as well as GP <Is this vanilla versions or on the GP features?>
Works better as a feature extractor than other methods like k-means, tree, GMM
“DeepFace” was the next-best method
It is only half-fair to say this beats human performance, because human performance is better in the non-cropped scenario, and this is in the cropped scenario.
<My guess is that in the non-cropped scenario, machine performance conversely degrades even though human performance increases>
Performance could be further increased but memory is an issue, so better forms of sparsification for the large covariance matrix would be a win
[this is based on a quick 2nd read-through, so fewer notes than usual]
Aggregates over a set of kernel functions in order to do ranking and selection more effectively
The weighting scheme used is shown to be optimal under independent kernel estimators (need to find out what that means exactly)
Uses the knowledge gradient to select sampling points (roughly: use GPs, estimate the gradient of the bounds, and attempt to sample where the gradient is highest; a separate optimization algorithm is needed to find those locations)
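For intuition, here is the simpler independent-beliefs knowledge gradient (a cousin of the paper’s correlated-beliefs KGCB, not their algorithm): the KG score of each alternative is the expected gain in the best posterior mean from measuring it once, which is closed-form for normal beliefs.

```python
# Independent-beliefs knowledge gradient (simpler than the paper's
# correlated-beliefs KGCB; shown only to give the flavor).
import numpy as np
from scipy.stats import norm

def kg_scores(mu, sigma, noise_var):
    """mu, sigma: posterior mean/std per alternative; noise_var: obs noise."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    # std of the change in the posterior mean caused by one measurement
    sigma_tilde = sigma ** 2 / np.sqrt(sigma ** 2 + noise_var)
    scores = np.empty_like(mu)
    for i in range(len(mu)):
        best_other = np.max(np.delete(mu, i))
        z = -abs(mu[i] - best_other) / sigma_tilde[i]
        scores[i] = sigma_tilde[i] * (z * norm.cdf(z) + norm.pdf(z))
    return scores

mu = [1.0, 0.9, 0.2]        # current estimates
sigma = [0.1, 1.0, 0.1]     # alternative 1 is nearly as good but uncertain
s = kg_scores(mu, sigma, noise_var=0.5)
print(np.argmax(s))         # -> 1: measure the uncertain runner-up
```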
Proposed policy is shown to be asymptotically optimal
This paper is concerned with searching over a finite number of options (computationally, restricted to some thousands)
Similarity is measured by a kernel. Although there is no explicit use of Lipschitz smoothness or concavity, the kernel metrics must be Hölder continuous
In the offline setting the problem addressed is called ranking and selection; in the online setting it’s called the multi-armed bandit problem
References for earliest papers on stochastic search
Noise is assumed to be normally distributed
The policy is 1-step optimal, and converges to optimal in the limit
In their empirical results they use zero-mean 1D Gaussian processes. They test their method, random sampling, and another GP-based method (SKO), which attempts to find the length scale online; theirs just weights a set of provided kernels, finding the best one
For the differing kernels, they used linear fits with different bandwidths
Their method does very well, although with larger length scales SKO works better. For GPs that actually have very short length scales, KGCB is better.
Also, SKO is less general, and only works well on certain classes of functions (certain covariance functions cause trouble, for example). Seems to be a little more fragile.
I respect them for pointing out situations where SKO outperforms KGCB
In an energy storage/retrieval problem, their method outperforms SKO, although with larger amounts of noise the bounds begin to overlap. In this problem KGCB converged more quickly, while in some of the earlier results SKO converged more rapidly, so it depends on the domain.
“Although our policy performs very well in the numerical experiments, there is a caveat. Kernel estimation is known to suffer from the curse of dimensionality as the MSE proportional to h^d where h is the bandwidth and d is the number of dimensions. If observations lie in high dimensional spaces, non-parametric estimation is known to have poor performance. Because of these reasons, the efficiency of our estimation method also degenerates in higher dimensions. Additive models might be used to handle this curse but this requires making more assumptions on the structure of the function.”
This idea came to me the evening after our meeting on Wednesday. For some reason I was thinking about modeling the ghost movement in Pac-Man when I thought of it, but it seemed like it may be well suited to Pitfall! (and hopefully many other things as well).
Here’s the idea in a few sentences: given that you are observing state, action transitions of some MDP with no knowledge of the underlying MDP, how would you go about estimating the order of the MDP? Maybe it’s reasonable to try standard regression techniques: regress a number of functions and see which does best. By “a number of functions” I mean build each regressor to use a different amount of history (the most recent time steps), and then whichever regressor performs best tells us an estimated order of the MDP.
Yes, using past frames is definitely a way to make better predictions.
Cool – is this general idea already a well-known / explored one?
Using history to make up for missing state information? Sure.
Let’s go back to Pitfall! for a second – at a couple of meetings I was at, we were discussing how to model jumping, and the more we thought about it, the more complicated the idea got. As I recall, the basic thing we were saying was: if the emulator gives you one snapshot of the guy in the air, how do you know where he’ll be when he lands?
It seems to me like that’s a very restricted way of looking at the problem, and that if you look at not just the most recent snapshot but also a number of previous snapshots, it should be pretty simple to figure out where the trajectory is going to take Pitfall Harry when he lands. That’s because you can’t estimate velocity with just one sample – if you see Pitfall Harry in the air he could just as easily be jumping straight up or forward – there’s no way of knowing.
Right? So more generally, for some moving body, consider that you are given snapshots every tick, and the goal is to guess where it’s going to be next. The way I see things, given the data we have, position is an order-1 variable, but velocity is an order-2 variable. So given the most recent snapshot and the one before, we know both where it is and where it’s going. Would having a history of the past 10 snapshots help? Not really, if the dynamics are simple (I’m using this as a concrete example, but I think this is more generally applicable).
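A toy version of the snapshot argument: under constant-velocity dynamics, the last two snapshots determine the next position exactly, while a single snapshot says nothing about where the body is headed.

```python
# With constant-velocity dynamics, the last two snapshots determine the next
# position (x_{t+1} = 2 x_t - x_{t-1}); one snapshot can't give velocity.
import numpy as np

traj = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])   # (x, y) each tick
prev, cur = traj[0], traj[1]
vel = cur - prev             # the "order 2" information
pred = cur + vel             # = 2 * cur - prev
print(pred)                  # -> [2. 4.], exactly traj[2]
```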
So, I wanted to test the idea out, and since it seemed like it lent itself well to the Pitfall! work, I figured I’d hack up a simple physics engine and track the movement of a body whose dynamics are basically as follows:
It can “run” to the left or the right, doing so repeatedly will keep increasing velocity until some maximum velocity is reached.
There is friction, so if it’s going quickly to the right and tries to go left, it will continue going right, just slower. Exerting still more force to the left will eventually make it move left.
Simple gravity: it can “jump” if it is currently on the floor. While in the air, trying to run or jump does nothing.
There are walls, if it hits the wall it just stops there.
Cool. Although, since you hacked it up, it violates the RL3 prime directive. Nonetheless, it makes it easier to collect data fast at this stage, so I’m not objecting.
Fair enough – right, I did this more as a proof of concept; I’d be happy to try it on things not of my own doing in the future. I was thinking, aside from Pitfall!, stocks might also be an interesting application? Just an idea, could be anything.
I’d be more inclined to some other game. Stocks are a bit of a loaded concept right now.
Haha – fair enough.
I had the agent behave randomly, with the following behavior: Each tick go left, right, or jump each with 5% probability. This actually gives pretty good and varied movement. I logged the position and action at each time tick, where the action is the force exerted to the left, right, or up (even if it was in the air and that force did nothing). I have it all nicely animated to keep me entertained.
Once I had all this logged, I assembled GP regressors: one to predict the X position and one to predict the Y position. I made a number of them, from what I will call “order” 0 to order 5 (although I have a feeling I’m abusing this term). Anyway, the value representation for each order goes as follows:
Order 0: you know nothing
Order 1: current values of [x, y, force_x, force_y]
Order 2: everything in order 1, and also those values for the step preceding what’s in order 1
Order 3: everything in order 2, and also those values for the step preceding what’s in order 2
So, how much data do you need to most accurately predict where the agent will be next? Clearly in order 0 you are hosed. As I explained above, order 1 is pretty tough, because you don’t have any idea of velocity (but you do know your change in velocity from the action), and order 2 gives you what you need. My prediction was that eventually more isn’t better, so there should be some bowl-shaped mean error function over order: too little isn’t good because there’s no way to do the estimates, but too much is also bad because your training points may not have anything close to what you just saw (imagine an order-50 regressor, good luck finding a similar training point), making estimation tough.
Or “data sparsity”, more generally.
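Here’s a minimal stand-in for the whole setup (synthetic order-2 dynamics and ridge regression instead of my physics engine and GPs, so every name here is illustrative): fit one regressor per history length and compare held-out errors.

```python
# Minimal stand-in for the experiment: generate an order-2 process, fit one
# regressor per history length k, and compare test errors. Ridge regression
# stands in for any regressor -- the argument doesn't depend on GPs.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
n = 2000
x = np.zeros(n)
x[1] = rng.randn()
for t in range(2, n):        # true dynamics need exactly two steps of history
    x[t] = 1.6 * x[t - 1] - 0.8 * x[t - 2] + 0.05 * rng.randn()

def lagged(x, k):            # rows: [x_{t-1}, ..., x_{t-k}] -> target x_t
    cols = [x[k - 1 - i : len(x) - 1 - i] for i in range(k)]
    return np.column_stack(cols), x[k:]

errors = {}
for k in range(1, 6):
    Xk, yk = lagged(x, k)
    cut = len(yk) // 2       # first half train, second half test
    model = Ridge(alpha=1e-3).fit(Xk[:cut], yk[:cut])
    errors[k] = np.abs(model.predict(Xk[cut:]) - yk[cut:]).mean()
print(errors)  # error drops sharply at k = 2 (the true order), then flattens
```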
So, here’s the show me the money section:
The x component is red, the y component is green. Both the train set and the test set were 2000 samples large.
mean absolute error, x-axis is order of regressor, y is error
mean squared error, x-axis is order, y-axis is error^2
Ok, that order-0 regressor sucks. Let’s re-plot starting at order 1 so we can actually see what’s going on.
mean absolute error, order 1 to 5
mean squared error, order 1 to 5
Now let’s make it pop by summing both x and y errors:
x + y mean absolute error
x + y mean squared error
So that pretty clearly shows that order 2 is the best way to go in terms of both mean absolute error and MSE. I claim success.
And to tie it up, the variance:
mean GP variance over sample set
Which I interpret as: the higher the order, the more difficult it is to find similar points, which makes sense because the state space is exploding. This explains why you can’t just throw the kitchen sink at the problem, call every MDP an order-1000 MDP, and have it just work. With an order too small you can’t do the estimate, but with an order too high the variance just causes lots of trouble. That is evident in the apparent superlinear growth of error as the actual order of the MDP is exceeded.
Great job—really interesting insight, intriguing results. Do you really think the GP part is helping here, or would any learning algorithm have done more or less the same thing?
Thank you very much. I originally did think that GP would be special for this, and that I would be able to use the variance to plot these curves instead of the error, but that was due to a poor understanding of GPs on my part. It turns out the variance has nothing to do with the variability of the value you are trying to estimate, but rather just the distance of the query point from what it’s already seen, which is why the variances for the x and y GPs are identical, even though the x value is more difficult to predict than the y value (because gravity makes the agent spend most of the time on the floor).
I’ll try this out with other regression algorithms just to confirm.
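A quick check of that variance point (a sklearn sketch with hyper-parameters held fixed): two GPs trained on the same inputs but different targets report exactly the same predictive variance.

```python
# With fixed hyper-parameters, a GP's predictive variance depends only on
# the inputs, so two GPs trained on the same X but different targets agree.
# (If the hyper-parameters were optimized per target, they could differ.)
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.RandomState(1)
X = rng.uniform(-3, 3, size=(40, 1))
y_hard = np.sin(3 * X).ravel() + rng.randn(40)   # noisy, hard to predict
y_easy = np.zeros(40)                            # trivial to predict

def make_gp():
    return GaussianProcessRegressor(kernel=RBF(1.0), optimizer=None, alpha=1e-2)

gp_hard = make_gp().fit(X, y_hard)
gp_easy = make_gp().fit(X, y_easy)

Xq = np.linspace(-4, 4, 9).reshape(-1, 1)
_, std_hard = gp_hard.predict(Xq, return_std=True)
_, std_easy = gp_easy.predict(Xq, return_std=True)
print(np.allclose(std_hard, std_easy))   # -> True
```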
I left a comment on the page suggesting that maybe it’s time to get some data from Pitfall! and try running on that. It might also be worth comparing to some other approach so we can see if GP is responsible for the success or if you just set up the problem so well that any method would succeed.
I think both would be great. I mentioned the general idea to Carlos before I had results and he seemed interested to see what the results would be. I’ll try some other applications in domains I didn’t craft up and see how things go.
I’m wondering if we should try again to form a Pitfall! team and really do something dramatic…
I would be totally pumped. I think I’m going to work on the Pitfall! logs next and see if it finds that Harry’s location is not useful to help predictions (in simulation for now, but hopefully I can get the emulator running, or at least some data, and then try it there).
I’d think Carlos or Andre could give you traces from the crocodiles or jumping or vine swinging—something with non-Markovian behavior you could try to model.
But aren’t these things Markovian if represented by an MC of high enough order? For example, if I just look at the crocodiles now, I can’t really figure much of anything out, but given enough history I should be able to estimate almost perfectly how the mouths open and close – I just need the order of the MC to be high enough to capture one period of the mouth opening and closing.
Sorry, I was speaking informally. I just meant fully observable, Markov order 0.
Yes, there are several levels at work here, although I’m not sure quite what to call them:
* non-Markov, non-stationary: transitions can’t be modeled using “states”.
* infinite-dimensional POMDP: A weird class that can be modeled by PSRs but probably not learnable or useful.
* finite-state POMDP: Can be modeled by an underlying Markov process that is not completely observable.
* finite-order Markov: Special kind of POMDP where the most recent k observations can be used as state.
* Markov, completely observable: Most recent observation can be used as state (k=1).
Something else I think is cool is that if you can learn these models, John and Istvan are thinking about how to do the planning part. We could glue all the pieces together and do something really neat.
That sounds excellent.
I should also point out that I believe this will solve other problems in Pitfall!, such as the ability to figure out that log movement isn’t conditional on player movement, and that scorpion movement is. It’s basically just a technique that says which variables help predict a function and which don’t. Additionally, I thought the ability of the GP to report variance at a point would help, but it really doesn’t, since it’s the same for both the x- and y-trained GPs.
For this experiment, I used WEKA’s implementation of Gaussian Processes to determine its performance in the Robocode domain. Initially, I was hopeful that performance would be good, as it would be possible to model the mechanics of the game more accurately. Specifically, I could now model exactly how much energy was lost when the agent was hit, and exactly how much damage was done when a shot hit the enemy (since Robocode allows for shots of varying power). Performance was actually poor. When including the impact of missed shots (energy lost when shooting that is not regained), performance was surprisingly even poorer (but I won’t say any more about that experiment here, as the results were similar enough).
A large issue is the cost (in terms of both time and memory) of the GP algorithm. For these experiments I used WEKA’s implementation, which is most likely a pretty good one. Even still, on my machine (2 GB RAM, 2 cores), it could only handle a maximum of 2,500 samples before running out of memory, and training took about 15 minutes between epochs. The J48 decision tree algorithm used for the other experiments, on the other hand, could easily handle tens of thousands of sample points.
Since the GP algorithm cannot handle a large number of sample points (I suppose it’s the n^3 cost of the matrix operations), I had to shorten the epochs used in these experiments. In the experiments with the J48 algorithm the epochs were 1,000 rounds; here I had to reduce that to 100 rounds, because 1,000 rounds would produce more points than could even be used, which would mean much of the learning couldn’t be accomplished after the first epoch.
I ran for a total of 7 epochs, because after that most of the data collected couldn’t be used anyway (more than 2,500 sample points per advisor). When there were more than 2,500 samples for an advisor, I randomly discarded samples until only 2,500 remained.
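The discarding step is just uniform subsampling; something like this sketch (names are illustrative, not from my actual experiment code):

```python
# Cap each advisor's sample set at 2,500 points by uniform random
# subsampling. Names here are illustrative, not from the experiment code.
import random

def cap_samples(samples, limit=2500, seed=0):
    if len(samples) <= limit:
        return list(samples)
    return random.Random(seed).sample(list(samples), limit)

advisor_samples = {"WallBot": list(range(4000)), "TrackFire": list(range(1200))}
capped = {name: cap_samples(s) for name, s in advisor_samples.items()}
print({name: len(s) for name, s in capped.items()})
# -> {'WallBot': 2500, 'TrackFire': 1200}
```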
Below are the graphs of the values WEKA’s GP algorithm found based on x, y coordinates and advisor. The values below are the result of training data for the final epoch.
(Note that the range of values on each graph is different, even though the color range is the same for all graphs.)
Using these values, the following policy is found:
As you can see, the policy found here looks somewhat “right,” but has some flaws. The general characteristic is to use WallBot in the middle of the stage and then transition to TrackFire closer to the walls, which is good. There are, however, some problems which turn out to be quite significant. First, there is a relatively large area where SpinBot is the preferred advisor, and in this domain SpinBot should not be used at all. Second, the TrackFire sweet spot is a very thin linear band along each wall. As you can see, the regions where GP prefers TrackFire are somewhat curved, and tend to bulge in the corners. This leads to the policy generally transitioning either to the wrong advisor (SpinBot), or transitioning to the “correct” advisor (TrackFire) too early (outside of the sweet spot), which leads to very poor performance.
Basically, it was unable to find the sweet spot.
A plot of the performance metrics over epochs is below:
What seems to me to be the problem with this approach is its “fuzziness.” Based on the coloring of the value plots, it is clear there is a slow transition of values between nearby points: nearby points have similar values. The problem in this domain is that points which are very close in terms of their Cartesian coordinates can actually have very different characteristics. Also, the boundaries found here should be linear, but GP finds boundaries which are curved, which is again a manifestation of nearby points having similar scores (and is also perhaps evident in the “bulging” of TrackFire in the corners, as TrackFire goodness from both walls combines there).
Specifically, the difference between being in, or outside of the sweet spot is literally a few pixels. If “goodness” from a location in the sweet spot bleeds outside of the sweet spot, it causes the agent to transition from, say WallBot to TrackFire too far away from the wall.
Oddly, one of the “problems” with classic decision trees, that they only draw axis-aligned boundaries, actually happens to be the correct way to develop a policy in this domain. In other, more “well behaved” problems this may turn out to be a deficiency in decision trees and an advantage for GP; the opposite is the case here, however.
The other seemingly large problem was the efficiency of the algorithm, as mentioned above. If it could learn from more sample points, it might be possible to learn the correct policy. By the 7th epoch there were already more samples taken than could be used, and this was only the result of 700 rounds of training. For the experiments with J48, on the other hand, 700 rounds was only 7/10 of the way through the first epoch.
Playing with the parameters may also help the problem. For example, reducing the “smoothness” of the value functions for each advisor may allow for a better policy, but based on these results it seems that at best the performance will be equivalent to that found with decision trees.
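The “smoothness” knob corresponds to the kernel length scale; here is a hypothetical sketch of the effect: with a shorter RBF length scale a GP can hug a sharp step like the sweet-spot boundary, at the usual risk of overfitting elsewhere.

```python
# Sketch of the "smoothness" knob: a shorter RBF length scale lets the GP
# follow a sharp step (like the thin sweet spot) that a longer length
# scale smears out.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.linspace(0, 10, 200).reshape(-1, 1)
y = (X.ravel() > 8.0).astype(float)      # sharp boundary at x = 8

def fit_error(length_scale):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale),
                                  optimizer=None, alpha=1e-3)
    gp.fit(X, y)
    return np.abs(gp.predict(X) - y).mean()

print(fit_error(2.0), fit_error(0.1))    # the short length scale hugs the step
```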