Right, so I totally forgot about writing these, so it was good you sent out the email.
First I suppose let me write about my thoughts on the NYAS conference.
I didn’t feel like I got much out of the two pre lunch talks – the first was too geared towards biological research and the second a bit too theoretical. I enjoyed the student talks, all pretty short and sweet. Aside from the work that came from our lab, I found the work that leveraged the information bottleneck method pretty interesting, but didn’t have time to talk to the author because there was always somebody asking questions during the poster session.
I enjoyed both the after lunch talks, and I thought Robert Kleinberg’s was the best talk of the day. It seems like he was totally theory driven in the design of the algorithm – since there were no actual results presented it seemed to me that this was never even actually implemented? Since it was based on UCB it was also related to UCT I suppose.
I’m not sure if much of it (being anything in the entire conference) was really directly applicable to Bayesian RL, but there was definitely a lot of good perspectives around, and I found it very worthwhile to see what else is going on generally in the machine learning field right now. Seemed like a pretty diverse group of work (with clumps on SVMs and graph manipulation).
Ok, that being said, time to move on to what I’ve been working on this week.
So, I suppose there is good news and bad news.
The good news is, that if you recall when we met on Wednesday I said I had that idea about forming regression estimates on demand by sampling points closest to some query point, and only using those local points to form a local model in that small region, the point of which was to get around the O(n^3) cost with GPs. I had thought that there was another paper that did this already with GPs, but as you suggested once I read the paper, they actually did something different. It seems to be yet another paper that discusses cutting up the state space and using different GPs in different regions.
So thats the good news.
The bad news is that when I got some nice juicy data sets and hacked up the algorithm, the results weren’t spectacular. My laptop has 2 gigs of ram, and using weka’s GP implementation I can generally run GP on 1 gig of ram with data sets up to about 2,000 points. So, the way I set up the “algorithm to beat” was to randomly subsample down to 2,000 points (the data sets that I was using that were larger than that but I’ll get there later), and then do GP regression the classic way with just those 2,000 points to estimate the function.
The new algorithm was to hold onto all the points, and to generate new function estimators for every query, using the training points closest to the query point to do the regression with GP. When figuring out how many points to use for the local model, I decided that since sorting the training points by distance from a query point is o(n lg n), I should allow myself that much time to run the GP alogrithm, so I sampled down to (n lg n)^(1/3) points. Seemed reasonable to me.
Right, so that was the setup. So far, I ran bot the algorithm on two data sets. The first data sat came from Rui Camacho, and is data he generated based on flight dynamics of the F-16 (Lockheed!) That originally had about 8,000 training points and another 8,000 test points, but I shufled things around so only 10% of the points were used for testing. This left me with an estimate to use 60 points to generate the local GPs.
In this domain, the errors were as follows:
Classic mean absolute error: 0.0017
Classic mean squared error: 5.5210E-6
Local GP mean absolute error: 0.0030
Local GP mean squared error: 1.9811 E-5
So in terms of absolute error, my proposed algorithm is almost twice as bad as subsampling. I found another data set of comprable size (about 20,000 points total, as opposed to 16,000 total) in Pace, R. Kelley and Ronald Barry’s housing data (I think it was listed as UCI). This resulted in using 65 local points per query.
Here, the results are as follows:
Classic mean absolute error: 0.4868
Classic mean squared error: 0.4054
Local GP mean absolute error: 0.8015
Local GP mean squared error: 0.9548
So the results relatively are pretty similar to the first experiment, and not great in my opinion.
Of course, we can start playing with the number of samples we use for local regression – hey, its just an order estimate, so constant factors are ok, right? Tripling the number of points used in the second domain (brings it to 200) changes the accuracy to:
Local GP mean absolute error: 0.5679
Local GP mean squared error: 0.5214
Which is close to (but still worse than) just subsampling. The time it takes to run also skyrockets…
So I think whats really needed is some truly enormous data set to hopefully demonstrate this approach is useful. Any ideas? Another approach is to just see how this does in comparison to other less sophisticated regression techniques, but I think that would be a little less satisfying.
With another data set (another housing set from DELVE) results were a bit more interesting. When using 100 local points, absolute error was worse, but squared error was better:
Subsampled: Mean absolute error = 21912.826023529196
Subsampled: Mean squared error = 1.6453186474742997E9
Local GP: Mean absolute error = 2817.647136007485
Local GP: Mean squared error = 2.2057028603722206E8