Sparse modeling for high-dimensional data. Bin Yu. ITA 2011 Tutorial Video

1. Structure:
1. V1 through fMRI
2. Occam’s Razor and Lasso
3. Unified theory: M-estimation with decomposable regularization
4. Learning about V1 through sparse models
5. World-imaging through sparse modeling and human experiment
2. fMRI – can we decode what a person was looking at by examining fMRI signals?
1. This was done previously in a classification version of the task where there was a set of images <100?> and the goal was to figure out which of them were being looked at
3. She did a database of 10k images
4. From fMRI data, get feature vector that is 10921D
5. Goal: in order to help understanding of V1, need to develop a sparse model that performs accurate prediction
1. Minimizing L2 loss leads to both an ill-posed computational problem and poor prediction
6. Worked with a lab that in 2006 tried neural nets, SVMs, and then settled on Lasso
1. Had consistency problems with NNs, lasso more stable
7. Fisher in the 1920s promoted maximum likelihood (ML) methods (turns Bayes posterior with uniform prior into something likelihood based)
1. BUT maximum likelihood with least squares always favors the largest model: least squares is a projection, so a bigger model space leaves a smaller residual (lower training mean squared error), which leads to the problem of poor prediction power
8. <Under assumptions?> there are 2^D possible models <size of hypothesis space>.  Finding the best is intractable, but on the other hand, due to noise and tiny size of data w/respect to 2^D it doesn’t make sense to do exhaustive search anyway (as it will not give you the right answer).  Therefore, we have a good reason to use tractable, but suboptimal methods
9. The Akaike information criterion (AIC) is an L0 penalty/regularization (from the 70s)
1. Schwarz came up with the Bayesian Information Criterion (BIC), which is also L0
10. L1 penalization was introduced in the 90s: Chen and Donoho (basis pursuit) and Tibshirani (Lasso).
11. Properties of Lasso:
1. Sparsity and regularization
2. Convex relaxation of L0 penalty
12. Lasso happens to use L2 loss, but you can use anything in general of course.  Also Lasso happens to do L1 regularization, but can use others for that as well (L2 is ridge).
13. They propose a nonlinear system that works better for the fMRI problem than the linear one (although they are in the same ballpark)
14. <Ok, bailing>
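
A minimal sketch (not from the talk) of the points above about least squares versus Lasso when there are more features than samples. Everything here is an assumption made for illustration: the synthetic data, the sizes `n`, `p`, `k`, the penalty `lam`, and the helper names `soft_threshold`, `lasso_ista`, and `demo`. The Lasso is solved with plain proximal gradient descent (ISTA), which is just one of several possible solvers, and the least squares fit is the minimum-norm solution from the pseudoinverse.

```python
import numpy as np


def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_ista(X, y, lam, n_iter=2000):
    """Minimize (1/(2n)) * ||y - X b||_2^2 + lam * ||b||_1 with ISTA."""
    n, p = X.shape
    # Step size 1/L, where L is the largest eigenvalue of X^T X / n.
    step = n / np.linalg.norm(X, 2) ** 2
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n          # gradient of the L2 loss term
        b = soft_threshold(b - step * grad, step * lam)
    return b


def mse(b, X, y):
    """Mean squared prediction error of coefficients b on (X, y)."""
    return float(np.mean((y - X @ b) ** 2))


def demo(seed=0, n=50, p=200, k=5, lam=0.2, noise=0.5):
    """Compare minimum-norm least squares and Lasso on synthetic sparse data."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(p)
    beta[:k] = 3.0                            # only k of the p features matter
    X = rng.standard_normal((n, p))
    y = X @ beta + noise * rng.standard_normal(n)
    X_new = rng.standard_normal((1000, p))    # fresh data to measure prediction
    y_new = X_new @ beta + noise * rng.standard_normal(1000)

    # With p > n the least squares problem has infinitely many minimizers;
    # the pseudoinverse picks the one with the smallest L2 norm.
    b_ls = np.linalg.pinv(X) @ y
    b_lasso = lasso_ista(X, y, lam)

    return {
        "ls_train_mse": mse(b_ls, X, y),
        "ls_test_mse": mse(b_ls, X_new, y_new),
        "lasso_train_mse": mse(b_lasso, X, y),
        "lasso_test_mse": mse(b_lasso, X_new, y_new),
        "ls_nonzeros": int(np.sum(np.abs(b_ls) > 1e-10)),
        "lasso_nonzeros": int(np.sum(np.abs(b_lasso) > 1e-10)),
    }


if __name__ == "__main__":
    for name, value in demo().items():
        print(f"{name}: {value}")
```

In this setup the least squares fit should drive the training error to essentially zero while using every feature, and the Lasso should trade a little training error for a much sparser model that predicts better on new data. Replacing the L1 penalty with a squared L2 penalty in the same objective would give ridge regression, which shrinks coefficients but does not set them exactly to zero.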