- How do you select a subset of *k* variables from a large set in order to get the best linear prediction of another variable?
- To be clear, it's about choosing dimensions of the data, as opposed to selecting a subset of whole data points

- Related to feature selection and sparse approximation
- Analyzes performance of commonly used, empirically effective greedy heuristics
- Analysis based on submodularity and spectral analysis

- Introduces the “… *submodularity ratio* as a key quantity to help understand why greedy algorithms perform well even when the variables are highly correlated.”
- Gets the best approximation guarantees in terms of both the submodularity ratio and the “… smallest *k*-sparse eigenvalue of the covariance matrix.”
- Also gets better bounds on the dictionary selection problem <not sure what that is>
- Test on real-world as well as synthetic data
- Results show submodularity ratio is a better predictor of performance of greedy algorithms than other spectral parameters

- Commonly, after subset selection is performed, the goal is to minimize MSE or maximize the squared multiple correlation R^{2} <whats that? sounds neat>. Here they focus on the latter
- Selection criteria are based on covariance matrices
- Maximizing R^{2} is “… equivalent to the problem of *sparse approximation* over dictionary vectors…”
- The formulation is similar to that of *sparse recovery*, except there the assumption is that the data is actually *k*-sparse, but in many cases that assumption doesn’t hold and the data is actually dense
- **The problem as specified is NP-Hard, so heuristics are the only way to go**
- Common heuristics are greedy methods or those based on convex relaxation. Here they focus on the former, as those methods are generally simpler and require less tuning
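Since R^{2} is the objective everywhere below, here is a minimal NumPy sketch of computing it for a candidate subset *S* of variables (my own illustration of the standard definition, not code from the paper):

```python
import numpy as np

def r_squared(X, y, S):
    """R^2 of y against the columns of X indexed by S: the fraction of
    the variance of y explained by a least-squares fit on those columns."""
    S = list(S)
    if not S:
        return 0.0
    Xc = X - X.mean(axis=0)          # center so no intercept term is needed
    yc = y - y.mean()
    beta, *_ = np.linalg.lstsq(Xc[:, S], yc, rcond=None)
    resid = yc - Xc[:, S] @ beta
    return 1.0 - (resid @ resid) / (yc @ yc)
```

Note R^{2} is monotone in *S* (adding a column can only reduce the residual), which is what makes greedy selection sensible here.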
- Of the greedy methods, the most common are *Forward Regression* and *Orthogonal Matching Pursuit*
- Previous theoretical work on these greedy methods has been lacking in applicability
- “We prove that whenever the submodularity ratio is bounded away from 0, the R^{2} objective is ‘reasonably close’ to submodular, and Forward Regression gives a constant-factor approximation.”
- Although they mention issues with spectral methods in this context (“… greedy algorithms perform very well, even for near-singular input matrices.”), the submodularity ratio is related to spectral analysis:
    - The submodularity ratio “… is always lower-bounded by the smallest *k*-sparse eigenvalue of the covariance matrix [but can be much larger].”
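For concreteness on one of the two greedy methods: a rough sketch of Orthogonal Matching Pursuit under the usual setup (centered data, unit-norm columns) — my own NumPy illustration, not the paper's code. Each round picks the column most correlated with the current residual, then re-fits least squares on everything picked so far:

```python
import numpy as np

def omp(X, y, k):
    """Orthogonal Matching Pursuit sketch: greedily pick the column most
    correlated with the current residual, then re-fit least squares on
    all columns picked so far."""
    Xc = X - X.mean(axis=0)
    Xc = Xc / np.linalg.norm(Xc, axis=0)   # unit-norm columns
    yc = y - y.mean()
    selected, resid = [], yc.copy()
    for _ in range(k):
        scores = np.abs(Xc.T @ resid)
        scores[selected] = -np.inf          # never re-pick a column
        selected.append(int(np.argmax(scores)))
        beta, *_ = np.linalg.lstsq(Xc[:, selected], yc, rcond=None)
        resid = yc - Xc[:, selected] @ beta
    return selected
```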

- They also get bounds for the two greedy methods when the smallest *k*-sparse eigenvalue is non-zero
- In comparing performance as related to the submodularity ratio vs spectral analysis: “… while the input covariance matrices are close to singular, the submodularity ratio actually turns out to be significantly larger. Thus our theoretical results can begin to explain why, in many instances, greedy algorithms perform well in spite of the fact that the data may have high correlations.”
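The smallest *k*-sparse eigenvalue is just the minimum, over all size-*k* principal submatrices of the covariance matrix, of the smallest eigenvalue — a brute-force sketch of that definition (my own illustration; exponential in *k*, so feasible only at the small *n* used in the experiments):

```python
import numpy as np
from itertools import combinations

def smallest_k_sparse_eigenvalue(C, k):
    """lambda_min(C, k): the minimum, over all k-subsets S of variables,
    of the smallest eigenvalue of the principal submatrix C[S, S]."""
    n = C.shape[0]
    best = np.inf
    for S in combinations(range(n), k):
        sub = C[np.ix_(S, S)]
        best = min(best, np.linalg.eigvalsh(sub)[0])  # eigvalsh: ascending
    return best
```

On a highly correlated pair (off-diagonal 0.9) this is 0.1 for k = 2 — near-singular, even though greedy selection may still do well, which is the gap the submodularity ratio is meant to capture.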
- R^{2} describes the fraction of the variance of the desired output values *Y* that is explained by the variables in the *X* inputs in the corpus
- Forward Regression is the natural greedy method, where the variable added at each step is the one that most increases R^{2} immediately
- They give a bound on how much the R^{2} value of a set and the sum of the R^{2} values of its elements can differ, which is used in the proof of Forward Regression’s performance
- <Skipping actual proofs, onto empirical results>
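That greedy loop can be sketched in a few lines of NumPy (my own illustration of the idea, not the paper's implementation):

```python
import numpy as np

def forward_regression(X, y, k):
    """Forward Regression sketch: at each step add the variable giving
    the largest immediate increase in R^2 of the selected set."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()

    def r2(S):
        beta, *_ = np.linalg.lstsq(Xc[:, S], yc, rcond=None)
        resid = yc - Xc[:, S] @ beta
        return 1.0 - (resid @ resid) / (yc @ yc)

    selected = []
    for _ in range(k):
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        selected.append(max(candidates, key=lambda j: r2(selected + [j])))
    return selected
```

Each step re-solves a least-squares fit per candidate, which is more expensive than OMP's single correlation pass but tracks the R^{2} objective directly.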
- “Because several of the spectral parameters (as well as the optimum solution) are NP-hard to compute, we restrict our experiments to data sets with *n* < 30 features, from which *k* < 8 are to be selected. We stress that the greedy algorithms themselves are very efficient.”
- Data sets are World Bank development indicators, house prices in Boston w/relevant features, and then a synthetic set
- Compared to optimal (exhaustive) results, the greedy algorithms perform extremely well
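The exhaustive baseline presumably amounts to maximizing R^{2} over all (n choose k) subsets; a brute-force sketch of that (my own illustration, and the reason the experiments stop at n < 30, k < 8):

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, k):
    """Exhaustive search: the k-subset of columns with the highest R^2.
    Exponential in k, so only usable at small n."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()

    def r2(S):
        beta, *_ = np.linalg.lstsq(Xc[:, list(S)], yc, rcond=None)
        resid = yc - Xc[:, list(S)] @ beta
        return 1.0 - (resid @ resid) / (yc @ yc)

    return max(combinations(range(X.shape[1]), k), key=r2)
```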
- The greedy methods work better than Lasso
- The bound based on the submodularity ratio is much better than that of the spectral parameters
- While there is a fair amount of looseness between the theoretical bounds and actual performance (the theory is pessimistic), the gap between the two is not absurdly large
- They also include an extension to the theory that drastically tightens the bounds <why not put this in the theoretical section in the first place, as opposed to with the empirical results?>

- The real-world data sets are nearly singular