So, at our Wednesday meeting, we discussed that it's fairly obvious what you want to accomplish with dimensionality reduction in terms of classification, but that in terms of regression it was a little less clear what that means.

I think I’ve come up with at least a way to think about what the difference is, although I’m not positive it helps with the pitfall work.

Right, so in the case of classification, you want to pick the features that maximally maintain the distances between the classes. In my hastily drawn example above (two separate classes, even though they're the same color), it's of course very bad to throw away the X dimension, but throwing away Y doesn't really hurt you very much (on the sides I drew in what the projections onto each dimension look like).
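Here's a quick numpy sketch of that intuition, with a made-up stand-in for the hand-drawn figure (the class positions and spreads are my assumptions, not from the drawing): two point clouds that are well separated along X but overlap completely along Y. Measuring how far apart the 1-D projections are on each axis makes the "throw away Y, keep X" call obvious.

```python
import numpy as np

# Hypothetical stand-in for the hand-drawn example: two classes
# separated along X, fully overlapping along Y.
rng = np.random.default_rng(0)
class_a = rng.normal([0.0, 0.0], 0.5, size=(200, 2))
class_b = rng.normal([5.0, 0.0], 0.5, size=(200, 2))

# Separation of the 1-D projection onto each axis: distance between
# class means, in units of the pooled standard deviation.
separation = {}
for d, name in [(0, "X"), (1, "Y")]:
    gap = abs(class_a[:, d].mean() - class_b[:, d].mean())
    spread = np.sqrt((class_a[:, d].var() + class_b[:, d].var()) / 2)
    separation[name] = gap / spread
print(separation)  # X separation is large, Y separation is near zero
```

This mean-gap-over-spread ratio is essentially the one-dimensional Fisher criterion; projecting onto X keeps the classes far apart, projecting onto Y collapses them on top of each other.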

So that's easy to understand. It turns out regression is pretty easy to understand as well; it's just something I never bothered to think about before. Now, consider the following sinusoidal grating. Imagine that white = 1, black = 0, and the gre(a)ys are of course intermediate values.

Aha, so here it's obvious that the X dimension is what you care about, and that the Y dimension is irrelevant. If you were doing regression, you could throw away the Y dimension without any loss of accuracy.
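To make that concrete, here's a minimal sketch where I assume the grating is something like sin(x) rescaled into [0, 1] (the exact form is my guess, the point is only that y never enters the function). Dropping Y is then literally lossless:

```python
import numpy as np

# Assumed form of the grating: value depends only on x,
# rescaled so white = 1 and black = 0.
def grating(x, y):
    return 0.5 * (np.sin(x) + 1.0)  # y never appears

rng = np.random.default_rng(0)
x = rng.uniform(0, 2 * np.pi, 1000)
y = rng.uniform(0, 2 * np.pi, 1000)
z = grating(x, y)

# "Throwing away Y" costs nothing: predicting from x alone
# reproduces the function exactly.
z_from_x_only = grating(x, np.zeros_like(y))
print(np.allclose(z, z_from_x_only))  # True
```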

And just to complete the illustration, what would an instance where both the X and Y dimensions matter look like? Simply something like this:

In this example, the function varies along both the X and Y dimensions, so if we were doing regression to try and model this function, we would not want to throw away either dimension.

So what does this mean, does this give us a way to try and figure out which dimensions are important for regression? Perhaps. The first thing that comes to mind is that in Figure 2, the derivative of the function with respect to Y is 0 everywhere, whereas the derivative with respect to X varies with X. In Figure 3, the partial derivatives with respect to both X and Y vary continuously.
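A tiny symbolic check of that observation, assuming sin(x) for Figure 2 and sin(x)·sin(y) as a plausible stand-in for Figure 3 (neither form is stated anywhere, they're just the simplest functions matching the pictures):

```python
import sympy as sp

x, y = sp.symbols('x y')

# Figure 2 (assumed form): a grating varying only along x.
f2 = sp.sin(x)
print(sp.diff(f2, y))  # 0: the Y partial vanishes everywhere
print(sp.diff(f2, x))  # cos(x): the X partial varies with x

# Figure 3 (hypothetical form): variation along both axes.
f3 = sp.sin(x) * sp.sin(y)
print(sp.diff(f3, x))  # sin(y)*cos(x)
print(sp.diff(f3, y))  # sin(x)*cos(y)
```

So "the partial derivative is identically zero" is exactly the symbolic statement of "this dimension is safe to throw away."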

Maybe that's a way to think about it, at least in these very clear-cut, oversimplified examples, which also assume knowledge of the form of the function so that a partial derivative can be taken.

The first hacky idea that comes to mind, if the function isn't so simple and its form is unknown, would be to take many Monte Carlo samples of the function, and then from each of those randomly sampled points, vary X and Y independently by small amounts and see which direction produces the greater change. The dimension that causes less change in terms of the estimated derivative of the function would be the one to throw out first. I'm sure there are better ways to go about this, and it may or may not even be useful, but there you go.
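That hacky idea can be sketched in a few lines of numpy. This is just a finite-difference sensitivity estimate at random points (the function, bounds, and step size below are all my illustrative choices, not anything fixed by the discussion above):

```python
import numpy as np

def dimension_sensitivity(f, bounds, n_samples=2000, eps=1e-4, seed=0):
    """Monte Carlo sensitivity per input dimension of a black-box f.

    Draws random points inside `bounds`, estimates the partial
    derivative along each dimension by a central finite difference,
    and returns the mean absolute derivative per dimension. The
    dimension with the smallest score is the first candidate to drop.
    """
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T
    pts = rng.uniform(lo, hi, size=(n_samples, len(lo)))
    scores = np.zeros(len(lo))
    for d in range(len(lo)):
        step = np.zeros(len(lo))
        step[d] = eps
        # Central difference: (f(p + e_d) - f(p - e_d)) / (2 * eps)
        deriv = (f(pts + step) - f(pts - step)) / (2 * eps)
        scores[d] = np.mean(np.abs(deriv))
    return scores

# Try it on the Figure 2 grating (value depends on x only): the Y
# score should come out at zero while the X score does not.
f = lambda p: 0.5 * (np.sin(p[:, 0]) + 1.0)
scores = dimension_sensitivity(f, [(0, 2 * np.pi), (0, 2 * np.pi)])
print(scores)  # X score is clearly positive, Y score is 0
```

One caveat worth flagging: averaging absolute derivatives can miss dimensions that matter only through interactions or over larger length scales than `eps`, so this is a first-pass screen rather than a decision rule.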