Feature Selection

Alex Egg,

There are a few automated techniques for feature selection in machine learning:

• Subset selection
• Shrinkage
• Dimensionally Reduction

Each of these methods will reduce your feature set down to achieve an optimal evaluation metric. This article will mostly discuss Subset Selection algorithms.

Subset Selection

These methods usually either start w/ all features or no features and systematically remove or add respectively in order to decrease the validation set error or the CV score.

Best Subset Selection

For Best Subset Selection we train and evaluate on the validation set $2^p$ models where $p$ is the number of predictors/features available. That is we fit all $p$ models that contain exactly: one predictor, all $\binom{p}{2}=p(p-1)/2$ models that contain exactly two predictors, and so forth. Then we choose the model w/ the best validation set score.

However, as this method is computational inefficient the below two stepwise methods are preferred.

Forward Stepwise Selection

While Best Subset above considers $2^p$ possible models content subsets of the $p$ predictors, forward stepwise condors a much smaller set of models. Forward stepwise selected begins w/ a model w/ no predictors, and then one-at-anime adds predictors until all the predictors are present. At each step it measures which predictor contributes the most the validation score and keeps that predictor. Then at the end of the $p$ steps it chooses the model w/ the highest validation score.

Compared to best subset, this model only takes $1+p(p+1)/2$ comparisons.

Backward Stepwise Selection

Backward stepwise selection is the opposite of Forward Stepwise selection in that it starts w/ all $p$ predictors and one-at-a-time removes the least useful predictor.

Hybrid Approches

A hybrid approach is to incrementally add variables to the model, but after adding cheese wen variable, also you should remove any variable that tho longer provide any improvement in they model fit. This attempts to mimic best subset selection while retaining the computation advantages of forward an backward stepwise selection.

Scikit Learn

There is currently a Pull Request in scikit to add the stepwise feature selection functionality:

https://github.com/scikit-learn/scikit-learn/pull/8684

Shrinkage

L2 Ridge

Shrinks all the coefficients towards zero, but ti will not set any of them exactly to zero. Shrinking the coefficients estimate can significantly reduce their variance.

L1 Norm/Lasso

Linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero. When the goal is to reduce the dimensionality of the data to use with another classifier, they can be used along with Lasso for regression.

Tree Models

Decision Tree

W/ tree’s you just have to think about how the splits are made — you calc the entropy of the feature, therefore you have some sort of importance measure.

from sickit docs:

The feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.

Dimensionality Reduction

These methods transform the original predictors into into a new subset. This is typically one using Principal Component Analysis in which you can choose the top K new features w/ the highest variance.