Feature Selection

Alex Egg,

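There are a few automated techniques for feature selection in machine learning. The ones covered below are subset selection, shrinkage (L2 ridge and L1 lasso), tree models, and dimensionality reduction.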

Each of these methods reduces your feature set w/ the aim of optimizing an evaluation metric. This article mostly discusses subset selection algorithms.

Subset Selection

These methods usually start w/ either all features or no features and systematically remove or add features, respectively, in order to decrease the validation set error or improve the cross-validation (CV) score.

Best Subset Selection

For Best Subset Selection we train $2^p$ models and evaluate each of them on the validation set, where $p$ is the number of predictors/features available. That is, we fit all $p$ models that contain exactly one predictor, all $\binom{p}{2}=p(p-1)/2$ models that contain exactly two predictors, and so forth. Then we choose the model w/ the best validation set score.

However, as this method is computationally inefficient (the number of models doubles w/ every added predictor), the two stepwise methods below are preferred.
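The enumeration can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `score` is a hypothetical callback that would fit a model on the given features and return a validation score (higher is better), and `toy_score` is an invented stand-in so the example runs without any data.

```python
from itertools import combinations


def best_subset(features, score):
    """Score every subset of `features`; return (best_subset, n_models).

    `score` is a hypothetical callback: it takes a tuple of feature names
    and returns a validation score, where higher is better.
    """
    best, best_score, n_models = (), score(()), 1  # the empty model counts too
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):  # all C(p, k) models of size k
            n_models += 1
            s = score(subset)
            if s > best_score:
                best, best_score = subset, s
    return best, n_models


def toy_score(subset):
    """Invented scorer: "a" and "c" help, every feature costs 1 point."""
    useful = {"a": 30, "c": 20}
    return sum(useful.get(f, 0) for f in subset) - len(subset)


chosen, n_models = best_subset(["a", "b", "c", "d"], toy_score)
print(chosen, n_models)
```

W/ $p=4$ features the loop scores $2^4=16$ models, which is why the approach stops being practical as $p$ grows.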

Forward Stepwise Selection

While Best Subset above considers $2^p$ possible models containing subsets of the $p$ predictors, forward stepwise considers a much smaller set of models. Forward stepwise selection begins w/ a model w/ no predictors, and then adds predictors one-at-a-time until all the predictors are present. At each step it measures which predictor contributes the most to the validation score and keeps that predictor. Then at the end of the $p$ steps it chooses the model w/ the highest validation score.

Compared to best subset, this method only fits $1+p(p+1)/2$ models: the null model plus $p-k$ candidates at each step $k=0,\dots,p-1$. For $p=20$ that is 211 models instead of 1,048,576.
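A rough sketch of the greedy loop, under the same assumptions as any toy example: `score` is a hypothetical callback returning a validation score (higher is better), and `toy_score` is invented for illustration only.

```python
def forward_stepwise(features, score):
    """Greedy forward selection; returns (best_subset, n_models).

    `score` is a hypothetical callback mapping a tuple of feature names
    to a validation score, where higher is better.
    """
    selected, remaining = [], list(features)
    best, best_score = (), score(())  # the null model: 1 fit
    n_models = 1
    while remaining:
        # Try adding each remaining feature to the current model.
        candidates = [(score(tuple(selected + [f])), f) for f in remaining]
        n_models += len(candidates)
        step_score, step_feature = max(candidates)
        selected.append(step_feature)
        remaining.remove(step_feature)
        # Remember the best model seen across all p steps.
        if step_score > best_score:
            best, best_score = tuple(selected), step_score
    return best, n_models


def toy_score(subset):
    """Invented scorer: "a" and "c" help, every feature costs 1 point."""
    useful = {"a": 30, "c": 20}
    return sum(useful.get(f, 0) for f in subset) - len(subset)


p = 4
chosen, n_models = forward_stepwise(["a", "b", "c", "d"], toy_score)
print(chosen, n_models)
```

Here the model count matches $1+p(p+1)/2 = 11$ for $p=4$. Note that the greedy path is not guaranteed to find the same subset as the exhaustive search in general; it happens to do so on this toy scorer.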

Backward Stepwise Selection

Backward stepwise selection is the opposite of forward stepwise selection: it starts w/ all $p$ predictors and removes the least useful predictor one-at-a-time.
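The backward variant can be sketched the same way. As before, `score` is a hypothetical validation-score callback (higher is better) and `toy_score` is an invented stand-in, so treat this as an illustration rather than a definitive implementation.

```python
def backward_stepwise(features, score):
    """Greedy backward elimination; returns the best subset seen.

    `score` is a hypothetical callback mapping a tuple of feature names
    to a validation score, where higher is better.
    """
    selected = list(features)
    best, best_score = tuple(selected), score(tuple(selected))  # full model
    while selected:
        # Try dropping each feature; keep the drop that leaves the best score,
        # i.e. remove the least useful predictor.
        candidates = [
            (score(tuple(g for g in selected if g != f)), f) for f in selected
        ]
        step_score, dropped = max(candidates)
        selected.remove(dropped)
        if step_score > best_score:
            best, best_score = tuple(selected), step_score
    return best


def toy_score(subset):
    """Invented scorer: "a" and "c" help, every feature costs 1 point."""
    useful = {"a": 30, "c": 20}
    return sum(useful.get(f, 0) for f in subset) - len(subset)


chosen = backward_stepwise(["a", "b", "c", "d"], toy_score)
print(chosen)
```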

Hybrid Approaches

A hybrid approach is to incrementally add variables to the model, but after adding each new variable, also remove any variables that no longer provide an improvement in the model fit. This attempts to mimic best subset selection while retaining the computational advantages of forward and backward stepwise selection.

Scikit Learn

There is currently a Pull Request in scikit to add the stepwise feature selection functionality:



L2 Ridge

Ridge shrinks all the coefficients towards zero, but it will not set any of them exactly to zero. Shrinking the coefficient estimates can significantly reduce their variance.
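To see the shrinkage numerically, here is a minimal sketch for the simplest possible case: one predictor and no intercept, where the ridge estimate has a closed form. The data and the penalty values are made up for illustration.

```python
def ridge_coefficient(x, y, lam):
    """Closed-form ridge estimate for one predictor and no intercept.

    Minimizes sum((y_i - b * x_i) ** 2) + lam * b ** 2 over b.
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]  # toy data: y = 2x exactly

penalties = [0.0, 1.0, 10.0, 100.0]
coefs = [ridge_coefficient(x, y, lam) for lam in penalties]
for lam, b in zip(penalties, coefs):
    print(lam, b)
```

The coefficient gets smaller as the penalty grows, but for any finite penalty it stays non-zero, so ridge on its own does not drop features.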

L1 Norm/Lasso

Linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero. When the goal is to reduce the dimensionality of the data to use with another classifier, you can keep only the features w/ non-zero coefficients and drop the rest. The Lasso is the usual choice of L1-penalized model for regression.
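Why the L1 penalty produces exact zeros can be illustrated in a special case. Assuming orthonormal predictor columns and the objective $\frac{1}{2}\lVert y - X\beta\rVert^2 + \lambda\lVert\beta\rVert_1$, the lasso solution is the soft-thresholded least-squares solution. The least-squares coefficients and the value of `lam` below are invented for the sketch.

```python
def soft_threshold(b, lam):
    """Lasso solution for one coefficient under an orthonormal design."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0  # small coefficients are set exactly to zero


# Hypothetical least-squares coefficients for four features.
ols = {"x1": 3.0, "x2": -0.4, "x3": 0.1, "x4": -2.5}
lam = 0.5

lasso = {name: soft_threshold(b, lam) for name, b in ols.items()}
selected = [name for name, b in lasso.items() if b != 0.0]
print(lasso)
print(selected)
```

The surviving names in `selected` are the features you would pass on to the downstream model.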

Tree Models

Decision Tree

W/ trees you just have to think about how the splits are made: each split is chosen to reduce an impurity criterion such as entropy or Gini, so the total reduction attributable to a feature gives you an importance measure.

From the scikit-learn docs:

The feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance [R245].
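The "reduction of the criterion" can be sketched for a single split. This is only an illustration of the idea on invented labels, not scikit-learn's implementation, which sums the weighted reductions over every node of the tree and then normalizes them.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_decrease(labels, goes_left):
    """Reduction in Gini impurity from splitting `labels` by a boolean mask."""
    left = [y for y, m in zip(labels, goes_left) if m]
    right = [y for y, m in zip(labels, goes_left) if not m]
    n = len(labels)
    return (
        gini(labels)
        - len(left) / n * gini(left)
        - len(right) / n * gini(right)
    )


labels = [1, 1, 1, 0, 0, 0]
# Hypothetical splits: feature A separates the classes, feature B barely helps.
split_a = [True, True, True, False, False, False]
split_b = [True, False, True, False, True, False]

dec_a = impurity_decrease(labels, split_a)
dec_b = impurity_decrease(labels, split_b)
print(dec_a, dec_b)
```

A feature that produces the larger decrease would be ranked as more important under this criterion.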

Dimensionality Reduction

These methods transform the original predictors into a new, smaller set of features. This is typically done using Principal Component Analysis, in which you can choose the top K new features w/ the highest variance.
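As a rough sketch of the "highest variance" idea, the snippet below finds the first principal component (K = 1) of a tiny invented dataset using power iteration on the covariance matrix. In practice you would use a linear algebra library; the plain-Python version is only meant to make the steps visible.

```python
import math


def top_principal_component(rows, iterations=200):
    """Return (direction, variance_along_it, total_variance) for `rows`."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration converges to the eigenvector w/ the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    total = sum(cov[i][i] for i in range(d))
    return v, variance, total


# Invented data: two features that are almost perfectly correlated.
rows = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]]
direction, variance, total = top_principal_component(rows)
print(direction, variance / total)
```

On this data a single new feature captures nearly all of the variance of the two original ones, which is the situation in which dropping the remaining components costs little.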

Deep learning for feature selection

Related to feature selection is feature engineering in which we transform the original predictors of our data into more meaningful representations.

Concrete examples of feature engineering often come up in the hello world of machine learning: the Titanic dataset. This dataset is often first encountered on Kaggle, and novices quickly learn that the key to getting a high score on this challenge is feature engineering. For example:

None of the above features were explicitly in the dataset originally; we used domain knowledge to expose them. One argument that François Chollet makes in his new book “Deep Learning with Python” is that w/ deep learning, the need to create these high-level representations by hand is no longer so important.

Chollet characterizes machine learning problems as efforts to find structure in data and then learn from that structure. He distinguishes two types of machine learning problems: structured and perceptual. Structured problems are classic ML problems, which can be contrasted w/ perceptual problems such as vision or translation, at which deep learning seems to excel.

He goes on to describe the process of feature engineering in shallow learning, in which the practitioner attempts to build structure out of the raw data features, and then describes how this task is less involved with deep learning.

Since the SVM is a “shallow” method, applying SVM to perceptual problems requires first extracting useful representations manually (a step called “Feature engineering”), which is difficult and brittle. [1]

Then he goes on to describe the benefits of deep learning w/ regard to feature engineering:

Deep learning is also making problem-solving much easier, because it completely automates what used to be the most crucial step in a machine learning workflow: feature engineering. [2]

He also describes how the layers of a deep network automatically model the representations that we used to build by hand:

As such, humans had to go to great lengths to make the initial input data more amenable to processing by these methods, i.e. they had to manually engineer good layers of representations for their data. This is what is called “feature engineering”. Deep learning, on the other hand, completely automates this step: with deep learning, you learn all features in one pass rather than having to engineer them. This has greatly simplified machine learning workflows, often replacing very complicated multi-stage pipelines with a single, simple end-to-end deep learning model. [2]

In general, deep learning greatly reduces the need for manual feature engineering, at least on perceptual problems.

Convolutional Neural Nets

Another interesting application of deep learning for feature engineering involves CNNs. CNNs learn high-level representations of images that are more descriptive than raw pixel data. Typically a CNN architecture has a softmax layer at the end to label the images, but one technique is to remove that final layer and output the features describing the images for external use. He & McAuley do this in Visual Bayesian Personalized Ranking (VBPR) [3], where they use CNN features of Amazon product images as inputs to a Collaborative Filtering Recommender System, aimed at mitigating the cold start problem. Egg, Nagaraj, Remigio, Hesami and McAuley continue this work with CNN image features in HBPR [4], adding extra heuristics to boost the cold start ranking.

The expressiveness of the CNN representation of the images can be visualized by projecting the original 4096-dimensional CNN features down to 2 dimensions.

Figure 1: 4096-dimensional CNN image features projected into 2 dimensions using t-SNE.

From figure 1 above, you can see that there is a logical clustering of the images which is encoded by the CNN. You can see a live demo of this here: http://sharknado.eggie5.com/tsne


  1. Francois Cho…, “Deep Learning with Python”
  2. Francois Cho…, “Deep Learning with Python” section 1.2.6
  3. https://arxiv.org/abs/1510.01784
  4. https://sharknado.eggie5.com



Last edited by Alex Egg, 2017-06-26 05:42:29