This notebook is also hosted at:
This article will look at the benefits of using Principle Component Analysis to reduce the dimensionality of our datasesets. We will show that high dimensional datasets can most often times be expressed using only a few dimensions and that reducing dimensionality can make our datasets easier to work with and decrease the training time of our models.
PCA Example on the MNIST Dataset
The MNIST dataset is composed of 28x28 pixel images of handwriten digits. Typically when training machine learning models, MNIST is vectorized by considering all the pixels as a 784 dimensional vector, or in other words, putting all the 28x28 pixels next to each other in an 1x784 array. Not many datasets are greater than 784D (dimensions), and if you have ever worked with high dimensional datasets before we know that training time and prediction time are proportional to the size of the dataset and the number of features (dimensionality).
Before we go a further we should have a small discussion about dimensionalty and why it’s important. In simplistic terms, it is just the number of columns in the dataset, but it has significant downstream effects on the eventual models. At the extremes, the concept of the “curse of dimensionality” (wiki) discusses that in high-dimensional spaces some things just stop working properly. Even in relatively low dimensional problems, a dataset with more dimensions requires more parameters for the model to understand, and that means more rows to reliably learn those parameters. If the number of rows in the dataset is fixed, addition of extra dimensions without adding more information for the models to learn from can have a detrimental effect on the eventual model accuracy.
Let’s take a look at how long it takes to train a Gaussian Classifier on the MNIST dataset:
Train Using Gaussian Classifier
import loader as support #support library to read mnist files into memory import gaussian_classifier as gf import time %pylab inline
Populating the interactive namespace from numpy and matplotlib
X_train, Y_train = support.loadmnist('data/train-images-idx3-ubyte', 'data/train-labels-idx1-ubyte') X_test, Y_test = support.loadmnist('data/t10k-images-idx3-ubyte', 'data/t10k-labels-idx1-ubyte')
As you can see the we have 60,000 training examples with 784 features. Let’s see how long it takes to train and predict on this high dimensional dataset:
clf = gf.GaussianClassifier(c=3400) start = time.time() clf.fit(X_train, Y_train) Y = clf.predict(X_test) errors = (Y_test != Y).sum();total = X_test.shape print("Error rate:\t %d/%d = %f" % ((errors,total,(errors/float(total))))) end = time.time() duration = end-start duration
c=3400, cov_algo=numpy Error rate: 435/10000 = 0.043500 58.36238384246826
On my modest hardware (2016 macbook pro) it took about 1 minute (54 s) to train and predict with a 4.3% error rate – pretty good!
Now let’s experiment and see if we really need all those dimensions???
PCA on MNIST
If we take all of our training examples and look at the variance of each feature, we can build a covariance matrix that will give us a quantifiable way to show variance across examples in a feature.
cov = numpy.cov(X_train.T)
array([[ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.], ..., [ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.]])
Once we have a convariance matrix, we can compute the eigenvalues and eigenvectors of a square array $\Sigma$ or
cov. These are the so called principal components in PCA:
evals, evecs = numpy.linalg.eig(cov) evals = np.float64(evals) evecs = np.float64(evecs)
Each Eigenvalue represents the variance of that feature. By definition, the total variation is given by the sum of the variances. It turns out that this is also equal to the sum of the eigenvalues of the covariance matrix. Thus, the total variation is:
total_variation = np.sum(evals) np.float64(total_variation)
As we can see from the output above, the total variance of our dataset is:
3428502.5747802104. Now lets look at our eigenvalues as a fraction of the overall variance they should equal up to 1:
explained_variance_ratio =  explained_variance =  for i in range(0,X_train.shape): explained_variance.append(evals[i]) variance = evals[i] / total_variation explained_variance_ratio.append(variance) explained_variance_ratio print sum(explained_variance_ratio) cumulative_explained = cumsum(explained_variance_ratio) plot(cumulative_explained); grid()
This means that each feature has a variance that contributes to the overall variance. Is variance equally distributed among features? Are some features more expressive of the dataset and some not contributing any information? If so, then we can remove those features form our training set.
plt.figure(1, figsize=(8, 6)) plt.clf() plt.axes([.2, .2, .7, .7]) plt.plot(explained_variance_ratio, linewidth=2) plt.axis('tight') plt.xlabel('n_components') plt.ylabel('explained_variance_') grid()
The above figure shows how much variance each feature contributes to the whole. The x axis represents our Eigenvalue/vector pairs which correspond to a feature in the original dataset. We can see that after about 150 vectors, the subsequent vectors don’t contribute much to the whole. We can probably cut out the last 634 vectors w/o losing much accuracy with our classifier while at the same time decreasing training time!
Just to quantify the statement above: If we only use the first 150 vectors in our PCA analysis, we will capture 94% of the overall variance of our dataset and only lose about %5. Let’s check if the we get a speed boost and not too much of an accuracy loss on our classifier
Apply PCA Transform
The output of a PCA process is a system of linear combinations of the data that we will use to transform our original dataset to the reduced dimensional dataset:
from sklearn.decomposition import PCA pca = PCA(n_components=150) pca.fit(X_train)
At this point, we have the same Eigenvalues and Eigenvectors that we generated ourselves in the previous steps. However, now we only selected the top 150 and Scikit calls them
To do the transform to the reduced dimensionality, scikit will do the dot product of our Eigenvectors (components) and our input matrix:
X_transformed = fast_dot(X, self.components_.T)
So applying the PCA transformation is simply a linear combination – very fast. Now we will apply the projection to the training set and any new data that we want to tests against in the future.
X_train_150 = pca.transform(X_train) X_test_150 = pca.transform(X_test)
As you can see, we now have a 150 feature training set w/ only useful information. Lets benchmark it against the classification job we did pre-PCA:
clf = gf.GaussianClassifier(c=1) start = time.time() clf.fit(X_train_150, Y_train) Y = clf.predict(X_test_150) errors = (Y_test != Y).sum(); total = X_test_150.shape print("Error rate:\t %d/%d = %f" % ((errors, total, (errors/float(total))))) end = time.time() duration = end-start duration
c=1, cov_algo=numpy Error rate: 497/10000 = 0.049700 3.105672836303711
Our classifier finished in about 3 seconds and has an error rate of 4.9%. Thus we experienced a 94% decrease in training time and only a .6% loss in accuracy!
|Training Time (s)||Error Rate %|
So as you can see, PCA has some pretty compelling results when applied to machine learning tasks. Reduced dimensionality leads to faster training!
PCA for Data Exploration
PCA is also helpful when trying to visualize data, take the scikit Iris dataset as an example. It’s 4 dimensional training set. If we do PCA on it reduce it to 2D we can now visualize the data:
from sklearn import datasets iris = datasets.load_iris() X = iris.data y = iris.target target_names = iris.target_names pca = PCA(n_components=2) X_r = pca.fit(X).transform(X) # Percentage of variance explained for each components print('explained variance ratio (first two components): %s' % str(pca.explained_variance_ratio_)) plt.figure(2, figsize=(8, 6)) for c, i, target_name in zip("rgb", [0, 1, 2], target_names): plt.scatter(X_r[y == i, 0], X_r[y == i, 1], c=c, label=target_name) plt.legend() plt.title('PCA of IRIS dataset') plt.show()
explained variance ratio (first two components): [ 0.92461621 0.05301557]