When we want to learn from data (machine learning), it is common practice to reserve a "hold-out set" (or "test set") from your original dataset for final evaluation of your "out-of-sample" accuracy. When we evaluate our models, we often compare and contrast in-sample and out-of-sample validation scores to quantify overfitting.
However, there is a school of thought that argues that every time model parameters are updated in response to the hold-out set, you slowly leak information from the hold-out set into the training process, and eventually you will learn the noise of the hold-out set.
There is another, more sinister incarnation of this information-leakage phenomenon that can happen in your pre-processing step: data scaling. Many models, such as linear models, SVMs, or neural nets, are sensitive to the scale of the input data, so it is often recommended to scale your data. For example, scale each attribute of the input vector X to [0, 1] or [-1, +1], or standardize it to have mean 0 and variance 1. Note that you must apply the same scaling to the test set for meaningful results. In scikit-learn, you can use StandardScaler for standardization.
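As a minimal sketch of standardization (using a toy feature matrix, not a real dataset), StandardScaler learns a per-column mean and standard deviation and rescales each column to mean 0 and variance 1:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean ~0 and standard deviation ~1.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

The scaler stores the learned statistics in scaler.mean_ and scaler.scale_, which is what lets you later apply the identical transformation to new data.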
However, if you scale and then split your train/test set, you have already leaked information. For example, don't do this:
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Scaling the full dataset before splitting...
scaler = StandardScaler()
scaled = scaler.fit_transform(iris.data)

# ...leaks data from the test set into the training set here!
X_train, X_test, Y_train, Y_test = train_test_split(
    scaled, iris.target, test_size=0.4, random_state=4)
This model may have decent in-sample and out-of-sample validation results, but when you put it into production you might notice it doesn't perform according to your evaluation metrics. Why didn't the good test performance carry over to the new data? In this case, there is a simple explanation: information leakage. Although you were careful enough to set aside a test set that was not used for training, the above approach is flawed in a subtle way. When the original data was normalized to zero mean and unit variance, all of the data was involved in this step. Therefore, the test data had already contributed to the choices made by the learning algorithm, by contributing to the values of the mean and the variance used in normalization.
Normalization was not the bad idea. It was the involvement of the test data in the same normalization step that contaminated the training data.
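To make the contamination concrete, here is a small sketch (on synthetic random data, not a real dataset) comparing a scaler fit on everything against one fit on the training rows only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
X_train, X_test = train_test_split(X, test_size=0.4, random_state=4)

# Leaky: statistics computed over all rows, including the test rows.
leaky = StandardScaler().fit(X)
# Clean: statistics computed over the training rows only.
clean = StandardScaler().fit(X_train)

# The two scalers learn different means, so the leaky version has
# "seen" the test data before any model is ever trained.
print(np.abs(leaky.mean_ - clean.mean_))
```

The difference is small on a dataset this size, but it is exactly the channel through which the test set influences everything downstream.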
Instead, fit the scaler on your training set only and apply the same transformation to the test set:
scaler = StandardScaler()
# Don't cheat - fit only on training data
scaler.fit(X_train)
X_train = scaler.transform(X_train)
# apply same transformation to test data
X_test = scaler.transform(X_test)
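One way to make this pattern the default (a sketch using scikit-learn's Pipeline, with an SVM chosen arbitrarily as the model) is to bundle the scaler and the estimator together. cross_val_score then re-fits the scaler on the training portion of every fold, so the held-out fold never influences the scaling statistics:

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

iris = load_iris()

# The scaler is fit inside each fold, on that fold's training rows only.
pipeline = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(pipeline, iris.data, iris.target, cv=5)
print(scores.mean())
```

The design point is that the pipeline treats "scale, then fit" as a single estimator, so it becomes impossible to accidentally fit the scaler on data the model is later evaluated on.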
If you want an unbiased assessment of your learning performance, you should keep a test set in a vault and never use it for learning in any way.