If you have ever worked with financial data, you have probably seen an imbalanced dataset. When you deal with loan defaults or fraud data, the non-default and non-fraud examples far outnumber the rest.
We'll use fraud detection as the running case study below:
In this scenario, you want to build a classifier that identifies fraudulent transactions in credit card histories. Fortunately, most transactions are legitimate, so perhaps only 0.1% of the data consists of positive instances. This illustrates a general issue: for a large number of real-world problems, the number of positive examples is dwarfed by the number of negative examples (or vice versa).
Imbalanced data is a problem because machine learning algorithms are too smart for your own good. For most learning algorithms, if you give them data that is 99.9% negative and 0.1% positive, they will simply learn to always predict negative. Why? Because they are trying to minimize error, and they can achieve 0.1% error by doing nothing! If a teacher told you to study for an exam with 1000 true/false questions and only one of them is true, it is unlikely you will study very long.
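To make the "do nothing" baseline concrete, here is a tiny sketch using the 0.1% scenario from above (999 legitimate records and 1 fraud; the numbers are illustrative):

```python
# A "classifier" that always predicts the majority (negative) class.
# With 999 negatives and 1 positive, it is 99.9% accurate while
# never detecting a single fraudulent transaction.
labels = [0] * 999 + [1]           # 0 = legitimate, 1 = fraud
predictions = [0] * len(labels)    # always predict "legitimate"

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
print(accuracy)  # 0.999
```

This is exactly the trap: an impressive-looking accuracy score from a model that has learned nothing useful.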
Really, the problem is not with the data, but rather with the way that you have defined the learning problem. That is to say, what you care about is not accuracy: you care about something else. If you want a learning algorithm to do a reasonable job, you have to tell it what you want!
The figure above is a clear example of why a typical accuracy score is misleading for evaluating our classification algorithm. If we simply assigned the majority class to every record, we would still achieve high accuracy, but we would classify every 1 (fraud) incorrectly!
Most likely, what you want is not to optimize accuracy, but rather to optimize some other measure, like F-score or AUC. You want your algorithm to make some positive predictions, and you simply prefer those to be "good." We will shortly discuss two heuristics for dealing with this problem: subsampling and weighting. In subsampling, you throw out some of your negative examples so that you are left with a balanced dataset (50% positive, 50% negative). This might scare you a bit, since throwing out data seems like a bad idea, but at least it makes learning much more efficient. In weighting, instead of throwing out negative examples, we just give them lower weight. If you assign an importance weight of roughly 0.001 (that is, 0.1/99.9) to each of the negative examples, then there will be as much total weight associated with positive examples as with negative examples.
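The weighting heuristic can be sketched in a few lines of NumPy (the 999-to-1 split mirrors the 0.1% example; the variable names are illustrative):

```python
import numpy as np

# Weighting sketch: 999 negatives, 1 positive (the 0.1% scenario).
# Give each negative example weight 1/999 (about 0.001) so that the
# total weight on negatives equals the total weight on positives.
y = np.array([0] * 999 + [1])
weights = np.where(y == 0, 1 / 999, 1.0)

print(weights[y == 0].sum())  # approximately 1.0
print(weights[y == 1].sum())  # 1.0
```

No data is discarded; the learner simply pays as much attention, in aggregate, to the single positive example as to all the negatives combined.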
One simple remedy is to collect more data; however, this is not always possible.
Instead of overall accuracy, consider optimizing the following performance metrics:
- Use the confusion matrix to calculate precision and, particularly, recall
- F1 score (the harmonic mean of precision and recall)
- Cohen's Kappa, a classification accuracy normalized by the class imbalance in the data
- ROC curves, which plot the true positive rate (sensitivity) against the false positive rate (1 − specificity) across classification thresholds
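These metrics can be computed directly from confusion-matrix counts; a minimal sketch (the counts below are made up for illustration):

```python
# Hypothetical confusion-matrix counts for a fraud classifier
# on 10,000 transactions, 100 of which are actual frauds.
tp, fp, fn, tn = 80, 30, 20, 9870

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3))   # 0.995
print(round(precision, 3))  # 0.727
print(round(recall, 3))     # 0.8
print(round(f1, 3))         # 0.762
```

Accuracy is dominated by the 9,870 true negatives, while precision and recall expose how the 100 actual frauds were handled.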
Resampling: essentially, this is a family of methods that processes the data to reach an approximately 50-50 class ratio.
Over-sampling adds copies of the under-represented class (better when you have little data). Its main advantage is that it does not throw out any data.
Under-sampling deletes instances from the over-represented class (better when you have lots of data). Its main advantage is that it is more computationally efficient.
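Both resampling schemes reduce to sampling row indices; a short NumPy sketch with a synthetic label vector (the 99/1 split and all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 990 + [1] * 10)   # 99% majority class, 1% minority class
minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)

# Over-sampling: draw minority indices WITH replacement up to the majority size
over = rng.choice(minority, size=len(majority), replace=True)
balanced_over = np.concatenate([majority, over])

# Under-sampling: draw majority indices WITHOUT replacement down to the minority size
under = rng.choice(majority, size=len(minority), replace=False)
balanced_under = np.concatenate([under, minority])

print(len(balanced_over))   # 1980 rows, half of them repeated minority samples
print(len(balanced_under))  # 20 rows, half the majority data discarded
```

The trade-off from the text is visible in the sizes: over-sampling keeps every record but duplicates the minority, while under-sampling produces a much smaller (cheaper) training set.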
Apart from under- and over-sampling, there is a very popular approach called SMOTE (Synthetic Minority Over-sampling Technique). It is typically combined with under-sampling of the majority class, but its key idea is to over-sample the minority class not by replicating existing instances, but by constructing new synthetic minority-class instances via an algorithm.
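In practice you would reach for the `imbalanced-learn` library (`imblearn.over_sampling.SMOTE`), but the core idea, interpolating between a minority point and one of its minority-class nearest neighbours, fits in a short NumPy sketch (simplified for illustration; it omits details of the full algorithm):

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, rng=None):
    """Generate n_new synthetic minority samples by interpolating each
    chosen point toward one of its k nearest minority neighbours.
    A simplified illustration of the SMOTE idea, not the full algorithm."""
    rng = rng or np.random.default_rng(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from X_min[i] to every other minority point
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf                         # exclude the point itself
        neighbours = np.argsort(d)[:k]
        j = rng.choice(neighbours)
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Four minority points at the corners of the unit square (toy data)
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_sketch(X_min, n_new=10)
print(X_new.shape)  # (10, 2)
```

Because each synthetic point lies on a segment between two real minority points, the new samples stay inside the minority region instead of being exact duplicates.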
Add a term $\alpha$ to the cost function to more heavily penalize misclassifications of the minority class.
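As a sketch, here is a class-weighted binary cross-entropy in which errors on the minority (positive) class cost $\alpha$ times more; the value of $\alpha$ and the predicted probabilities are illustrative, not prescriptive:

```python
import numpy as np

def weighted_bce(y_true, p_pred, alpha=10.0):
    """Binary cross-entropy where positive-class errors cost alpha times more."""
    eps = 1e-12
    p = np.clip(p_pred, eps, 1 - eps)
    pos = -alpha * y_true * np.log(p)        # missed fraud is expensive
    neg = -(1 - y_true) * np.log(1 - p)
    return np.mean(pos + neg)

y = np.array([1.0, 0.0, 0.0, 0.0])        # one fraud among four transactions
p_miss = np.array([0.1, 0.1, 0.1, 0.1])   # model that misses the fraud
p_hit  = np.array([0.9, 0.1, 0.1, 0.1])   # model that catches it
print(weighted_bce(y, p_miss) > weighted_bce(y, p_hit))  # True
```

With $\alpha > 1$ the optimizer is pushed toward catching the rare positives, since ignoring them is no longer nearly free.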
Subsampling/Under-sampling in Fraud Example
This Python example will under-sample the dataset to create a balanced 50/50 ratio. This is done by randomly selecting $x$ samples from the majority class (not fraud), where $x$ is the total number of records in the minority class (fraud).
```python
# Number of data points in the minority class
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

# Picking the indices of the normal classes
normal_indices = data[data.Class == 0].index

# Out of the indices we picked, randomly select "x" number (number_records_fraud)
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
random_normal_indices = np.array(random_normal_indices)

# Appending the 2 sets of indices
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])

# Under-sampled dataset
under_sample_data = data.iloc[under_sample_indices, :]

# .ix is deprecated; .loc performs the same label-based selection
X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']

# Showing ratio
print("Percentage of normal transactions: ",
      len(under_sample_data[under_sample_data.Class == 0]) / len(under_sample_data))
print("Percentage of fraud transactions: ",
      len(under_sample_data[under_sample_data.Class == 1]) / len(under_sample_data))
print("Total number of transactions in resampled data: ", len(under_sample_data))
```
```
Percentage of normal transactions:  0.5
Percentage of fraud transactions:  0.5
Total number of transactions in resampled data:  984
```
After this, when we train a model, we are most interested in the recall score, because that is the metric that helps us capture the most fraudulent transactions. If you think about how accuracy, precision, and recall work on a confusion matrix, recall is the most interesting:
- Accuracy = (TP+TN)/total
- Precision = TP/(TP+FP)
- Recall = TP/(TP+FN)
As we know, due to the imbalance of the data, many observations could be predicted as false negatives: that is, we predict a normal transaction when it is in fact a fraudulent one. Recall captures exactly this.
If you train your model on the resampled dataset, you can then test it on the original unbalanced dataset and should achieve a higher recall rate.