Consumer Credit Risk Models via Machine-Learning Algorithms

Alex Egg,

I was recently talking to a company that is modeling creditworthiness from checking-account data instead of the standard credit-report method. One of the use cases for this is underserved segments such as migrants and young people. I asked where the initial hypothesis for the correlation came from, and they referred me to this paper:

It’s a very interesting study from some MIT researchers who were given an anonymized dataset of real bank customers. In the paper they propose a cardinal measure of consumer credit risk that combines traditional credit factors, such as debt-to-income ratios, with consumer banking transactions. They show a non-linear correlation between credit score and default frequency, and identify a few trends in the data that led up to the 2008 Financial Crisis.

[Figure 5]

For example, the debt/income ratio rose steadily in the run-up to the crisis.



Feature Engineering

High Balance-to-Income Ratio

Negative Income Shocks

One key feature they were able to pull out of the data is called “income shock”: “A sharp and sudden drop in income – most likely due to unemployment – is a plausible potential predictor of future delinquency and default.” “The challenge is to identify income shocks that are significant, i.e., large relative to his/her historical levels and variability of income…. we construct an income-shock variable by taking the difference between the current month’s income and the 6-month [moving average], and then divide this difference by the standard deviation of income over the same trailing 6-month window.”
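That construction can be sketched in a few lines of pandas. This is a minimal illustration on a made-up monthly income series; the numbers are invented, and whether the trailing window excludes the current month is my assumption, not a detail from the paper:

```python
import pandas as pd

# Hypothetical monthly income series for one customer (numbers are made
# up for illustration; the last month has a sharp drop).
income = pd.Series(
    [3000, 3100, 2950, 3050, 3000, 3100, 1200],
    index=pd.period_range("2008-01", periods=7, freq="M"),
)

# Trailing 6-month moving average and standard deviation. shift(1) keeps
# the window strictly historical (my assumption; the paper may include
# the current month in the window).
trailing = income.shift(1).rolling(window=6)
income_shock = (income - trailing.mean()) / trailing.std()

print(income_shock.iloc[-1])  # strongly negative: an income shock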

[Figure 9]
The gap between the delinquency rates for individuals with and without income shocks is clear.


They want to build a model that predicts the probability of 90-days-or-more delinquency within the next 3 months, given input variables from that customer’s account.

Decision Tree

Their initial model is based on a decision tree because it can fit non-linear relationships in the data, and for a more practical reason: decision trees are easy to visualize and their probability outcomes are easy to inspect, as opposed to more black-box methods, which might be viewed with skepticism in a financial environment.
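A rough sketch of such a model with scikit-learn, on fully synthetic data. The feature names, the label-generating rule, and the tree depth are all my assumptions for illustration, not the paper’s specification:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for two of the paper's features (names and the
# label-generating rule are invented, not the paper's data).
n = 5000
balance_to_income = rng.uniform(0, 3, n)
income_shock = rng.normal(0, 1, n)
X = np.column_stack([balance_to_income, income_shock])

# Higher balance/income and a negative income shock raise delinquency odds.
p = 1 / (1 + np.exp(-(X[:, 0] - 2 * X[:, 1] - 3)))
y = (rng.random(n) < p).astype(int)

# A shallow tree stays easy to visualize (e.g. with sklearn.tree.plot_tree).
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Estimated probability of 90+ days delinquency for a risky-looking customer:
print(tree.predict_proba([[2.5, -3.0]])[0, 1])
```

The tree’s `predict_proba` output is the leaf-level class frequency, which is exactly the kind of directly inspectable probability estimate the post argues for.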


The dataset is highly skewed: only 2% of the observations represent defaults. One technique to deal with this is called boosting. Instead of weighting all observations in the training set equally, boosting iteratively re-weights them, up-weighting the observations the current model misclassifies, which in a skewed dataset tend to be the scarce default cases. Boosting was popularized by Schapire and by my professor at UCSD, Freund.
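A minimal illustration of Freund and Schapire’s AdaBoost via scikit-learn, on a synthetic dataset skewed to roughly the same 2% positive rate. The features and labeling rule are fabricated for illustration:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(1)

# Skewed synthetic dataset: roughly 2% positives, mirroring the default
# rate described in the post (features and rule are made up).
n = 10000
X = rng.normal(0, 1, (n, 3))
y = (X[:, 0] + X[:, 1] > 2.9).astype(int)

# AdaBoost fits a sequence of weak learners (decision stumps by default),
# re-weighting the training set after each round so that misclassified
# observations -- here, mostly the rare positives -- count for more.
clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)

print(f"positive rate: {y.mean():.3f}")
```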


The final dataset is a combination of selected fields from the three original datasets: Credit Bureau, Transactions & Deposits.

Credit Bureau Data

Transaction Data

There is a considerable amount of pre-processing required for this dataset, because originally it is simply a list of transactions for each user. To get it in shape for the training set, everything had to be aggregated per user to produce the averages and totals below.
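That per-user aggregation might look something like the following pandas sketch. The schema, column names, and values are my assumptions, not the paper’s actual transaction format:

```python
import pandas as pd

# Hypothetical raw transaction log; the schema is an assumption.
tx = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2],
    "amount":   [52.10, 13.40, 700.00, 45.00, 45.00],
    "category": ["grocery", "restaurant", "rent", "grocery", "grocery"],
})

# Per-user totals and averages across all transactions.
per_user = tx.groupby("user_id")["amount"].agg(total="sum", average="mean")

# Per-user totals broken out by category (e.g. total grocery expenses).
per_category = tx.pivot_table(
    index="user_id", columns="category",
    values="amount", aggfunc="sum", fill_value=0,
)

features = per_user.join(per_category)
print(features)
```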

Another difficult technical aspect of this dataset, one that may potentially be overlooked, is grouping/aggregating transactions by type. For example, we have a total grocery expenses metric below, but that is not a trivial thing to establish: how can we know whether a given transaction belongs to a given category, in this case grocery? I know some banks will try to do the categorization, but some of my personal accounts do not. One consumer-facing company that does this well is Mint: they have fairly accurate transaction-classification routines in place.
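As a rough sketch of the problem, here is a naive keyword-matching categorizer. The rules and merchant names are invented, and production systems like Mint’s use far richer classifiers than this:

```python
# Naive keyword-matching categorizer. The keyword lists are made up for
# illustration; real categorizers handle far messier merchant strings.
CATEGORY_KEYWORDS = {
    "grocery": ["safeway", "trader joe", "whole foods", "kroger"],
    "restaurant": ["grill", "pizza", "cafe"],
    "rent": ["property mgmt", "apartments", "rent"],
}

def categorize(description: str) -> str:
    """Return the first category whose keywords match the description."""
    desc = description.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in desc for kw in keywords):
            return category
    return "uncategorized"

print(categorize("TRADER JOE'S #512"))  # grocery
print(categorize("POS DEBIT 1234"))     # uncategorized
```

Even this toy version shows why the problem is hard: opaque descriptions like “POS DEBIT 1234” carry no categorizable signal at all.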

Deposit Data



Last edited by Alex Egg, 2016-11-16 21:44:34