# Consumer Credit Risk Models via Machine-Learning Algorithms

Alex Egg

I was recently talking to a company that models creditworthiness from checking account data instead of the standard credit report method. One use case for this is the underserved segment of migrants or young people. I asked where the initial hypothesis for the correlation came from and they referred me to this paper: https://dspace.mit.edu/openaccess-disseminate/1721.1/66301

It’s a very interesting study from some MIT researchers in which they were given an anonymized dataset from real bank customers. In the paper they propose a cardinal measure of consumer credit risk that combines traditional credit factors such as debt-to-income ratios with consumer banking transactions. They go on to show a non-linear correlation between credit score and default frequency, and to identify a few trends in the data that led up to the 2008 Financial Crisis.

[Figure 5]

For example, the debt-to-income ratio increased in the period leading up to the crisis.

## Data

Show all possible data

## Feature Engineering

### Negative Income Shocks

One key feature they were able to pull out of the data is called “Income Shock”: “A sharp and sudden drop in income – most likely due to unemployment – is a plausible potential predictor of future delinquency and default.” They continue: “The challenge is to identify income shocks that are significant, i.e., large relative to his/her historical levels and variability of income…. we construct an income-shock variable by taking the difference between the current month’s income and the 6-month [average], and then divide this difference by the standard deviation of income over the same trailing 6-month window.” In other words, the income shock is a z-score of the current month’s income against its own recent history.
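The definition quoted above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the paper's code: the function name `income_shock` is made up, I assume the trailing 6-month window excludes the current month, and I use the sample standard deviation.

```python
from statistics import mean, stdev


def income_shock(monthly_income, window=6):
    """Z-score of the latest month's income against the trailing window.

    monthly_income: monthly income totals, oldest first; the last entry is
    the current month. Assumption: the trailing window excludes the current
    month (the quoted definition leaves this open).
    Returns None when income did not vary at all over the window.
    """
    if len(monthly_income) < window + 1:
        raise ValueError("need at least window + 1 months of income")
    current = monthly_income[-1]
    trailing = monthly_income[-(window + 1):-1]
    sigma = stdev(trailing)  # sample standard deviation of the window
    if sigma == 0:
        return None
    return (current - mean(trailing)) / sigma


# Made-up example: steady income around 3000, then a sudden drop.
history = [3000, 3100, 2900, 3000, 3100, 2900, 1500]
shock = income_shock(history)
print(f"income shock: {shock:.2f}")
```

A large negative value means the current month's income is far below what the customer's recent history would suggest. Where to draw the line for a "significant" shock is a modeling choice I have not tried to reproduce here.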

[Figure 9]
The gap between the delinquency rates for individuals w/ and w/o income shock is clear.

## Model

They want to build a model that estimates the probability of a 90-days-or-more delinquency within the next 3 months, given specific input variables from that customer's account.
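To make the prediction target concrete, here is a rough sketch of how such a label could be derived. The data layout (one "days past due" value per month) and the function name `label_delinquent` are my own assumptions; the paper's actual construction may differ.

```python
def label_delinquent(days_past_due, t, horizon=3, threshold=90):
    """Label the observation made in month t.

    days_past_due: list with one entry per month (hypothetical layout).
    Returns 1 if the account is `threshold` or more days past due in any of
    the `horizon` months after t, 0 if not, and None when there is not
    enough future data to decide.
    """
    future = days_past_due[t + 1 : t + 1 + horizon]
    if len(future) < horizon:
        return None
    return int(any(days >= threshold for days in future))


# Made-up account history: the customer falls further behind each month.
history = [0, 0, 30, 60, 90, 0]
print(label_delinquent(history, t=0))  # looks at months 1-3
print(label_delinquent(history, t=1))  # looks at months 2-4
```

Each (customer, month) pair then becomes one training row: the features as of month `t`, and this label as the outcome.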

### Decision Tree

Their initial model is a decision tree, for two reasons. First, it can fit non-linear relationships in the data. Second, and more practically, decision trees are easy to visualize: you can read the probability outcomes straight off the tree, whereas more black-box methods might be viewed w/ skepticism in a financial environment.
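To show why a tree is easy to read, here is a minimal sketch of a one-split tree (a "stump") in plain Python. It is not the paper's model: the feature (debt-to-income in percent), the data, and the function names are all made up for illustration. It picks the threshold that minimizes weighted Gini impurity and reports the default frequency in each leaf.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, p_left, p_right) for the best single split.

    p_left / p_right are the observed default frequencies among rows with
    value <= threshold and value > threshold respectively.
    """
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or score < best[0]:
            best = (score, threshold, sum(left) / len(left), sum(right) / len(right))
    if best is None:
        raise ValueError("need at least two distinct values to split")
    return best[1:]


# Toy data: debt-to-income in percent, and whether the customer defaulted.
dti = [10, 20, 30, 40, 60, 70, 80, 90]
defaulted = [0, 0, 0, 0, 1, 0, 1, 1]
threshold, p_low, p_high = best_split(dti, defaulted)
print(f"DTI <= {threshold}: default rate {p_low:.2f}")
print(f"DTI >  {threshold}: default rate {p_high:.2f}")
```

The whole model is one rule with a probability on each side, which is the kind of output a risk officer can inspect directly. A real tree repeats this split recursively inside each leaf.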

#### Boosting

The dataset is highly skewed: only 2% of the observations are defaults. One technique for dealing with this is called boosting. Instead of weighting all observations in the training set equally, we can weight the scarcer observations more heavily than the common ones. Boosting was popularized by Schapire and my professor at UCSD, Freund.
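The weighting idea can be sketched as follows. This shows only the "weight the scarce class more heavily" step described above, using inverse class frequency; a full boosting algorithm such as AdaBoost additionally re-weights misclassified observations at every round, which I leave out. The function name `class_balanced_weights` is hypothetical.

```python
from collections import Counter


def class_balanced_weights(labels):
    """Weight each observation inversely to its class frequency.

    The weights sum to 1 and every class receives the same total weight,
    so a rare class is not drowned out by a common one.
    """
    counts = Counter(labels)
    n_classes = len(counts)
    return [1.0 / (n_classes * counts[y]) for y in labels]


# Made-up sample mirroring the 2% default rate: 98 non-defaults, 2 defaults.
labels = [0] * 98 + [1] * 2
weights = class_balanced_weights(labels)
print(f"weight per non-default: {weights[0]:.5f}")
print(f"weight per default:     {weights[-1]:.5f}")
```

With a 98/2 split, each default ends up carrying 49 times the weight of a non-default.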

### Features

The final dataset is a combination of selected fields from the three original datasets: Credit Bureau, Transactions, and Deposits.

#### Credit Bureau Data

• Total number of trade lines
• Number of open trade lines
• Number of closed trade lines
• Number and balance of auto loans
• Number and balance of credit cards
• Number and balance of home lines of credit
• Number and balance of home loans
• Number and balance of all other loans
• Number and balance of all other lines of credit
• Number and balance of all mortgages

• Balance of all auto loans to total debt
• Balance of all credit cards to total debt
• Balance of all home lines of credit to total debt
• Balance of all home loans to total debt
• Balance of all other loans to total debt
• Balance of all other lines of credit to total debt
• Ratio of total mortgage balance to total debt

• Total credit-card balance to limits
• Total home lines of credit balance to limits
• Total balance on all other lines of credit to limits
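The ratio features in the last two groups are simple to derive once the balances are known. Here is a rough sketch; the dictionary keys, the function names, and the choice to define total debt as the sum of the balances passed in are my own assumptions, not the paper's definitions.

```python
def safe_ratio(numerator, denominator):
    """Divide, returning None instead of failing on a zero denominator."""
    return numerator / denominator if denominator else None


def bureau_ratios(balances, limits):
    """Build balance-to-total-debt and balance-to-limit features.

    balances: {product name: outstanding balance}
    limits:   {product name: credit limit}, only for revolving products
    Assumption: total debt is the sum of the balances passed in.
    """
    total_debt = sum(balances.values())
    features = {
        f"{name}_to_total_debt": safe_ratio(balance, total_debt)
        for name, balance in balances.items()
    }
    for name, limit in limits.items():
        features[f"{name}_balance_to_limit"] = safe_ratio(balances.get(name, 0.0), limit)
    return features


# Made-up customer.
balances = {"auto_loans": 5000.0, "credit_cards": 2000.0, "mortgages": 93000.0}
limits = {"credit_cards": 8000.0}
for name, value in bureau_ratios(balances, limits).items():
    print(name, value)
```

Returning `None` for a zero denominator keeps "no debt" or "no credit line" distinguishable from a genuine ratio of zero.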

#### Transaction Data

There is a considerable amount of pre-processing required for this dataset, because originally it is simply a list of transactions for each user. To get it in shape for the training set, it all had to be aggregated per user to produce the averages and totals below.

Another difficult technical aspect of this dataset, one that may potentially be overlooked, is grouping/aggregating transactions by type. For example, we have a total grocery expenses metric below, but that is not a trivial thing to establish. How can we know if a given transaction belongs to a given category, in this case grocery? I know some banks will try to do the categorization, but some of my personal accounts do not. One consumer-facing company that does this well is Mint: they have pretty accurate transaction-classifying routines in place.

• Number of Transactions
• Total inflow
• Total outflow
• Total pay inflow

• Total all food related expenses
• Total grocery expenses
• Total restaurant expenses
• Total fast food expenses
• Total bar expenses
• Total expenses at discount stores
• Total expenses at big-box stores
• Total recreation expenses
• Total clothing stores expenses
• Total department store expenses
• Total other retail stores expenses

• Total utilities expenses
• Total cable TV & Internet expenses
• Total telephone expenses

• Total net flow from brokerage account
• Total net flow from dividends and annuities

• Total gas station expenses
• Total vehicle related expenses

• Total lodging expenses
• Total travel expenses

• Total credit-card payments
• Total mortgage payments
• Total outflow to car and student loan payments

• Total education related expenses
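The per-user aggregation behind totals like these can be sketched as follows. The keyword rules are a deliberately naive stand-in for the hard categorization problem mentioned earlier, and everything here (the keywords, the tuple layout, the function names) is hypothetical rather than taken from the paper.

```python
from collections import defaultdict

# Hypothetical keyword rules; a real categorizer would need far more than this.
CATEGORY_KEYWORDS = {
    "grocery": ("grocery", "supermarket"),
    "restaurant": ("restaurant", "cafe"),
    "gas_station": ("gas", "fuel"),
}


def categorize(description):
    """Return the first category whose keyword appears in the description."""
    text = description.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "other"


def aggregate(transactions):
    """Collapse a raw transaction list into one feature row per user.

    transactions: iterable of (user_id, description, amount), where a
    positive amount is an inflow and a negative amount is an outflow.
    """
    features = defaultdict(lambda: defaultdict(float))
    for user_id, description, amount in transactions:
        row = features[user_id]
        row["n_transactions"] += 1
        if amount >= 0:
            row["total_inflow"] += amount
        else:
            row["total_outflow"] += -amount
            row[f"total_{categorize(description)}_expenses"] += -amount
    return {user_id: dict(row) for user_id, row in features.items()}


# Made-up transactions.
transactions = [
    ("u1", "ACME PAYROLL", 3000.0),
    ("u1", "Corner Supermarket", -120.5),
    ("u1", "Joe's Cafe", -30.0),
    ("u2", "City Fuel Stop", -45.0),
]
for user_id, row in aggregate(transactions).items():
    print(user_id, row)
```

Substring matching like this breaks quickly on real merchant strings, which is exactly why the categorization step is harder than it looks.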

#### Deposit Data

• Savings account balance
• Checking account balance
• CD account balance
• Brokerage account balance