# Linear Regression Assumptions

Alex Egg,

## Anscombe’s Quartet

Consider the 4 regression below, all of which have the same summary statistics and R2 score:

%pylab inline
import numpy as np
import numpy as np
import statsmodels.api as sm
from statsmodels import regression, stats
import statsmodels
from scipy import stats

from scipy.stats import pearsonr

# Construct Anscombe's arrays
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
x2 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

# Perform linear regressions on the datasets

# Print regression coefficients, Pearson r, and R-squared for the 4 datasets
print 'Cofficients:', slr1.params, slr2.params, slr3.params, slr4.params
print 'Pearson r:', pearsonr(x1, y1)[0], pearsonr(x2, y2)[0], pearsonr(x3, y3)[0], pearsonr(x4, y4)[0]
print 'R-squared:', slr1.rsquared, slr2.rsquared, slr3.rsquared, slr4.rsquared

# Plot the 4 datasets with their regression lines
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2)
xs = np.arange(20)
ax1.plot(slr1.params[0] + slr1.params[1]*xs, 'r')
ax1.scatter(x1, y1)
ax1.set_xlabel('x1')
ax1.set_ylabel('y1')
ax2.plot(slr2.params[0] + slr2.params[1]*xs, 'r')
ax2.scatter(x2, y2)
ax2.set_xlabel('x2')
ax2.set_ylabel('y2')
ax3.plot(slr3.params[0] + slr3.params[1]*xs, 'r')
ax3.scatter(x3, y3)
ax3.set_xlabel('x3')
ax3.set_ylabel('y3')
ax4.plot(slr4.params[0] + slr4.params[1]*xs, 'r')
ax4.scatter(x4,y4)
ax4.set_xlabel('x4')
ax4.set_ylabel('y4');

Cofficients: [ 3.00009091  0.50009091] [ 3.00090909  0.5       ] [ 3.00245455  0.49972727] [ 3.00172727  0.49990909]
Pearson r: 0.816420516345 0.816236506 0.81628673949 0.816521436889
R-squared: 0.666542459509 0.666242033727 0.666324041067 0.666707256898


## Regression Assumptions:

This is a lesson in that you cannon blindly trust the R2 score or other summary statisitcs as a measure of fit goodness. Particularly, you cannot ignore the assumptions of a regression, which are:

• Linear relationship between predictor and response and no multicollinerity
• No auto-correlation (statistical independence of the errors)
• Homoscedasticity (constant variance) of the errors
• Normality of the residual(error) distribution.

I would like to explore the assumption of residual distribution.

### Normally Distributed Residuals

You can look at your residuals (loss) as an indicator of good fit. If your residuals do not follow a normal distribution, this is typically a warning sign that your R2 score is dubious.

There are a few statistical tests for Residual Normality, particularly, the Jaque-Bara test is common and available in scipy.

#### Jarque-Bera

Jarque-Bera performors a hypothsis tests, where the null stats that your residual sample is from normal distribution. The test p-value reflects the probability of accepting the null hypothesis. If it’s too low then you reject it.

Below is an example of the JB tests on a normal and poisson distribution, see how the p-value of the normal and poisson cause us to accept and reject the null respectivly:

residuals = np.random.normal(0, 1, 2000)

_, pvalue, _, _ = statsmodels.stats.stattools.jarque_bera(residuals)
print "normal dist. p-val:\t",pvalue

residuals = np.random.poisson(size = 2000)

_, pvalue, _, _ = statsmodels.stats.stattools.jarque_bera(residuals)
print "poission dist. p-val:\t",pvalue

normal dist. p-val:	0.268794984263
poission dist. p-val:	3.21196579041e-75


As you can see from the test above, that the p-value for the normal distributions is much larger than 5% which the p value for the arbitrary poisson distrubtion is much lower than the 5% baseline.

## Conclusion

Although we can’t run the Jarque-Bera test on Anscombe’s quartet above (because there aren’t enough samples), we can see from the example above it should be a good metric to evaluate our model beyond summary statistics, which the Anscombe exmaple proves can be misleading.