Linear Regression Assumptions

Alex Egg

Anscombe’s Quartet

Consider the 4 regressions below, all of which have essentially the same summary statistics and R2 score:

%pylab inline
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels import regression
import statsmodels.stats.stattools
from scipy.stats import pearsonr

# Construct Anscombe's arrays
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
x2 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

# Perform linear regressions on the datasets
slr1 = regression.linear_model.OLS(y1, sm.add_constant(x1)).fit()
slr2 = regression.linear_model.OLS(y2, sm.add_constant(x2)).fit()
slr3 = regression.linear_model.OLS(y3, sm.add_constant(x3)).fit()
slr4 = regression.linear_model.OLS(y4, sm.add_constant(x4)).fit()

# Print regression coefficients, Pearson r, and R-squared for the 4 datasets
print 'Coefficients:', slr1.params, slr2.params, slr3.params, slr4.params
print 'Pearson r:', pearsonr(x1, y1)[0], pearsonr(x2, y2)[0], pearsonr(x3, y3)[0], pearsonr(x4, y4)[0]
print 'R-squared:', slr1.rsquared, slr2.rsquared, slr3.rsquared, slr4.rsquared

# Plot the 4 datasets with their regression lines
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2)
xs = np.arange(20)
ax1.plot(xs, slr1.params[0] + slr1.params[1]*xs, 'r')
ax1.scatter(x1, y1)
ax1.set_xlabel('x1')
ax1.set_ylabel('y1')
ax2.plot(xs, slr2.params[0] + slr2.params[1]*xs, 'r')
ax2.scatter(x2, y2)
ax2.set_xlabel('x2')
ax2.set_ylabel('y2')
ax3.plot(xs, slr3.params[0] + slr3.params[1]*xs, 'r')
ax3.scatter(x3, y3)
ax3.set_xlabel('x3')
ax3.set_ylabel('y3')
ax4.plot(xs, slr4.params[0] + slr4.params[1]*xs, 'r')
ax4.scatter(x4,y4)
ax4.set_xlabel('x4')
ax4.set_ylabel('y4');
Coefficients: [ 3.00009091  0.50009091] [ 3.00090909  0.5       ] [ 3.00245455  0.49972727] [ 3.00172727  0.49990909]
Pearson r: 0.816420516345 0.816236506 0.81628673949 0.816521436889
R-squared: 0.666542459509 0.666242033727 0.666324041067 0.666707256898

(Figure: the four Anscombe datasets with their near-identical fitted regression lines, y ≈ 3 + 0.5x.)

Regression Assumptions:

This is a lesson that you cannot blindly trust the R2 score or other summary statistics as a measure of goodness of fit. In particular, you cannot ignore the assumptions of a linear regression, which are:

- Linearity: the relationship between the predictors and the response is linear
- Independence: the errors are independent of one another
- Homoscedasticity: the errors have constant variance
- Normality: the errors are normally distributed

I would like to explore the last of these: the distribution of the residuals.

Normally Distributed Residuals

You can inspect your residuals (the differences between observed and fitted values) as an indicator of goodness of fit. If your residuals do not follow a normal distribution, that is typically a warning sign that your R2 score is dubious.
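
For example, statsmodels exposes the residuals directly on the fitted results object. A minimal sketch reusing the slr2 fit from above (dataset 2 is the visibly curved one):

# slr2.resid holds observed-minus-fitted values for the second dataset
residuals = slr2.resid

# Plot residuals against x2: the clear curvature shows that a straight
# line is the wrong model for dataset 2, despite its R2 of ~0.67
plt.scatter(x2, residuals)
plt.axhline(0, color='r')
plt.xlabel('x2')
plt.ylabel('residual');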

There are a few statistical tests for residual normality; in particular, the Jarque-Bera test is common and available in statsmodels.

Jarque-Bera

Jarque-Bera performs a hypothesis test where the null hypothesis states that your residual sample is drawn from a normal distribution. The test's p-value reflects how probable a sample at least this non-normal would be if the null were true; if it is too low (conventionally below 5%), you reject the null.
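
For reference, the JB statistic is built from the sample skewness S and kurtosis K (which are 0 and 3, respectively, for a normal distribution):

$$JB = \frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right)$$

Under the null it is asymptotically chi-squared distributed with 2 degrees of freedom, so pronounced skew or excess kurtosis drives the p-value down.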

Below is an example of the JB test on samples from a normal and a Poisson distribution; see how the p-values lead us to retain and reject the null, respectively:

# Draw a sample that is normal by construction; JB should not reject
residuals = np.random.normal(0, 1, 2000)

_, pvalue, _, _ = statsmodels.stats.stattools.jarque_bera(residuals)
print "normal dist. p-val:\t", pvalue


# Draw a clearly non-normal (skewed) sample; JB should reject
residuals = np.random.poisson(size=2000)

_, pvalue, _, _ = statsmodels.stats.stattools.jarque_bera(residuals)
print "poisson dist. p-val:\t", pvalue
normal dist. p-val:	0.268794984263
poisson dist. p-val:	3.21196579041e-75

As you can see from the tests above, the p-value for the normal sample is much larger than the 5% baseline, while the p-value for the Poisson sample is far below it.

Conclusion

Although we can't run the Jarque-Bera test on Anscombe's quartet above (at 11 points per dataset there aren't enough samples for this asymptotic test), we can see from the example above that it is a useful check on a model beyond summary statistics, which the Anscombe quartet proves can be misleading.
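
As an aside, if you do want a formal normality check at sample sizes this small, scipy's Shapiro-Wilk test is better suited than the asymptotic Jarque-Bera. A minimal sketch, reusing the four fits from above:

from scipy.stats import shapiro

# Shapiro-Wilk remains valid at small sample sizes (n=11 here), unlike
# Jarque-Bera, which relies on large-sample asymptotics
for name, slr in [('slr1', slr1), ('slr2', slr2), ('slr3', slr3), ('slr4', slr4)]:
    _, pvalue = shapiro(slr.resid)
    print name, 'residuals Shapiro-Wilk p-val:\t', pvalue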
