Linear Regression Assumptions
Anscombe’s Quartet
Consider the four regressions below, all of which have the same summary statistics and R2 score:
%pylab inline
import numpy as np
import statsmodels
import statsmodels.api as sm
from statsmodels import regression
from scipy.stats import pearsonr
# Construct Anscombe's arrays
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
x2 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]
# Perform linear regressions on the datasets
slr1 = regression.linear_model.OLS(y1, sm.add_constant(x1)).fit()
slr2 = regression.linear_model.OLS(y2, sm.add_constant(x2)).fit()
slr3 = regression.linear_model.OLS(y3, sm.add_constant(x3)).fit()
slr4 = regression.linear_model.OLS(y4, sm.add_constant(x4)).fit()
# Print regression coefficients, Pearson r, and R-squared for the 4 datasets
print('Coefficients:', slr1.params, slr2.params, slr3.params, slr4.params)
print('Pearson r:', pearsonr(x1, y1)[0], pearsonr(x2, y2)[0], pearsonr(x3, y3)[0], pearsonr(x4, y4)[0])
print('R-squared:', slr1.rsquared, slr2.rsquared, slr3.rsquared, slr4.rsquared)
# Plot the 4 datasets with their regression lines
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)
xs = np.arange(20)
ax1.plot(xs, slr1.params[0] + slr1.params[1]*xs, 'r')
ax1.scatter(x1, y1)
ax1.set_xlabel('x1')
ax1.set_ylabel('y1')
ax2.plot(xs, slr2.params[0] + slr2.params[1]*xs, 'r')
ax2.scatter(x2, y2)
ax2.set_xlabel('x2')
ax2.set_ylabel('y2')
ax3.plot(xs, slr3.params[0] + slr3.params[1]*xs, 'r')
ax3.scatter(x3, y3)
ax3.set_xlabel('x3')
ax3.set_ylabel('y3')
ax4.plot(xs, slr4.params[0] + slr4.params[1]*xs, 'r')
ax4.scatter(x4, y4)
ax4.set_xlabel('x4')
ax4.set_ylabel('y4');
Coefficients: [ 3.00009091 0.50009091] [ 3.00090909 0.5 ] [ 3.00245455 0.49972727] [ 3.00172727 0.49990909]
Pearson r: 0.816420516345 0.816236506 0.81628673949 0.816521436889
R-squared: 0.666542459509 0.666242033727 0.666324041067 0.666707256898
Regression Assumptions:
This is a lesson that you cannot blindly trust the R2 score or other summary statistics as a measure of goodness of fit. In particular, you cannot ignore the assumptions of a regression, which are:
- Linear relationship between predictor and response, and no multicollinearity
- No auto-correlation (statistical independence of the errors)
- Homoscedasticity (constant variance) of the errors
- Normality of the residual (error) distribution.
I would like to explore the last assumption: the distribution of the residuals.
Normally Distributed Residuals
You can look at your residuals (errors) as an indicator of goodness of fit. If your residuals do not follow a normal distribution, that is typically a warning sign that your R2 score is dubious.
There are a few statistical tests for residual normality; in particular, the Jarque-Bera test is common and available in statsmodels (and scipy).
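Before reaching for a formal test, a quick numeric sketch (assuming scipy is available) is to compare sample skewness and excess kurtosis against the normal distribution's values of zero:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(0, 1, 2000)

# For a normal sample, both should be close to 0
# (scipy's kurtosis uses the Fisher definition, so normal -> 0, not 3)
print("skewness:", stats.skew(residuals))
print("excess kurtosis:", stats.kurtosis(residuals))
```

Large departures from zero in either direction hint that a normality test on the residuals will fail.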
Jarque-Bera
Jarque-Bera performs a hypothesis test in which the null hypothesis states that your residual sample comes from a normal distribution. The p-value reflects how likely the sample's skewness and kurtosis would be under that null; if it falls below your significance level, you reject the null.
Below is an example of the JB test on a normal and a Poisson distribution; see how the p-values lead us to retain and reject the null, respectively:
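For reference, the test statistic combines the sample skewness S and kurtosis K over n observations (this formula is standard, though not shown in the original post):

```latex
JB = \frac{n}{6}\left( S^2 + \frac{(K - 3)^2}{4} \right)
```

Under the null of normality, JB is asymptotically chi-squared with 2 degrees of freedom, which is why it needs a reasonably large sample to be reliable.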
from statsmodels.stats.stattools import jarque_bera

residuals = np.random.normal(0, 1, 2000)
_, pvalue, _, _ = jarque_bera(residuals)
print("normal dist. p-val:\t", pvalue)

residuals = np.random.poisson(size=2000)
_, pvalue, _, _ = jarque_bera(residuals)
print("Poisson dist. p-val:\t", pvalue)
normal dist. p-val: 0.268794984263
Poisson dist. p-val: 3.21196579041e-75
As you can see from the test above, the p-value for the normal distribution is much larger than the 5% significance level, while the p-value for the Poisson distribution is far below it.
Conclusion
Although we can’t run the Jarque-Bera test on Anscombe’s quartet above (there aren’t enough samples for the asymptotic test to be reliable), the example shows it is a useful way to evaluate a model beyond summary statistics, which the Anscombe example proves can be misleading.