Among competing hypotheses, the one with the fewest assumptions should be selected
What is a “complex” vs “simple” hypothesis?
A “simple” model is one where $\theta$ has few non-zero parameters. i.e.: only a few features are relevant
A “simple” model is one where $\theta$ is almost uniform. i.e: few features are significantly more relevant than the others
Regularization is the process of penalizing model complexity during training
Models w/ high bias will have a tendency to overfit the data or to learn the noise. One method to combat this is called regularization. You basically add a term to your cost function that penalizes large weights, which, in effect penalizes model complexity in training.
Equation 1 is your standard optimization problem that is often implemented as gradient descent. The First term is the standard residual measure that we are trying to minimize. The second term is the interesting part: the regularization. The coefficient $\lambda$ or sometimes $\alpha$ is simply a parameter we have to optimize. The interesting part is the norm of the weights vector: this is called the l2 Norm.
Consider the vectors $a=(0.5, 0.5)$ and $b=(-1,0)$. We can compute the L1 and L2 norms:
As you can see, the two vectors are equivalent with respect to the L1 norm, however, the are different w/ respect to teh L2 norm. This is because squaring the number punishes large values more than small values.
Often times the equation 1 is called “Tikhonov regularization” in academia or called Ridge in machine learning circles. For example, it is implemented as Ridge Regression in scikit. Ridge Regression really wants small value in all slots of $\theta$, whereas solving the L1 version doesn’t care if it’s large values or not.
So we have stated the L2 regularization helps to remove variance across the weights. Lets take a look at that in practice by comparing it to an unregularized linear regression.
# Code source: Gaël Varoquaux # Modified by Alex Egg 12/15/16 # License: BSD 3 clause import numpy as np import matplotlib.pyplot as plt %matplotlib inline from sklearn import linear\_model
X_train = np.c_[.5, 1].T y_train = [.5, 1] X_test = np.c_[0, 2].T np.random.seed(0) classifiers = dict(ols=linear_model.LinearRegression(), ridge=linear_model.Ridge(alpha=.1)) fignum = 1 for name, clf in classifiers.items(): fig = plt.figure(fignum, figsize=(4, 3)) plt.clf() plt.title(name) ax = plt.axes([.12, .12, .8, .8]) for _ in range(6): this_X = .1 * np.random.normal(size=(2, 1)) + X_train clf.fit(this_X, y_train) ax.plot(X_test, clf.predict(X_test), color='.5') ax.scatter(this_X, y_train, s=3, c='.5', marker='o', zorder=10) clf.fit(X_train, y_train) ax.plot(X_test, clf.predict(X_test), linewidth=2, color='blue') ax.scatter(X_train, y_train, s=30, c='r', marker='+', zorder=10) ax.set_xticks(()) ax.set_yticks(()) ax.set_ylim((0, 1.6)) ax.set_xlabel('X') ax.set_ylabel('y') ax.set_xlim(0, 2) fignum += 1 plt.show()
“Due to the few points in each dimension and the straight line that linear regression uses to follow these points as well as it can, noise on the observations will cause great variance as shown in the first plot. Every line’s slope can vary quite a bit for each prediction due to the noise induced in the observations.
Ridge regression is basically minimizing a penalized version of the least-squared function. The penalizing shrinks the value of the regression coefficients. Despite the few data points in each dimension, the slope of the prediction is much more stable and the variance in the line itself is greatly reduced, in comparison to that of the standard linear regression.”
If you are taking a regression using a linear method, depending on your data you should probably use a type of regularizer, either L1 (Lasso) or L2 (Ridge) to avoid a high number of low-variance features or to avoid features w/ high variance relative to the others, respectively.