Bias & Variance
Bias and variance are the two main components of prediction error in predictive models, and in general there is a trade-off between them: reducing one tends to increase the other. Bias is a measure of model rigidity and inflexibility; a high-bias model is not capturing all the signal it could from the data. High bias is also known as under-fitting. Variance, on the other hand, is a measure of model inconsistency: high-variance models tend to perform very well on some data points and very badly on others. This is also known as over-fitting and means that your model is too flexible for the amount of training data you have, so it ends up picking up noise in addition to the signal, learning random patterns that happen by chance and do not generalize beyond your training data.
If your model performs really well on the training set but much worse on the hold-out set, it is suffering from high variance. If, on the other hand, it performs poorly on both the training and test sets, it is suffering from high bias.
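The train-versus-hold-out check above can be tried on a toy problem. This is a minimal sketch, not a recipe: the data (noisy `sin(x)`), the "rigid" mean predictor, and the "flexible" 1-nearest-neighbour predictor are all made up for illustration.

```python
import math
import random

random.seed(0)

def make_data(n):
    """Noisy samples of sin(x) on [0, 2*pi] (a made-up toy problem)."""
    xs = [random.uniform(0, 2 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + random.gauss(0, 0.3) for x in xs]
    return xs, ys

def mse(predict, xs, ys):
    """Mean squared error of a prediction function on a data set."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = make_data(100)
test_x, test_y = make_data(100)

# High-bias model: ignores x and always predicts the training mean.
mean_y = sum(train_y) / len(train_y)

def rigid(x):
    return mean_y

# High-variance model: 1-nearest-neighbour memorizes every training point.
def flexible(x):
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

rigid_train, rigid_test = mse(rigid, train_x, train_y), mse(rigid, test_x, test_y)
flex_train, flex_test = mse(flexible, train_x, train_y), mse(flexible, test_x, test_y)

print(f"rigid    train={rigid_train:.3f} test={rigid_test:.3f}")
print(f"flexible train={flex_train:.3f} test={flex_test:.3f}")
```

The rigid model is poor on both sets (high bias). The flexible model is perfect on the training set and worse on the hold-out set (high variance).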
Logistic Regression
- Works well if the data is linearly separable and correlated; a low-variance algorithm that generalizes well
- Simple cost function
- Adaptable to online learning (stochastic gradient descent)
- Outputs useful probabilities for each class
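A minimal sketch of the online-learning point, assuming the model in question is logistic regression trained with stochastic gradient descent. The function names, learning rate, and four-point data set are hypothetical. Each update uses a single example, which is why the same loop works on a stream of data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(samples, lr=0.5, epochs=200):
    """Fit p(y=1|x) = sigmoid(w*x + b) one example at a time."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in samples:
            p = sigmoid(w * x + b)
            # Gradient of the log loss for a single example.
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

# Made-up, linearly separable 1-D data: negatives left of 0, positives right.
samples = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b = sgd_logistic(samples)

def predict_proba(x):
    """Probability of class 1, the 'useful output' mentioned above."""
    return sigmoid(w * x + b)

print(predict_proba(1.0), predict_proba(-1.0))
```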
Regularization is the process of penalizing model complexity during training. There are two popular techniques, L1 (Lasso) and L2 (Ridge):
- L1 attempts to remove uninformative features by pulling their weights to exactly 0. L1 is good for sparse datasets. In the end only a few features remain relevant.
- L2 attempts to spread the weights evenly, shrinking all of them without zeroing any, which helps dense datasets avoid overfitting because noisy terms cannot dominate. In the end few features are much more significant than the others.
Tradeoff: which do we value more, accuracy or simplicity?
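The difference between the two penalties is easiest to see on a single weight. This is a sketch under a simplifying assumption (one weight at a time, as with orthonormal features); the penalty strength `lam` and the starting weights are made up. In that setting the L2 solution rescales the unpenalized weight, while the L1 solution subtracts a fixed amount and clips at zero.

```python
def ridge_shrink(w_ols, lam):
    """Minimizer of 0.5*(w - w_ols)**2 + 0.5*lam*w**2."""
    return w_ols / (1.0 + lam)

def lasso_shrink(w_ols, lam):
    """Minimizer of 0.5*(w - w_ols)**2 + lam*abs(w) (soft-thresholding)."""
    if w_ols > lam:
        return w_ols - lam
    if w_ols < -lam:
        return w_ols + lam
    return 0.0

unpenalized = [3.0, -0.4, 0.2, -2.0, 0.05]   # hypothetical unregularized weights
lam = 0.5
l2 = [ridge_shrink(w, lam) for w in unpenalized]
l1 = [lasso_shrink(w, lam) for w in unpenalized]

print("L2:", l2)   # every weight shrinks, none becomes exactly zero
print("L1:", l1)   # small weights are set to exactly zero
```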
Decision Tree Classifier
- Good for non-linearly-separable data (if the data is still not separable, try an SVM with a non-linear kernel such as RBF)
- Prone to overfitting, which is obvious if you visualize the decision surface. A random forest helps with this
- High variance / low bias (use bagging to alleviate this; see random forest)
- Lower-dimensional data is better; high-dimensional data results in a deep tree that is computationally expensive
- Continuous numerical values are harder to handle, since each split needs a threshold
- Easy to visualize decision surface and view outcome probabilities (not a black box)
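For a numerical feature, a tree has to search for a threshold to split on. A minimal sketch of that search follows, assuming Gini impurity as the split criterion (the notes above do not name one); the function names and data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best binary split."""
    best = (None, float("inf"))
    points = sorted(set(values))
    # Candidate thresholds are midpoints between neighbouring values.
    for lo, hi in zip(points, points[1:]):
        t = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

threshold, impurity = best_split([1, 2, 3, 10, 11, 12],
                                 ["a", "a", "a", "b", "b", "b"])
print(threshold, impurity)
```

A full tree repeats this search on each side of the chosen split. Repeating it until every leaf is pure is how the overfitting noted above arises.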
Boosting relies on training several simple models (usually decision stumps) successively, each trying to learn from the errors of the models preceding it. Boosting differs from other algorithms in that it also gives weights to training examples (as opposed to linear models, which apply weights to features). So in essence we can weight the scarcer observations more heavily than the more populous ones. Boosting decreases bias and hardly affects variance (unless you are very sloppy). Depending on your n_estimators parameter, you are adding another inner loop to your training step, so the price of AdaBoost is an expensive jump in computation time and memory.
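A minimal AdaBoost sketch with decision stumps on a made-up 1-D data set, written from scratch rather than with any library. The helper names are hypothetical; `n_estimators` mirrors the parameter mentioned above. It shows the two ideas in the paragraph: per-example weights, and one extra training pass per estimator.

```python
import math

def stump_predict(x, threshold, polarity):
    """A decision stump: one threshold on one feature."""
    return polarity if x < threshold else -polarity

def fit_stump(xs, ys, weights):
    """Pick the stump with the lowest weighted error."""
    best = None
    thresholds = [min(xs) - 0.5] + [x + 0.5 for x in sorted(set(xs))]
    for t in thresholds:
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, t, polarity) != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best

def adaboost(xs, ys, n_estimators):
    n = len(xs)
    weights = [1.0 / n] * n                      # one weight per training example
    ensemble = []
    for _ in range(n_estimators):
        err, t, polarity = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # this stump's say in the vote
        ensemble.append((alpha, t, polarity))
        # Misclassified examples gain weight, correct ones lose weight.
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

def accuracy(ensemble, xs, ys):
    return sum(predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)

xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]          # no single stump fits this
one_stump = adaboost(xs, ys, n_estimators=1)
three_stumps = adaboost(xs, ys, n_estimators=3)
print(accuracy(one_stump, xs, ys), accuracy(three_stumps, xs, ys))
```

One stump alone is too rigid for this data (high bias). Three weighted stumps fit the training set, which illustrates the bias reduction.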
Bagging is short for Bootstrap Aggregation. It trains several versions of the same model on slightly different samples of the training data to reduce variance without any noticeable effect on bias. Bagging can be computationally intensive, especially in terms of memory.
The intuition behind bagging is that averaging a set of observations reduces the variance. Given $n$ independent observations $Z_1, \dots, Z_n$, each with variance $\sigma^2$, the variance of their mean $\bar{Z}$ is $\sigma^2/n$. Hence a natural way to reduce the variance, and so increase the predictive accuracy of a model, is to take many samples of the training set and average the resulting predictions.
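The $\sigma^2/n$ claim can be checked with a quick simulation. This is only a sketch: the values of `sigma`, `n`, and `trials` are arbitrary, and the draws are independent Gaussians, which is the assumption the formula needs.

```python
import random
import statistics

random.seed(0)
sigma, n, trials = 2.0, 25, 4000

# Each trial averages n independent draws with variance sigma**2.
means = [sum(random.gauss(0.0, sigma) for _ in range(n)) / n
         for _ in range(trials)]

observed = statistics.pvariance(means)   # empirical variance of the mean
expected = sigma ** 2 / n                # sigma^2 / n

print(f"observed={observed:.4f} expected={expected:.4f}")
```

Bootstrap samples drawn from one training set are not independent, so in practice bagging reduces variance by less than a factor of $n$.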
Example: Random Forest
Bagging vs Boosting
Random Forest is a bagging method, not a boosting method. In boosting, we allow many weak classifiers (high bias, low variance) to learn from their mistakes sequentially, with the aim of correcting their high bias while maintaining the low-variance property. In bagging, we use many overfitted classifiers (low bias, high variance) and bootstrap to reduce the variance.
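A minimal sketch of the bagging half of this comparison. The overfitted base model here is a 1-nearest-neighbour regressor rather than a decision tree, a substitution made only to keep the example short; the data and all names are made up.

```python
import math
import random

random.seed(1)

def make_data(n, noise=0.5):
    """Noisy samples of sin(x) as (x, y) pairs."""
    xs = [random.uniform(0, 2 * math.pi) for _ in range(n)]
    return [(x, math.sin(x) + random.gauss(0, noise)) for x in xs]

def one_nn(sample):
    """Overfitted base model: predict the label of the closest training point."""
    def predict(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return predict

def bagged(train, n_models):
    """Train one base model per bootstrap sample and average their outputs."""
    models = []
    for _ in range(n_models):
        bootstrap = random.choices(train, k=len(train))  # sample with replacement
        models.append(one_nn(bootstrap))
    return lambda x: sum(m(x) for m in models) / n_models

def mse(predict, data):
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

train, test = make_data(200), make_data(400)
single_mse = mse(one_nn(train), test)
bagged_mse = mse(bagged(train, n_models=25), test)
print(f"single={single_mse:.3f} bagged={bagged_mse:.3f}")
```

Each base model still memorizes its own bootstrap sample. Averaging them is what smooths out the noise.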
Random Forest Classifier
- Averages decision trees to lower variance
- fast to train on large datasets compared to SVM
Naive Bayes
Generative model. A common technique for text classification. Builds a probability distribution for each word in the corpus, conditional on the target classes.
- Good when the training set is small (high bias / low variance)
- Text preprocessing helps: stemming, pruning, etc.
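A minimal sketch of a per-class word distribution used for text classification, assuming a multinomial Naive Bayes model with add-one smoothing. The tiny corpus, labels, and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (text, label). Returns label counts, word counts, vocabulary."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for text, label in docs:
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab

def classify(text, label_counts, word_counts, vocab):
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)          # log prior
        for word in text.lower().split():
            if word in vocab:                          # ignore unseen words
                # Add-one (Laplace) smoothing keeps probabilities non-zero.
                p = (word_counts[label][word] + 1) / (total_words + len(vocab))
                score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

docs = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting schedule for monday", "ham"),
    ("project meeting notes", "ham"),
]
model = train_naive_bayes(docs)
print(classify("free money", *model))
print(classify("meeting notes monday", *model))
```

Splitting on whitespace stands in for real preprocessing. Stemming would merge word variants and sharpen the counts.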
Support Vector Machine (SVM)
- Picking/finding the right kernel can be a challenge
- Results/output are hard to interpret
- No standardized way of dealing with multi-class problems; fundamentally a binary classifier
- Can fit non-linear decision surfaces through the kernel trick
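A small sketch of what the kernel trick buys. The degree-2 polynomial kernel is used because its feature space can be written out by hand; the RBF kernel is included for comparison. Function names and inputs are made up.

```python
import math

def poly_kernel(x, z):
    """Degree-2 polynomial kernel on 2-D inputs: (x . z) ** 2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def feature_map(x):
    """Explicit 3-D feature space whose inner product equals poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def rbf_kernel(x, z, gamma=1.0):
    """RBF kernel; its feature space is infinite-dimensional, so only the
    kernel form can be evaluated directly."""
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

x, z = (1.0, 2.0), (3.0, -1.0)
implicit = poly_kernel(x, z)                    # never builds the features
explicit = dot(feature_map(x), feature_map(z))  # builds them, same answer
print(implicit, explicit, rbf_kernel(x, z))
```

A linear boundary in the mapped feature space is a curved boundary in the original inputs. The kernel gives that inner product without ever computing the mapping.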
The theory, in a nutshell, is that you blend multiple models together to reduce variance and make your predictions more stable. That’s how a random forest works: it is a combination of n_estimators decision tree models that use majority voting (in the case of Random Forest Classifier) or straight averaging (in the case of Random Forest Regressor). Random Forests are called Bagging Meta-Estimators. Bagging reduces variance by training each tree on a different bootstrap sample of the training data; a random forest adds further randomness to the selection of the best feature to use at a particular split. Bagging estimators work best when the decision trees are very deep. The randomness prevents the model from overfitting the training data.
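The two aggregation rules can be sketched in a few lines. The per-tree outputs below are made up, and the function names are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Classifier aggregation: the most common predicted label wins."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Regressor aggregation: plain mean of the individual predictions."""
    return sum(predictions) / len(predictions)

tree_labels = ["cat", "dog", "cat", "cat", "dog"]   # made-up outputs of 5 trees
tree_values = [2.0, 3.0, 4.0]                       # made-up outputs of 3 trees

print(majority_vote(tree_labels), average(tree_values))
```

Some implementations, scikit-learn's random forest classifier among them, average the trees' predicted class probabilities instead of counting hard votes, so results near a tie can differ from this sketch.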