- Works well if the data is linearly separable, even when features are correlated; low-variance algorithm that generalizes well
- Simple cost function
- adaptable for online learning (stochastic GD)
- useful output probabilities for each class
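The points above on online learning and class probabilities can be illustrated with a minimal sketch, assuming the model being described is logistic regression trained by stochastic gradient descent. The function names, learning rate, and toy data are hypothetical choices for illustration, not taken from the notes.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, x):
    # Probability that x belongs to class 1.
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def sgd_step(weights, bias, x, y, lr=0.1):
    # One online update from a single (x, y) example, using the log-loss gradient.
    error = predict_proba(weights, bias, x) - y
    new_weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    return new_weights, bias - lr * error

# Hypothetical toy data: class 1 when x0 + x1 > 1 (linearly separable).
rng = random.Random(0)
data = []
for _ in range(200):
    x = [rng.random(), rng.random()]
    data.append((x, 1 if x[0] + x[1] > 1 else 0))

weights, bias = [0.0, 0.0], 0.0
for epoch in range(50):
    for x, y in data:  # examples could equally arrive one at a time from a stream
        weights, bias = sgd_step(weights, bias, x, y)

p_high = predict_proba(weights, bias, [0.9, 0.9])  # well inside class 1
p_low = predict_proba(weights, bias, [0.1, 0.1])   # well inside class 0
print(p_high, p_low)
```

Because each update only needs one example, the same `sgd_step` could be called as new data arrives, which is what makes the model convenient for online learning.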
Regularization is the process of penalizing model complexity during training. There are two popular techniques: L1 Lasso or L2 Ridge:
- L1 attempts to remove uninformative features by pulling their weights to exactly 0. L1 is good for sparse datasets. In the end only a few features remain relevant.
- L2 shrinks all weights toward 0 and spreads them more evenly, which helps dense datasets avoid overfitting because noisy terms cannot dominate. In the end no small group of features is far more significant than the others.
Tradeoff: which do we value more, accuracy or low complexity?
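The different behaviour of the two penalties can be seen in a one-dimensional sketch. It assumes a simplified setting where each weight is penalized independently around its unregularized value; the weights and the penalty strength `lam` are hypothetical.

```python
def l1_shrink(w, lam):
    # Minimizer of 0.5*(v - w)**2 + lam*abs(v): soft-thresholding.
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

def l2_shrink(w, lam):
    # Minimizer of 0.5*(v - w)**2 + 0.5*lam*v**2: proportional shrinkage.
    return w / (1.0 + lam)

weights = [3.0, -0.2, 0.05, 1.5, -0.4]  # hypothetical unregularized weights
lam = 0.5

l1_weights = [l1_shrink(w, lam) for w in weights]
l2_weights = [l2_shrink(w, lam) for w in weights]

print(l1_weights)  # [2.5, 0.0, 0.0, 1.0, 0.0]
print(l2_weights)  # every weight is smaller, but none is exactly zero
```

L1 sets the three small weights to exactly zero (feature selection), while L2 scales every weight by the same factor and keeps all features in the model.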
Decision Tree Classifier
- Good for non-linearly-separable data (go to an SVM with a non-linear kernel such as RBF if the data is not separable)
- Prone to overfitting (high variance / low bias), which is obvious if you visualize the decision surface. Use bagging to alleviate this: see random forest
- Lower-dimensional data is better; high-dimensional data results in a deep tree that is computationally expensive
- Hard to run on numerical values
- Very Interpretable
- Easy to visualize decision surface and view outcome probabilities (not a black box)
- Tree Visualization
Generative model. Common technique for text classification. Build probability distribution for each word in the corpus relative to the targets.
- Good when the training set is small (high bias / low variance)
- Text preprocessing helps: Stemming, pruning etc
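A minimal sketch of the per-word probability idea, assuming the generative model described here is multinomial Naive Bayes with add-one (Laplace) smoothing. The function names and the tiny spam/ham corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    # docs: list of (list_of_words, label) pairs.
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)  # per-label word frequencies
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def predict_nb(model, words):
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for w in words:
            if w not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (word, label) pairs from zeroing the product.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy corpus.
docs = [
    ("win cash now".split(), "spam"),
    ("cheap cash prize".split(), "spam"),
    ("meeting at noon".split(), "ham"),
    ("lunch meeting tomorrow".split(), "ham"),
]
model = train_nb(docs)
print(predict_nb(model, "cash prize now".split()))    # spam
print(predict_nb(model, "meeting tomorrow".split()))  # ham
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.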
- Picking/finding the right kernel can be a challenge
- Results/output are incomprehensible
- No standardized way for dealing with multi-class problems; fundamentally a binary classifier
- Can fit non-linear decision surfaces through the kernel trick
In practice, it turns out that instead of choosing the best model out of a set of models, if we combine many variations, the results are better, often much better, and at little extra effort.
Creating such model ensembles is now standard. In the simplest technique, called bagging, we simply generate random variations of the training set by resampling, learn a classifier on each, and combine the results by voting. This works because it greatly reduces variance while only slightly increasing bias. In boosting, training examples have weights, and these are varied so that each new classifier focuses on the examples the previous ones tended to get wrong. In stacking, the outputs of individual classifiers become the inputs of a “higher-level” learner that figures out how best to combine them.
The theory behind bagging, in a nutshell, is that you blend multiple models together to reduce variance and make your predictions more stable. That is how a random forest works: it is a combination of n_estimators decision tree models that use majority voting (in the case of Random Forest Classifier) or straight averaging (in the case of Random Forest Regressor). Random Forests are called Bagging Meta-Estimators. On top of bootstrap sampling, a random forest reduces variance by introducing randomness into the selection of the best feature to use at a particular split. Bagging estimators work best when the decision trees are very deep. The randomness helps prevent overfitting the model on the training data.
In bagging, we use many overfitted classifiers (low bias but high variance) and do a bootstrap to reduce the variance.
Bagging is short for Bootstrap Aggregation. It uses several versions of the same model trained on slightly different samples of the training data to reduce variance without any noticeable effect on bias. Bagging can be computationally intensive, especially in terms of memory.
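The bootstrap-and-vote procedure can be sketched as follows. This is an illustration under stated assumptions: the base learner is a one-dimensional decision stump chosen for brevity (in practice bagging is usually applied to deeper, higher-variance trees), and the function names and toy data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(points):
    # points: list of (x, label) with label in {0, 1}.
    # Pick the threshold and orientation with the fewest training errors.
    best = None
    for threshold in sorted({x for x, _ in points}):
        for above in (0, 1):
            errors = sum((above if x >= threshold else 1 - above) != y
                         for x, y in points)
            if best is None or errors < best[0]:
                best = (errors, threshold, above)
    _, threshold, above = best
    return lambda x: above if x >= threshold else 1 - above

def bagging_fit(points, n_estimators, rng):
    models = []
    for _ in range(n_estimators):
        # Bootstrap: draw len(points) examples with replacement.
        sample = [rng.choice(points) for _ in points]
        models.append(fit_stump(sample))
    return models

def bagging_predict(models, x):
    # Majority vote over the individual models.
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: label is 1 when x >= 0.5, with every tenth label flipped as noise.
points = []
for i in range(100):
    x = i / 100
    y = 1 if x >= 0.5 else 0
    if i % 10 == 3:
        y = 1 - y
    points.append((x, y))

models = bagging_fit(points, n_estimators=25, rng=random.Random(0))
print(bagging_predict(models, 0.9), bagging_predict(models, 0.1))
```

An odd `n_estimators` avoids tied votes in the binary case. For regression, the vote would be replaced by an average of the individual predictions.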
The intuition behind bagging is that averaging a set of observations reduces the variance. Given $n$ independent observations $Z_1, \dots, Z_n$, each with variance $\sigma^2$, the variance of their mean $\bar{Z}$ is $\sigma^2/n$. Hence a natural way to reduce the variance, and so increase the prediction accuracy of a model, is to take many samples of the training set and average the resulting predictions.
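The $\sigma^2/n$ claim can be checked empirically with a short simulation. This sketch assumes independent Gaussian observations; the values of `sigma`, `n`, and `trials` are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(42)
sigma = 2.0     # standard deviation of each observation
n = 25          # observations averaged per trial
trials = 20000  # number of independent means to collect

# Each entry is the mean of n independent draws with variance sigma**2.
means = [
    statistics.fmean(rng.gauss(0.0, sigma) for _ in range(n))
    for _ in range(trials)
]

empirical = statistics.pvariance(means)
theoretical = sigma ** 2 / n  # 0.16

print(empirical, theoretical)
```

The two numbers should agree closely. Note that bootstrap samples drawn from the same training set are not independent, so in practice bagging reduces variance by less than the full factor of $n$.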
Examples: Random Forest
Random Forest Classifier
- Averages decision trees to lower variance
- fast to train on large datasets compared to SVM
In boosting, we allow many weak classifiers (high bias with low variance) to learn from their mistakes sequentially with the aim that they can correct their high bias problem while maintaining the low-variance property.
Examples: AdaBoost, GradientBoost
Boosting relies on training several simple models (usually decision stumps) successively, each trying to learn from the errors of the models preceding it. Boosting differs from other algorithms in that it also gives weights to training examples (as opposed to linear models, which apply weights to features). So in essence we can weight the scarcer observations more heavily than the more populous ones. Boosting decreases bias and hardly affects variance (unless you are very sloppy). Depending on your n_estimators parameter, you are adding another inner loop to your training step, so the price of AdaBoost is an expensive jump in computational time and memory.
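The reweighting loop can be sketched with a small AdaBoost-style implementation over one-dimensional decision stumps. This is a minimal illustration, not a reference implementation: the function names and toy data are hypothetical, and labels are assumed to be -1 or +1.

```python
import math

def stump_predict(x, threshold, sign):
    # Predict `sign` at or above the threshold, `-sign` below it.
    return sign if x >= threshold else -sign

def fit_weighted_stump(xs, ys, weights):
    # Find the threshold/sign pair with the lowest weighted error.
    best = None
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, sign) != y)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    return best

def adaboost_fit(xs, ys, n_estimators):
    n = len(xs)
    weights = [1.0 / n] * n  # start with uniform example weights
    ensemble = []
    for _ in range(n_estimators):
        err, threshold, sign = fit_weighted_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # this stump's vote strength
        ensemble.append((alpha, threshold, sign))
        # Up-weight misclassified examples, down-weight correct ones, renormalize.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, sign))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(alpha * stump_predict(x, threshold, sign)
                for alpha, threshold, sign in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data: positives sit in the middle, so no single stump fits it.
xs = list(range(10))
ys = [-1, -1, -1, 1, 1, 1, 1, -1, -1, -1]

single = adaboost_fit(xs, ys, n_estimators=1)
boosted = adaboost_fit(xs, ys, n_estimators=3)
single_acc = sum(adaboost_predict(single, x) == y for x, y in zip(xs, ys)) / len(xs)
boosted_acc = sum(adaboost_predict(boosted, x) == y for x, y in zip(xs, ys)) / len(xs)
print(single_acc, boosted_acc)  # 0.7 1.0
```

A single stump misclassifies three of the ten points, while three reweighted stumps combined fit the training set exactly, which illustrates how boosting reduces bias. The loop over `n_estimators` is the extra sequential cost mentioned above.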
Bagging vs Boosting
Random Forest uses bagging rather than boosting. In boosting, we allow many weak classifiers (high bias with low variance) to learn from their mistakes sequentially with the aim that they can correct their high bias problem while maintaining the low-variance property. In bagging, we use many overfitted classifiers (low bias but high variance) and do a bootstrap to reduce the variance.