Machine Learning Glossary
Information Theory
Shannon Entropy
Information entropy is the average rate at which information is produced by a stochastic source of data.
Where $p_i$ is the probability of class $i$. In the binary classification case, we can rollout the summation to:
Where p and q are the respective categories.
 What is the entropy of a homogenous set?
 $1 \log_21  0\log_20 = 0$
 What is the entropy of a set w/ a 50/50 mix?
 $(.5\log_2.5)  (.5\log_2.5) = 1$
Figure: Distribution of entropy for binary random variable (coin). For a fair coin p=.5, there is maximum entropy.
Graph Search
 DFS: Traverse down branches in lexical order
 Hill: Traverse down branches in using heuristic. Heuristic is shortest path to goal.
 BFS: Exhaustively traverse across branches
 Beam: Heuristic to traverse only best $k$ branches
Recursive DFS tree traversal
def traverse(node, visited):
visited.add(node)
for node in node.children  visited:
traverse(node, visited)
return #leaf
traverse(root, set())
Recursive BFS tree traversal:
q = []
def traverse(node, visited):
while q:
node = q.pop()
visited.add(node)
for node in node.children  visited:
queue.append(node)
traverse(root, set())
Linear Algebra
Norm
Measurement of the magnitude of a vector. Typically used for quantifying model parameter complexity, where complexity is viewed as either vector magnitude or sparseness.
 L0
 L1 (Manhattan): $\norm{x}_1 = \sum_{i=0} \abs{x_i}$
 L2 (Euclidian): $\norm{x}_2 = \sum_{i=0} \sqrt{x_i^2} = \sqrt{x\cdot x…}$
Dot Product
Cosine: normalized dot product (angle between two vectors)
Distance
Distance measure between two vectors $A$ and $B$
 Euclidian: $norm(AB)$ which is to say the size of the difference of the vectors
 Dot Product: $a\cdot b = \norm{A}\norm{B} cos(\theta)$
 Cosine: Normalized dot product : $\theta = 1 \frac{a\cdot b}{\norm{A}\norm{B}}$
Statistics
 Central Limit Theorem: Consider the sum of multiple random variables. The sum of a set of random variables, which is of course itself a random variable, has a distribution that become increasingly Gaussian as the number of terms in the sum increases.
 I think this is useful in hypothesis testing if you run a bunch of random experiments, you can assume the distribution of the results is Normal/Gaussian and then do a ttest.
 Do the hypothesis testing exercise on data quest which I always thought was nice!
 Variance: how one variable varies with itself: $\frac{1}{n}\sum_{i=0}^x (x_i  \hat{x})^2$
 Covariance: how one variable varies w/ others: $\Sigma$
 Correlation: Measure of the linear relationship between two variables. Takes value between +1 and −1 inclusive: $Corr(X,Y) = \frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}}$
Probabilities
 Conjunctive: probability of
Event1
ANDevent2
; multiply probabilities  Disjunctive: probability of
Event1
ORevent2
; add probabilities  Conditional: probability of
A
givenB
$P(A\given B)$ : $\frac{P(A \cap B)}{P(B)}$  Bayes Rule: $P(A\given B) = \frac{P(B \given A) P (A)}{P(B)}$
Bayes Rule
Example: There is a cancer test that is 75% accurate (TP Rate/sensitivity):
The prior probably of having cancer in general is $P(Y=1)=.01$
The probability of a positive test $P(X=1)$:
What is the posterior probably $P(Y\given X)$ that you have cancer if you have a positive on the test?
Expectation Notation
For an unfair or weighted coin, the two outcomes are not equally likely. You can change the weight or distribution of the coin by dragging the true probability bars (on the right in blue) up or down. If we assign numbers to the outcomes — say, 1 for heads, 0 for tails — then we have created the mathematical object known as a random variable.
The expectation of a random variable is a number that attempts to capture the center of that random variable’s distribution. It can be interpreted as the longrun average of many independent samples from the given distribution. More precisely, it is defined as the probabilityweighted sum of all possible values in the random variable’s support.
Which equals 3.5 for a dice: $1/6 + 21/6… +61/6$
Variance: Whereas expectation provides a measure of centrality, the variance of a random variable quantifies the spread of that random variable’s distribution. The variance is the average value of the squared difference between the random variable and its expectation,
Distributions
 Skew: measures clustering of distribution. If you have a right skew it’s common to normalize the data (log).
 Kurtosis: measures the shape of distribution
 Modality: measures tendency of distribution
 Mean: measures central tendency but is sensitive to outliers (skew)
 Median: measures central tendency and is less sensitive to outliers (skew)
 Mode: most common number in dist
 Variance, $σ^2$: measures how far the average data point is from the mean: $\sum_i\frac{(\mu  x_i)^2}{\mu}$
 Standard deviation, σ: common way to refer to the distance between data points and the mean.
 Stationary: Mean and variance are constant. Typically seen In timeseries data.
 Normalization: Makes the data stationary meaning for random windows the mean and variance are constant
 Bernoulli Trail: Fail or success
 KL Divergence: Test similarity of two probability distributions. Minimizing the KL loss is generally used in MLE.

Confidence Interval

Normal: We can generate a normal distribution by using a probability density function.
 Parameterized by $\mu$ and $\sigma$

Binomial: sequential binary action: rain/dry, basketball shot/miss, market up/down
 Parameterized by: $n$ and $p$ number of events and prob of success respectively

Unimodal: One mode (normal)

Multimodal: More than 1 mode

Poisson: Number of calls received per hour by call center (when you want a normal w/o negative values…)

Parameterized by $\lambda$


Exponential: a skewed distribution that can model user time on a page in seconds

Parameterized by $\lambda$

Questions
 What is the expected value of flips to get heads on a fair coin?
 $E(x) \sim \text{Geometric}$, Expected value = $\frac{1}{p}$ = , i.e. is 1/.5 = 2
 Given a random Bernoulli trial generator, write a function to return a value sampled from a normal distribution.
 The trick is to sample from the normal CDF which samples from the Normal PDF
Experiment Design
 Null Hypothesis: Status Quo, random chance
 Alternative Hypothesis: statistically significant difference between the outcomes of the 2 groups.
 Onesided: null < x, alt >=x; Twosided (true/false): null=0, alt!=0
 Blind Experimentation:
 Control Group & Treatment Group
 Test Statistic: quantifies the data in the groups
 Statistical Test: Use Test Statistic to ran experiments
 Pvalue: probability quantifying random chance. The lower the better.
 Sample Size: refers to how large the sample for your experiment is. The larger your sample size, the more confident you can be in the result of the experiment (assuming that it is a randomized sample).
 $n = 16 \frac{\sigma^2}{\delta^2}$
 Effect Size/Minimum Detectible Effect $\delta$: The Minimum Detectable Effect is the smallest effect that will be detected $power\%$ of the time.
 refers to the size of the difference in results between the two sample sets and indicates practical significance. If there is a small effect size (say a 0.1% increase in conversion rate) you will need a very large sample size to determine whether that difference is significant or just due to chance. However, if you observe a very large effect on your numbers, you will be able to validate it with a smaller sample size to a higher degree of confidence. %5 is a common number.
 Confidence Interval: product of sample size and effect size. $\mu +Z\frac{\sigma}{\sqrt(n)}$
 ChaiSquared: Does the rate of success differ across two groups?
 Significance: Statistical significance is the likelihood that the difference in conversion rates between a given variation and the baseline is not due to random chance. ie if you run an A/B testing experiment with a significance level of 95%, this means that if you determine a winner, you can be 95% confident that the observed results are real and not an error caused by randomness.
 Significance Level $\alpha$: usually
0.05
 Type I error/Confidence Level (1⍺) — Chance of accepting a difference when there is not. 90%, 95%, 99%(a=.01)
 Type2 error $\beta$  Chance of rejecting an actual lift. β is inversely related to sample size.
 Statistical Power — (1β) is the power or sensitivity of experiment
 Product of Sample Size and Effect Size $\delta$
 Causal Inference???
 Confounding Factors: Causation is not correlation. unknown variables that might be influencing the test. For example, my conclusion is that Shoe Size is correlated w/ Literacy. However, the confounding factor there is age, which is actually correlated w/ literacy.
Figure: A normal/t distribution for the null hypothesis where $\mu=0$ w/ 2 tails highlighted being the regions the sample $\mu$ must lie for the p value to be below 0.05
. You can see the sample mean lies between, 2 and 1 and the goal of the test is to see if this is “weird” or not. It is weird if it is in the tail.
Statistical Tests
 Covariance: relationship between the variances of two variables
 Correlation: two variables change together
 Pearson: rvalue (continuous variables)
 Spearman: ordinal variables
 $\chi^2$ Test: Similarity of two discrete categorical distributions (usually frequency distributions)
 Comparison of Means

Ttest: Tests for the difference between two related variables (normality assumption)

scipy.stats.ttest_ind
https://docs.scipy.org/doc/scipy0.15.1/reference/generated/scipy.stats.ttest_ind.html#scipystatsttestind 
#test w/ idential means >>> rvs1 = stats.norm.rvs(loc=5,scale=10,size=500) >>> rvs2 = stats.norm.rvs(loc=5,scale=10,size=500) >>> stats.ttest_ind(rvs1,rvs2) (0.26833823296239279, 0.78849443369564776) #accept null #Ttest with different means, variance, and n: >>> rvs5 = stats.norm.rvs(loc=8, scale=20, size=100) >>> stats.ttest_ind(rvs1, rvs5) (1.4679669854490653, 0.14263895620529152) #reject null


Welschs ttest: (improvement over classical ttests) comparison of two means (assumes normality)

Ztest: when we know the population $u$ and $\sigma$

Questions
 In the experiment design portion you are asked to create a hypothetical plan to answer testing a new app.
 Identify Goals
 Identify Invariant and evaluation metrics
 Create Hypothesis
 Choose the test statistic
 Choose the statistical test (two sample tests): ttests, $X^2$ test, etc
 Choose significance level ($\alpha$), statistical power (1$\beta$) and Effect Size $\delta$
 Significance Level is usually $\alpha=0.05$
 Power: How many events do we need in test/control to get a statistically significant result. Usually $1\beta = 0.8$
 Choose Unit of Diversion & Sample size
 How will you split the users among tests? Page view, cookie or user id? Also consider what population will get a test, how to segment your users?
 Run Experiment
 Analyze results for significance
 Check invariants
 Check for significance in result
 Identify Goals

explain pvalue
 Probability that the null hypothesis is true. Typical cutoff is
.05
5% so anything under that we reject the null hypothesis, meaning that the effect is not due to randomness.
 Probability that the null hypothesis is true. Typical cutoff is

In an AB test, how to measure regression?
 If the test statistic is lower than the control baseline
Feedback
 To decide which of two models is better, proposed an experiment design where some metrics (like latency) stay invariant and some change (like CTR)
 Explained the intuition between confidence bounds and number of impressions (the more impressions, the more confident we are)
 Could not describe any statistical test or how to estimate confidence bounds
 Get sample mean.
 Get Z score from table (90, 95 or 99%)
 Get standard error = $\frac{\sigma}{\sqrt(n)}$
Estimation
MLE Maximum Likelihood Estimation
Method to fit data to a distribution. Goodfellow (2016) gushes over it.
When you try MLE on a gaussian you get MSE. When you try MLE on a Binomial you get binary classification. This means that implicitly when you use MSE, you are doing regression (estimation) and when you use CE, you are doing classification.
MLE is a special case of Maximum Aposteriori Estimation (MAP) where the prior is uniform.
Normal
given some measurements (ie weight of rabbits), get the normal distribution likelihood function which is parameterized by $\mu$ and $\sigma$ and then maximize the likelihood function to find the estimates for the parameters that fit the data.
MLE works down to:
Binomial
Given some bernuli samples (ie free throws ) try to fit the binomial distribution, which is paramteramized by $p$, to the data. Get the PMF of the Bernoulli Distribution: $f(x_i \given p) = p^x (1p)^{1x}$ .
The Likelihood function $L(p\given x)$ is the product of each data point given the parameter:
Take the log for connivence down the road:
And you have recovered binary cross entropy ^^^^
Take derivative w/ respect to $p$ and solve for 0.
Poisson
Random sample of the number of textbooks that a person buys for a semester at uni.
Poisson Dist is parameterized by $\lambda$ . PDF:
Now compose Likelihood function:
Log and take derivative at 0 to get the MLE estimator of $\hat\lambda_{MLE} =\frac{ \sum x_i}{ n}$
MAP Estimation
MLE is a special case of maximum a posteriori estimation (MAP) that assumes a uniform prior distribution of the parameters.
Includes Prior knowledge!
Bayes Rule:
Where $P(\Theta)$ can be a uniform distribution and thus ignored and $P(D)$ is considered a constant and thus ignored.
Or equivalently:
Where the first part is the posterior and the second part is the likelihood prior.
 When is MAP same as MLE?
 When there is no prior or when the prior is uniform.
Expectation Maximization
Lambda Loss Paper uses this….
A way to calculate MLE/MAP
EM is usually useful when it’s hard to calculate the gradient of the loglikelihood. In cases I’m familiar with, this is due to the presence of latent variables per datapoint. You see it in Gaussian mixture models, hidden Markov models, and other problems. In order to avoid calculating the gradient, you iteratively:
 take the expected complete data loglikelihood
 maximize it
This is guaranteed to converge to a local optimum.
Time Series Analysis
 Auto Correlation: where outputs depend on past inputs. Statistical independence of errors. Also independence between observations. Use ACF plot to quantify.
 Running correlation analysis on the same variable (not two). So $(y_i, y_{i1}), (y_{I1}, y_{I2})$
 How strongly are todays temparutes related to yesterdays?
 Running correlation analysis on the same variable (not two). So $(y_i, y_{i1}), (y_{I1}, y_{I2})$
 Auto Regressive: current value is based on the immediately preceding value. Therefore an AR model assumes some degree of auto correlation. Like Linear regression assumptions require the absense of AC.
ARIMA: Auto Regressive …
Autocorrelation (denoted by an ACF plot) is the actual correlation between a data point and its lag (some period prior). When this condition exists and may call for more advanced modeling than simple Linear Regression.
Autoregression is a term used to describe the process/model (aka Autoregressive model) when one value may be affected by prior values.
AR models, meaning regression models that use the past as input features, are no different from regression with timeunrelated features. What we are doing is just using previous yvalues as input for a given point. We can manually construct such features directly.
When we add a 2lag feature (meaning grosses from two weeks ago & last week in our case), the model is called AR2 (AR for AutoRegressive). When we use up to three, it is called AR3 and so on. Let’s do an AR2.
ML
 Auto Correlation: where outputs depend on past inputs.
Class Imbalance
Good to look at recall
 Down/Up Sample: balance class weights, however, you throw away a lot of data
 Weighted Loss: Weight the loss function by the inverse class frequency count
 Robust Modeling: Use a model that isn’t sensitive to class imbalance:
 Treat as anomaly detection problem: Isolation Forrest
 Boosting: sequential models will adapt to imbalance
Feature Engineering
BOW (Vocabularies)
When you have a variable list of features (like words in a sentence, or number of products a user liked), then a common pattern to is transform this into a fixed length feature vector across the whole vocabulary $V$.
Buckets (Continuous Features)
Continuous features don’t need extra treatment, however, signal can be gained by knowing their properties. These tricks can help you stay on a linear model :)
Looking at the distribution of features, we can define relevant buckets for continuous variables, such as counters or timestamps. For instance, consider the number of products a user has seen. Seeing four products definitely reflects more engagement than seeing only one product. However, seeing a hundred and four products is not that different from seeing a hundred products, hence the need to bucketize continuous variables.
Interesting case study from scikit: https://scikitlearn.org/stable/auto_examples/preprocessing/plot_discretization.html#sphxglrautoexamplespreprocessingplotdiscretizationpy
In 1D the linear regression can’t fit the data, but in 10 dimensions it can! This is because the inflexible linear model can’t fit the nonlinear curve in 1d. However, if we let it fit in multiple dimensions, we can gain some expressiveness.
For example, consider raw data that represents the year a house was built. Instead of representing that year as a scalar numeric column, we could split the year into the following four buckets:
Date Range  Represented as… 

< 1960  [1, 0, 0, 0] 
>= 1960 but < 1980  [0, 1, 0, 0] 
>= 1980 but < 2000  [0, 0, 1, 0] 
>= 2000  [0, 0, 0, 1] 
Why would you want to split a number—a perfectly valid input to your model—into a categorical value? Well, notice that the categorization splits a single input number into a fourelement vector. Therefore, the model now can learn four individual weights rather than just one; four weights creates a richer model than one weight. More importantly, bucketizing enables the model to clearly distinguish between different year categories since only one of the elements is set (1) and the other three elements are cleared (0). For example, when we just use a single number (a year) as input, a linear model can only learn a linear relationship. So, bucketing provides the model with additional flexibility that the model can use to learn.
Feature Crosses
Moreover, some variables make more sense when crossing them together to analyze the cooccurrence effect, such as the number of banner a user has seen on a particular device. The “fatigue” effect is heavily dependent on the screen used, and consequently on the device type the user is on. See factorization machines.
Embeddings
Learnt dense representation of sparse (categorical) data.
Anytime you have a categorical feature, you can probably learn an embedding
KMeans Clustering
Clustering method which tries to minimize the incluster SSE. Parameterized by $K$ for the number of clusters/centroids. Very sensitive to the initial location of the centroids.
 Initiation of centroids (random or heuristic – very important)
 Assignment of data points to 1NN centroids
 Update centroids location (center of cluster/mean of cluster points)
 Repeat #2 and terminate upon convergence
Figure: KMeans run for 4 iterations
Linear Regression
R2
Where SEE is the mean squared loss of the optimized model and SST is the MSE loss of the average of the data. 0 is worst, 1 is best.
You cannot blindly trust the R2 score or other summary statisitcs as a measure of fit goodness. Particularly, you cannot ignore the assumptions of a regression, which are:
(Most of these assumptions are in place to ensure the validity of the coefficient weights. For example if you ignored these assumptions and took some meaning from the coefficients then you will be misled.)

Linear relationship between predictor and response and no multicollinearity
 No autocorrelation (statistical independence of the errors)
 This is b/c if your observations are autocorrelationed, then there is multicollinearity which violates the first assumptions.
 Bias: Your “best fit line” will likely be way off because it will be pulled away from the “true line” by the effect of the lagged errors.
 Inconsistency: Given the above, your sample estimators are unlikely to converge to the population parameters.
 Inefficiency: While it is theoretically possible, your residuals are unlikely to be homoskedastic if they are autocorrelated. Thus, your confidence intervals and you hypothesi tests will be unreliable.
 Normality and Homoscedasticity (stationary/constant variance) of the residual (error) distribution
Least Squares
Closed form/analytical solution to the MSE/squared residuals/SSE error/loss function of linear regression.
Problems:
 Requires all data to do a matrix inversion: not computationally tractable in some cases
 The the MSE loss is sensitive to outliers
 (Classification) Assumes a Gaussian conditional distribution, whereas binary target vectors clearly have a distribution that is far from Gaussian
Basis Function/Kernel
I think these are called kernels in Scikit, not sure the diff…
transforms data points. Allows a linear combination in $w$ to learn nonlinear relationships.
Consider the simple linear regression:
Now consider the basis form:
Where $\phi: R^d \rarr R^m$ (this could be nonlinear)
Then we can view the above mentioned simple liner regression just a selection case where the basis function is the identify function $\phi(x)=x$
Example Basis functions:
 Polynomial (degree 2): $\phi(x) = (1, x(1), x(2), x(1)^2 x(2)^2, x(1)x(2))$ for $X \in R^2 $
 Sigmoid:
 Radial: Tranform to polar coordinates. Popular for SVM
 Fourier: DSP
 Wavelet: useful in image processing. Combination of Fourier and RBF.
For example, you cannot learn the XOR function using a linear model w/ an identity basis function. However, you can w/ a polynomial basis function, b/c the $x_1 * x_2$ term is linearly separable in $w$.
Great example here in the scikit docs: https://scikitlearn.org/stable/modules/linear_model.html#polynomialregressionextendinglinearmodelswithbasisfunctions
Figure: Illustration of the mapping φ. On the left a set of samples in the input space, on the right the same samples in the feature space where the polynomial kernel K(x,y) (for some values of the parameters c and d) is the inner product. The hyperplane learned in feature space by an SVM is an ellipse in the input space.
Classification
Discriminate Function: Assigns sample point to a class.
 Binary Classification: $2$ classes
 Multiclass Classification: $K$ classes
 Onevsall: train $K1$ binary classifiers. However, you will see this leads to ambitious decision surfaces
 Onevsone: train $K(K1)/2$ combinations of binary classifiers for $K$. Still has ambitious dec. surfaces
 Multinomial: (this is the obvious choice I would code up before I even know the two other options existed). train 1 classifier w/ matrix of weights for all classes. The output is the dot product of the input $x$ w/ each set of weights for the respective class. So if you have 6 classes, you will get 6 scaler dot product values and the
argmax
corresponds to the winning class. What’s the difference with this and softmax?
Logistic Regression
Where $\sigma$ is an activation function. You can also consider x as a basis function $\phi(x)$ as in Regressions.
The regression coefficients are usually estimated using maximum likelihood estimation.[26] Unlike linear regression with normally distributed residuals, it is not possible to find a closedform expression for the coefficient values that maximize the likelihood function, so that an iterative process must be used instead; for example Newton’s method. This process begins with a tentative solution, revises it slightly to see if it can be improved, and repeats this revision until no more improvement is made, at which point the process is said to have converged.[26]
Binary Cross Entropy Loss
Binary xent can be recovered by maximizing the likelihood of a binomial distribution.
Regularization
Ocam’s Razor: Among competing hypothesis, the simplest is the best. In parametric models, simplicity can be considered either:
 Less parameters: L1 Lasso
 Smaller parameters: L2 Ridge
L2 regularization assumes a prior belief that weights should be small and normally distributed around zero. Interesting!
This is an interesting visualization from ISL which shows a red unregularized convex loss function in two dimensions w/ parameters $\beta_1$ and $\beta_2$, where the optimal value is $\hat{\beta}$. It also shows the two constrained regions of the L1 and L2 norm, which are centered around zero and small in magnitude. Thus we can see the optimal regularized optimum is not $\hat{\beta}$ but rather the intersection of the two regions. Notice that this makes both losses have smaller magnitude and in the L1 case makes the solution space sparse as $\beta_1=0$!
So in other words, to summarize, the absolute value’s going to be a constraint region that has sharp corners. In high dimensions, you have edges and corners. And along an edge or a corner, if you hit there, you get a 0. So this is, geometrically, why you get sparsity in the Lasso.
 Weight decay (above)
 Bagging (Dropout)
 Early Stopping
Ensembling
There is no universal model that preforms best on all data. Each model has tradeoffs, advantages/disadvantages against others (see no free lunch principle). In practice, it turns out that instead choosing the best model out of a set of models, if we combine many various, the results are better — often much better — and at little extra effort.
Creating such model ensembles is now standard. In the simplest technique, called
 bagging, we simply generate radom variates of the training set by revamping, learn an overfit classifier on each, and combine the results by voting. This works because it greatly reduces variance while only slightly increasing bias.
 boosting, training examples have weights, and thse are varied so that each new weak classifier can focus on the examples the previous ones tended to get wrong.
 stacking, the outputs of individual classifiers come the inputs of a “higherlevel” larger that figure out how best to combine them.
Tree Models (non parametric)
Decision Tree
Recursively Splits the decision space. Tries to partition the data in such a way to make PURE leafs – homogeneous groups.
 Good for nonlinearlyseperable data, usually next choice after linear regression
 Prone to overfitting (high variance / low bias): Use bagging to alleviate: see random forrest
 Interpretable: visualize decision surface.
 lowerdimensional data is better / high dimensional will result in deep tree (read overfitted)
Entropy
How do you measure purity/disorder? Entropy:
Where $p_i$ is the probability of class $i$. In the binary classification case, we can rollout the summation to:
Where p and q are the respective categories.
 What is the entropy of a homogenous set?
 $1 \log_21 = 0$
 What is the entropy of a set w/ a 50/50 mix?
 $(.5\log_2.5)  (.5\log_2.5) = 1$
X  Y  Class 

Green Circle  
Ping Cross  
Green Circle 
Figure: 16 greens and 14 ping = 30 total
What is the potential entropy in the distribution above?
Very high entropy. This is a good column feature/column to split on.
Information Gain
Information Gain is the Average Entropy of the children subtracted from the the entropy of the Parent
Algorithm
 For the given branch (root if first iteration)
 Find (exhaustively) feature(column) w/ most potential information gain $f_i$
 For each column, For each value, calculate info gain, and return largest.
 Split the branch among the target classes with respect to $f_i$
 For each split (2): Repeat Step #1, unless leaf!
 Find (exhaustively) feature(column) w/ most potential information gain $f_i$
Pruning
Basic Dtrees massively overfit, so pruning helps simplify the tree in order to generalize better
Bagging
 Bootstrap Sampling: Random sample with replacement in order to lower variance (overfitting)
 RF: Averages the results of dtrees overfit on bootstrap samples.
The theory bagging, in a nutshell, is that you have multiple models that you blend together to reduce variance and make your predictions more stable. That’s how a random forest works, it is a combination of n_estimators
decision tree models that use majority voting (in the case of Random Forest Classifier) or straight averaging (in the case of Random Forest Regressor). Random Forests are called Bagging MetaEstimators. Bagging reduces variance by introducing randomness into selection of the best feature to use at a particular split. Bagging estimators work best with when you have very deep levels on the decision trees. The randomness prevents overfitting the model on the training data.
In bagging, we use many overfitted classifiers (low bias but high variance) and do a bootstrap to reduce the variance.
Bagging is short for Bootstrap Aggregation. It uses several versions of the same model trained on slightly different samples of the training data to reduce variance without any noticeable effect on bias. Bagging could be computationally intensive esp. in terms of memory.
The intuition behind bagging is that averaging a set of observations reduces the variance. Given $Z_n$ observations w/ variance $\sigma^2$, the variance of the mean $Z$ is given by $\frac{\sigma^2}{n}$. Hence a natural way to reduce the variance and hence increasing the predicting accuracy of a model is to take many samples of the training set and average the resulting predictions.
Examples: Random Forrest
Boosting
 Lowers bias while maintaining variance.
In boosting, we allow many weak classifiers (high bias with low variance) to learn form their mistakes sequentially with the aim that they can correct their high bias problem while maintaining the lowvariance property.
Examples: AdaBoost, GradientBoost
AdaBoost
Boosting relies on training several (simple, usually decision stumps) models successively each trying to learn from the errors of the models preceding it. Boosting differes from other algorithms in that in addition to it gives weights to training examples (as opposed to linear models which apply weights on features). So in essence we can weight the scarcer observations more heavily than the more populous ones. Boosting decreases bias and hardly affects variance (unless you are very sloppy). Depending on your n_estimators
paramenter you are adding another innerloop to your training step, so the the price of AdaBoost is an exensive jump in computational time and memory.
Kernels
Making data linearly separable by transforming it
https://www.quora.com/Whatis“kernelmethods”
SVM
Maximum Margin Classifier. Tries to find 2 points, supports, which are the furthest away from the decision boundary. Trying to maximize the width of the street. Mathematical guarantee to find global maxima.
Bias/Variance Tradeoff
If you model is overfitting, then it is highvariance: learning noise. If it is underfitting, it is too biased and not capturing the signal.
Often this is quantified by looking at the Learning Curves:
How to remove variance (overfitting):
 Get more data
 Reduce Parameters/Regularization
 Bagging: Use less flexible model (eg linear)
 Early Stopping
 Also see data augmentation and batch norm
How to remove bias (underfitting):

Get more data

Boosting: More flexible model/parameters
Evaluation
 Precision: ability of the classifier not to label as positive a sample that is negative $\frac{TP}{ TP + FP}$
 Recall: ability of the classifier to find all the positive samples. (Min false negatives) $\frac{TP}{TP+FN}$
 PrecisionRecall is a useful measure of success of prediction when the classes are very imbalanced.
 PR Curve: P&R at different thresholds cutoff thresholds in the binary classifier. The precisionrecall curve shows the tradeoff between precision and recall for different threshold. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate. High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall). precisionrecall curves are appropriate for imbalanced datasets
 High Recall (Optimistic System): A system with high recall but low precision returns many results, but most of its predicted labels are incorrect when compared to the training labels
 High Precision (Pessimistic System): A system with high precision but low recall is just the opposite, returning very few results, but most of its predicted labels are correct when compared to the training labels.
ROC
ROC curves are appropriate when the observations are balanced between each class
AUC
An evaluation metric that considers all possible classification thresholds.
The Area Under the ROC curve is the probability that a classifier will be more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive
Questions
 Linear Regression Assumptions:
 Linearity
 Constant variance of residual (stationarity)
 Normal distribution of variance with mean = 0
 What is stationary signal?:
 one which has a constant mean and variance
 Write out a function to calculate the AUC of an ROC curve
 The integral of the curve
Optimization
0th Order Black Box
Only $f(x)$ can be evaluated
1st Order Gradient Based
$\nabla{f(x)}$ Can be evaluated
Vanilla GD has fixed shared learning rates across dimensions. Later innovations brought momentum. The latest innovations are in adaptive and individual learning rates across dimensions.
 Learning Rate: scales the gradient. Therefore the higher the learning rate, the more confidence you have in the quality/accuracy of your gradient. The higher the learning rate the faster you can possibly converge.
 Batch Size: The was an interesting analogy I heard about getting advice from a drunk person for directions. If he was very drunk you would only take little confidence in his directions, you’d go and ask the next person. However, if it was a sober person, you would have higher confidence in their direction and put more weight on it. The size of your batch is like how drunk the person is. A large batch is a sober person and small batch is a drunk person: $\abs{B}\propto \lambda$ . However, as your batches get smaller, the gradients have a lot more noise, which some may say will help not getting stuck in local minima.
Figure: chart from fastAI showing Loss correlation with LR and Batch size. Largest batch is blue which shows the smoothest curve and highest (optimal) learning rate. Purple is smallest batch size and you can see it has the lowest optimal learning rate.
(Vanilla) Gradient Descent is very sensitive to step size. In illconditioned setting, it’s very useful to have an adaptive step size: in regions of large variability of the gradient we would like to take small step sizes, while on regions with small variability we would like to take large step sizes. If the function evaluated at the new gradient is smaller than current function value (minus a the norm of the gradient) then, then the current LR is maintained. Else wise, it’s reduced to deal w/ the high variance loss surface.
Figure: Gradient Descent w/ global adaptive LR (Backtracking line search)
2nd Order
$\nabla^2{f(x)}$ Can be evaluated
Kfac, Newton, GaussNewton, QuasiNewton, (L)BFGS
Newton’s Method
Newton’s method can be used to find a minimum or maximum of a function $f(x)$: .
The derivative is zero at a minimum or maximum, so local minima and maxima can be found by applying Newton’s method to the derivative. The iteration becomes:
Neural Networks
Function Approximtors.
Multilayer perceptron is a misnomer. It should be multilayer logistic regression!
B/c of the nonlinear activations, we don’t have a solve the nonconvex problem iteratively.
Deep Learning
 Batch Norm: Normalization at inner layers (helps overfitting and convergence time – higher learning rates). When training Deep Neural Networks (DNNs), it is common practice to normalize or whiten features by shifting them to have zero mean and scaling them to have unit variance. It works by internally maintaining an exponential moving average of the batchwise mean and variance of the data seen during training. The main effect of batch normalization is that it helps with gradient propagation—much like residual connections—and thus allows for deeper networks. Typically used after a convolutional or densely connected layer.
 How does Batch Normalization Help Optimization? (NeurIPS 2018): Argues that ICS (shift) has nothing to do w/ BN, but rather, it helps smooth the gradients and loss surface. https://twitter.com/aerinykim/status/1070188133810139136
 The NIPS loss surface paper form Washington that was on Arxive for a while shows how batch norm (and skip connections) smooth out the loss surface.
 Dropout: Regularization technique. Dropout is one of the most effective and most commonly used regularization techniques for neural networks, developed by Hinton and his students at the University of Toronto. Dropout, applied to a layer, consists of randomly “dropping out” (i.e. set to zero) a number of output features of the layer during training. The “dropout rate” is the fraction of the features that are being zeroedout; it is usually set between 0.2 and 0.5. At test time, no units are dropped out, and instead the layer’s output values are scaled down by a factor equal to the dropout rate, so as to balance for the fact that more units are active than at training time.
 Ensembling Analogy: Dropout can been seen as an ensembling technique using bagging.
 CNNs: signal processing technique that is more parameter efficient than a FC.
 Activations: Introduce nonlinearities, otherwise layers are pointless as can be combined into 1 linear operation
 Gradient Accumulation: If you cannot fit large batch sizes on your GPU, you can just average gradients over runs. Then apply them.
In theory, the deeper you go the more expressive power you can achieve, however there are tradeoffs:
 Diminishing Gradients: Backpropagated errors become very small after a few layers, which makes learning ineffective. This can be solved by pretraining, e.g. starting with initial weights that are close to the final solution
 Too Many Parameters:
CNNs
CNNs are motivated by 3 factors:
 Locality: neighbor relationships in spatial data (images, audio, sequences) are ignored by FFNs (see: local receptive fields below)
 Invariance: FFNs are not translation and scale invariant. They will memorize orientation and size. To mitigate this you would have to heavily regularize or do data augmentation of all possible combinations of orientation translations and size scales. (See subsampling)
 Parameter Scale: Fully Connected Networks don’t scale well especially to image data: e.g. 200x200x3 image, would lead to neurons that have
200x200x3 = 120,000
weights. Multiple layers of this would lead to overfitting due to too many parameters. A CNN layer on the same input volume would only need to learn10x5x5x3 = 750
parameters! (Assuming 10 5x5 filters over 3 channels ). (See Weight sharing below)
These limitations are remedied in CNNs by:
 Local receptive fields
 Weight sharing
 subsampling
Parameters:
 Input volume size: $W$
 Receptive field size: $F$
 Stride: $S$
 Zero padding: $P$
 FIlter count: $K$
The figure above is a great example of the advantage that weight sharing in CNNs give. A 6 5x5 filters on a 32x32x3 image give a give a 28x28x6 activation map with only 5x5x3x6=450 parameters. Compared to 14M parameters if we flattened the image and did it w/ a fully connected layer!
Parameters
 Number of
Filters
$f$  Kernel Size of
Filter
$s$
Settings:
 Stride: how to move the window
 Padding:
Each “A” sees two inputs. Each “B” sees two “As” which have a common input so each “B” has a receptive field of 3. Each B is exposed to exactly 3 inputs. In other words, receptive field grows linearly w/ layers.
Such 1D convnets can be competitive with RNNs on certain sequenceprocessing
problems, usually at a considerably cheaper computational cost. Recently, 1D conv
nets, typically used with dilated kernels, have been used with great success for audio
generation and machine translation. In addition to these specific successes, it has long
been known that small 1D convnets can offer a fast alternative to RNNs for simple tasks
such as text classification and timeseries forecasting.
Causal
Like LSTMs, causal convolutions can model sequences with longterm dependencies. This is achieved in two ways: stacking convolutional layers (with padding, every convolutional layer preserves the shape of the input), and dilation: insertion of gaps into the convolutional filters (otherwise known as atrous convolutions). convolutional filters applied to the sequence in a lefttoright fashion, emitting a representation at each step. They are causal in that the their output at time t
is conditional on input up to t1
: this is necessary to ensure that they do not have access to the elements of the sequence we are trying to predict
Spotlight CNN Sequential model: https://maciejkula.github.io/spotlight/sequence/representations.html#spotlight.sequence.representations.CNNNet
The convolution is causal because future states are never part of the convolution’s receptive field; this is achieved by leftpadding the sequence.
Dilation: setting this to a number greater than 1 inserts gaps into the convolutional layers, increasing their receptive field without increasing the number of parameters. Stacked dilated convolutions enable networks to have very large receptive fields with just a few layers, while preserving the input resolution throughout the network as well as computational efficiency.
Dilated convolutions AKA atrous convolutions AKA convolutions with holes are another method of increasing the receptive field without angering the gradient gods. When we looked at stacking layers so far, we saw that the receptive field grows linearly with depth. Dilated convolutions let us grow the receptive field exponentially with depth.
Here we have a 3X3 filter applied with a dilation of 1,2 and 3. With a dilation of 1 we have a standard convolution. With a dilation of 2 we apply the same 3X3 filter but use every second pixel
Oord, Aaron van den, et al. “Wavenet: A generative model for raw audio.” arXiv preprint arXiv:1609.03499 (2016).
Separable
P262 of the Keras book
Capsules
CNNs are translation invariant, however, we want them go equivariant!
One of the goals of subsampling/pooling is to introduce invariance to the model. Meaning it tries to make the neural activities invariant to small viewpoint changes.
Input > Conv > relu > pooling
Subsampling or pooling in convents looses the spatial relationships in the images. for example a face isn’t an arbitrary combination of eyes + nose + ears + mouth, it’s a specific arrangement of the pieces.
Pooling throws away information about the precise position of the entity within the region. Pooling reports the activity of the most active neurons and then forgets where they are, which gives you a small amount of translational variance.
Capsules are also robust against adversarial attacks b/c they learn the distribution of each class. This can be used to check for adversarial input from a different distribution.
Image Classification w/ Capsules
1 capsule/class in the top layer. so most examples only have 10 capsules (eg mnist) in first layer, but ImageNet would have 1000. it is at least O(n) where n is the number of classes. routing by aggreement causes it to be slow to train b/c of the inner loop.
RNNs
https://medium.com/@TalPerry/convolutionalmethodsfortextd5260fd5675f
Feed characters into layers of RNNs and get final state at end that represents the sequence.
With a basic RNN cell, we see a massive drop in performance when it comes to long sequences and the network needs to remember patterns which have occurred way at the beginning to infer things correctly at a current time step. And this is because of exploding and vanishing gradients.
LSTMs, which can remember information from the way past and also selectively forget stuff that is not required.
Residual/Skip Connections
“How do I make my network very deep without losing the gradient signal?”
The general idea is to reduce the distance between the signal coming from the networks loss and each individual layer. The way this is done is by adding a residual/direct connection between every layer and its predecessors. That way, the gradient can flow from each layer to its predecessors directly.
Applied ML
Recommender Systems
 ItemItem (Lindon. AMZN)
 MF: Latent Factor Models. No concept of time.
 ALS MF
 Pairwise $P(i > j \given U)$ Ranking objectives
 BPR
 Next Item Models: $P(i_{t+1} \given U, C)$. Better objective more aligned w/ most rec tasks. Weak concept of time.
IR

TwoPhase System: First stage is candidate generation from full corpus: high precision/fast response. The second stage is a high recall ranker which is more complex and uses more data sources.
 Personalized Search: The First stage can be a recommender system or a search engine which is personalized. The second stage can also use personalization give a user and context.
LTR
supervised task to rerank a list for optimal order. Data can be relevance judgements or log data. Log data gives you implicit ranking preference from clickstreams. With clickstream data you can either model clicks or ranking.
Architectures:
 RankNet: siamese network w/ AUC objective $P(i>j)$
 LambdaRank: directly optimizes NDCG objective
 YouTube: models user video watch time as ranker (not LTR per se)
Object2Vec
Distributed Representation/Representation Learning are popular across all fields currently. It all started w/ word2vec where items are clustered in an embedding space with respect to how similar they are in some $\text{context}$. These dense representations are then commonly used for downstream tasks like features for classification.
 Uber learns sentence embeddings using doc2vec where a sentence is from a chat message. Not sure what is the context here. And how to use it of unobserved sentences….
Word2Vec
Given a context, surrounding words, predict the target word. Error propagates back down to the embeddings.
 CBow: given context, predict target $P( T \given C)$
 Skipgram: given target, predict context $P(C\given T)$
 Glove
Figure: Skipgram architecture
Candidate Sampling
Softmax over 5000 classes is slow.
So if you have 5000 classes to evaluate the softmax layer you need to do 5000 forward passes through your function $f$ fro the different values of $x_i$. Computational Complexity: $O(\abs{V})$
This has motivated work into approximate softmax methods often called Candidate Sampling:
 Hierarchical Softmax
 Sampled Softmax
 Noice Contrastive Estimate (NCE)
Text Classification
Given a sentence model the probability of a it’s label $P(y\given X)$ where x is a variable length list of words.
Nice guide here: https://developers.google.com/machinelearning/guides/textclassification/
Because bagofwords isn’t an orderpreserving tokenization method (the tokens gen
erated are understood as a set, not a sequence, and the general structure of the sen
tences is lost), it tends to be used in shallow languageprocessing models rather than
in deeplearning models. Extracting ngrams is a form of feature engineering, and
deep learning does away with this kind of rigid, brittle approach, replacing it with hier
archical feature learning.
BOW
Classic approach which encodes the sentences/observations into fixed length feature vector length $\abs{V}$ . The encoding can be simple indicators for a word or counts or more popularly tfidf counts.
Once you have your fixed length feature, you can use any type of classifier.
Sequential
Instead of a fixed length feature vector w/ frequency information, we learn/use an embedding for each word/token.
 NGram: groupings of sequential words of Len N
 BOW: Fixed length feature vector of Ngram counts over $V$.
 Independence between BOW features is assumed, however, that is mitigated a bit by the fact the BOW features may infect be NGrams which will encode some dependence (sequence info)
 FeedForward:
 Ngrams: fixed length feature vector of ngrams over $V$ (very high dimensional)
 BOW: fixed length feature vector of averaged ngram embeddings (low dimensional)
 RNN: feed embeddings to RNN layers
 CNN: Convolve over ngram sequence of embeddings
Figure: A sequential Feedforward MLP(DNN). Note for the 3D embedding layer: Although the diagram is ambiguous, the embeddings for each activated movie are either added or averaged.
One strategy to combine the speed and lightness of convnets with the ordersensitivity of RNNs is to use a 1D convnet as a preprocessing step before an RNN (see figure 6.30). This is especially beneficial when you’re dealing with sequences that are so long they can’t realistically be processed with RNNs, such as sequences with thousands of steps. The convnet will turn the long input sequence into much shorter (downsampled) sequences of higherlevel features. This sequence of extracted features then becomes the input to the RNN part of the network.
NLP
 Named Entity Recognition
 Coreference
 Part of Speech Tagging
Anomaly Detection
Isolation Forests: Unsupervised. Overfit a a bunch of trees (forest) and then the average path length to a point is called the anomaly score. Put in a threshold and you have a classifier. The random partitioning, on average, creates shorter paths for anomalies/outliers, which is intuitive b/c it should be hard to separate inliers.
Permalink: machinelearningglossary
Tags: