Machine Learning Glossary
ML Glossary
TODO
D trees (probably code one to learn it best): Pretty much did this for classification of numerical and categorical variables. Didn’t explore regressions though… Think they just take the mean of the leaf values
 Simple GAN w/ mnist (port my TF example to Keras)
 Boosting
 One cycle FastAI stuff for LR scheduling
 Seq2Seq Attention
 NDCG
 Quantile Regression
 Revisit SVD in context of Rec Sys and Dimensionality Reduction (dim reduction vs PCA)
 Revisit SVMs
 Maximum likelihood questions: solve exponential function and get the maximum likelihood estimator.
 Probability questions: There are 100 product and 25 of them is bad. What is the confident interval.
 LDA connection to PCA
 MLE ME MAP < learn these, especially MLE
 Probability Calibration (scikit)
 The distribution of predictions should match the distribution of of the data. If it is skewed, it should be corrected (calibrated). This could also be connected to Netflix’s Calibrated Recommendations at recsys ‘18.
 Simplex Algorithm (liner programming) <asked if you can solve a linear equation
 ordinary least squares vs partial least squares
 Survivorship analysis (long tail processes re Better Mortgage)
 Generative Models
 He asked if I work w/ them and I must admit I don’t think like that. VAEs… GANs…
Information Theory
Shannon Entropy
Information entropy is the average rate at which information is produced by a stochastic source of data.
Where $p_i$ is the probability of class $i$. In the binary classification case, we can rollout the summation to:
Where p and q are the respective categories.
 What is the entropy of a homogenous set?
 $1 \log_21  0\log_20 = 0$
 What is the entropy of a set w/ a 50/50 mix?
 $(.5\log_2.5)  (.5\log_2.5) = 1$
Figure: Distribution of entropy for binary random variable (coin). For a fair coin p=.5, there is maximum entropy.
Graph Search
 DFS: Traverse down branches in lexical order
 Hill: Traverse down branches in using heuristic. Heuristic is shortest path to goal.
 BFS: Exhaustively traverse across branches
 Beam: Heuristic to traverse only best $k$ branches
Recursive DFS tree traversal
def traverse(node, visited):
visited.add(node)
for node in node.children  visited:
traverse(node, visited)
return #leaf
traverse(root, set())
Recursive BFS tree traversal:
q = []
def traverse(node, visited):
while q:
node = q.pop()
visited.add(node)
for node in node.children  visited:
queue.append(node)
traverse(root, set())
Linear Algebra
Norm
Measurement of the magnitude of a vector. Typically used for quantifying model parameter complexity, where complexity is viewed as either vector magnitude or sparseness.
 L0
 L1 (Manhattan): $\norm{x}1 = \sum{i=0} \abs{x_i}$
 L2 (Euclidian): $\norm{x}2 = \sqrt{\sum{i=0}x_i^2} = \sqrt{x\cdot x…}$
 Frobenius: Matrix Norm
Distance
Distance measure between two vectors $A$ and $B$
 Euclidian: $norm(AB)$ which is to say the size of the difference of the vectors
 Dot Product : $a\cdot b = \norm{A}\norm{B} cos(\theta)$ (loose measure of similarity. Bigger if a and b are more simular)
 Cosine: Normalized dot product : $\theta = 1 \frac{a\cdot b}{\norm{A}\norm{B}}$ (you can see the dot product in numerator, but then it’s divided by the lengths of the vectors in denominator)
Dot Product = $\sum_i \sum_j i\cdot j$
Cosine: normalized dot product (angle between two vectors)
Statistics
 Central Limit Theorem: Consider the sum of multiple random variables. The sum of a set of random variables, which is of course itself a random variable, has a distribution that become increasingly Gaussian as the number of terms in the sum increases.
 I think this is useful in hypothesis testing if you run a bunch of random experiments, you can assume the distribution of the results is Normal/Gaussian (even if the individual random variable isn’t normally distributed) and then do a ttest.
 Do the hypothesis testing exercise on data quest which I always thought was nice!
 Variance: how one variable varies with itself: $\frac{1}{n}\sum_{i=0}^x (x_i  \hat{x})^2$ (sum of squared diffs from mean)
 Covariance: how one variable varies w/ others: $\Sigma$ (mean of the product of each variables variance)
 Correlation: Measure of the linear relationship between two variables. Takes value between +1 and −1 inclusive: $Corr(X,Y) = \frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}}$ (normalized variance)
Probabilities
 Conjunctive: probability of
Event1
ANDevent2
; multiply probabilities  Disjunctive: probability of
Event1
ORevent2
; add probabilities  Conditional: probability of
A
givenB
$P(A\given B)$ : $\frac{P(A \cap B)}{P(B)}$  Bayes Rule: $P(A\given B) = \frac{P(B \given A) P (A)}{P(B)}$
Bayes Rule
Example: There is a cancer test that is 75% accurate (TP Rate/sensitivity):
The prior probably of having cancer in general is $P(Y=1)=.01$
The probability of a positive test $P(X=1)$:
What is the posterior probably $P(Y\given X)$ that you have cancer if you have a positive on the test?
Expectation Notation
For an unfair or weighted coin, the two outcomes are not equally likely. You can change the weight or distribution of the coin by dragging the true probability bars (on the right in blue) up or down. If we assign numbers to the outcomes — say, 1 for heads, 0 for tails — then we have created the mathematical object known as a random variable.
The expectation of a random variable is a number that attempts to capture the center of that random variable’s distribution. It can be interpreted as the longrun average of many independent samples from the given distribution. More precisely, it is defined as the probabilityweighted sum of all possible values in the random variable’s support.
Which equals 3.5 for a dice: $1/6 + 21/6… +61/6$
Variance: Whereas expectation provides a measure of centrality, the variance of a random variable quantifies the spread of that random variable’s distribution. The variance is the average value of the squared difference between the random variable and its expectation,
Distributions
 Skew: measures clustering of distribution. If you have a right skew it’s common to normalize the data (log).
 Kurtosis: measures the shape of distribution
 Modality: measures tendency of distribution
 Mean: measures central tendency but is sensitive to outliers (skew)
 Median: measures central tendency and is less sensitive to outliers (skew)
 Mode: most common number in dist
 Variance, $σ^2$: measures how far the average data point is from the mean: $\sum_i\frac{(\mu  x_i)^2}{\mu}$
 Standard deviation, σ: common way to refer to the distance between data points and the mean.
 Stationary: Mean and variance are constant. Typically seen In timeseries data.
 Normalization: Makes the data stationary meaning for random windows the mean and variance are constant
 Bernoulli Trail: Fail or success
 KL Divergence: Test similarity of two probability distributions. Minimizing the KL loss is generally used in MLE.

Confidence Interval

Normal: We can generate a normal distribution by using a probability density function.
 Parameterized by $\mu$ and $\sigma$

Binomial: sequential binary action: rain/dry, basketball shot/miss, market up/down
 Parameterized by: $n$ and $p$ number of events and prob of success respectively

Unimodal: One mode (normal)

Multimodal: More than 1 mode

Poisson: Number of calls received per hour by call center (when you want a normal w/o negative values…)

Parameterized by $\lambda$


Exponential: a skewed distribution that can model user time on a page in seconds

Parameterized by $\lambda$

Questions
 What is the expected value of flips to get heads on a fair coin?
 $E(x) \sim \text{Geometric}$, Expected value = $\frac{1}{p}$ = , i.e. is 1/.5 = 2
 Given a random Bernoulli trial generator, write a function to return a value sampled from a normal distribution.
 The trick is to sample from the normal CDF which samples from the Normal PDF
Experiment Design
 Null Hypothesis: Status Quo, random chance
 Alternative Hypothesis: statistically significant difference between the outcomes of the 2 groups.
 Onesided: null < x, alt >=x; Twosided (true/false): null=0, alt!=0
 Blind Experimentation:
 Control Group & Treatment Group
 Test Statistic: quantifies the data in the groups
 Statistical Test: Use Test Statistic to ran experiments
 Pvalue: probability quantifying random chance. The lower the better.
 Sample Size: refers to how large the sample for your experiment is. The larger your sample size, the more confident you can be in the result of the experiment (assuming that it is a randomized sample).
 $n = 16 \frac{\sigma^2}{\delta^2}$
 Effect Size/Minimum Detectible Effect $\delta$: The Minimum Detectable Effect is the smallest effect that will be detected $power\%$ of the time.
 refers to the size of the difference in results between the two sample sets and indicates practical significance. If there is a small effect size (say a 0.1% increase in conversion rate) you will need a very large sample size to determine whether that difference is significant or just due to chance. However, if you observe a very large effect on your numbers, you will be able to validate it with a smaller sample size to a higher degree of confidence. %5 is a common number.
 Confidence Interval: product of sample size and effect size. $\mu +Z\frac{\sigma}{\sqrt(n)}$
 ChaiSquared: Does the rate of success differ across two groups?
 Significance: Statistical significance is the likelihood that the difference in conversion rates between a given variation and the baseline is not due to random chance. ie if you run an A/B testing experiment with a significance level of 95%, this means that if you determine a winner, you can be 95% confident that the observed results are real and not an error caused by randomness.
 Significance Level $\alpha$: usually
0.05
 Type I error/Confidence Level (1⍺) — Chance of accepting a difference when there is not. 90%, 95%, 99%(a=.01)
 Type2 error $\beta$  Chance of rejecting an actual lift. β is inversely related to sample size.
 Statistical Power — (1β) is the power or sensitivity of experiment
 Product of Sample Size and Effect Size $\delta$
 Causal Inference???
 Confounding Factors: Causation is not correlation. unknown variables that might be influencing the test. For example, my conclusion is that Shoe Size is correlated w/ Literacy. However, the confounding factor there is age, which is actually correlated w/ literacy.
Figure: A normal/t distribution for the null hypothesis where $\mu=0$ w/ 2 tails highlighted being the regions the sample $\mu$ must lie for the p value to be below 0.05
. You can see the sample mean lies between, 2 and 1 and the goal of the test is to see if this is “weird” or not. It is weird if it is in the tail.
Statistical Tests

Covariance: relationship between the variances of two variables

Correlation: two variables change together
 Pearson: rvalue (continuous variables)
 Spearman: ordinal variables

$\chi^2$ Test: Similarity of two discrete categorical distributions (usually frequency distributions)

Comparison of Means

Ttest: Tests for the difference between two related variables (normality assumption) when you don’t know the population variance. (Otherwise ztest)

scipy.stats.ttest_ind
https://docs.scipy.org/doc/scipy0.15.1/reference/generated/scipy.stats.ttest_ind.html#scipystatsttestind 
#test w/ idential means >>> rvs1 = stats.norm.rvs(loc=5,scale=10,size=500) >>> rvs2 = stats.norm.rvs(loc=5,scale=10,size=500) >>> stats.ttest_ind(rvs1,rvs2) (0.26833823296239279, 0.78849443369564776) #accept null #Ttest with different means, variance, and n: >>> rvs5 = stats.norm.rvs(loc=8, scale=20, size=100) >>> stats.ttest_ind(rvs1, rvs5) (1.4679669854490653, 0.14263895620529152) #reject null


Welschs ttest: (improvement over classical ttests) comparison of two means (assumes normality)

Ztest: when we know the population $u$ and $\sigma$

Questions

In the experiment design portion you are asked to create a hypothetical plan to answer testing a new app.
 Identify Goals
 Identify Invariant and evaluation metrics
 Create Hypothesis
 Choose the test statistic
 Choose the statistical test (two sample tests): ttests, $X^2$ test, etc
 Choose significance level ($\alpha$), statistical power (1$\beta$) and Effect Size $\delta$
 Significance Level is usually $\alpha=0.05$
 Power: How many events do we need in test/control to get a statistically significant result. Usually $1\beta = 0.8$
 Choose Unit of Diversion & Sample size
 How will you split the users among tests? Page view, cookie or user id? Also consider what population will get a test, how to segment your users?
 Run Experiment
 Analyze results for significance
 Check invariants
 Check for significance in result
 Identify Goals

explain pvalue
 Probability that the null hypothesis is true. Typical cutoff is
.05
5% so anything under that we reject the null hypothesis, meaning that the effect is not due to randomness.
 Probability that the null hypothesis is true. Typical cutoff is

In an AB test, how to measure regression?
 If the test statistic is lower than the control baseline
Feedback
 To decide which of two models is better, proposed an experiment design where some metrics (like latency) stay invariant and some change (like CTR)
 Explained the intuition between confidence bounds and number of impressions (the more impressions, the more confident we are)
 Could not describe any statistical test or how to estimate confidence bounds
 Get sample mean.
 Get Z score from table (90, 95 or 99%)
 Get standard error = $\frac{\sigma}{\sqrt(n)}$
Estimation
MLE Maximum Likelihood Estimation
Method to fit data to a distribution. Goodfellow (2016) gushes over it.
When you try MLE on a gaussian you get MSE. When you try MLE on a Binomial you get Binary XEntropy. This means that implicitly when you use MSE, you are doing regression (continuous estimation) and when you use XEnt, you are doing (binary) classification.
MLE is a special case of Maximum Aposteriori Estimation (MAP) where the prior is uniform.
Normal
Given some measurements (ie weight of rabbits), get the normal distribution likelihood function which is parameterized by $\mu$ and $\sigma$ and then maximize the likelihood function to find the estimates for the parameters that fit the data.
MLE works down to:
Binomial
Given some bernuli samples (ie free throws ) try to fit the binomial distribution, which is paramteramized by $p$, to the data. Get the PMF of the Bernoulli Distribution:
The Likelihood function $L(p\given x)$ is the product of each data point given the parameter:
Take the log for connivence down the road:
And you have recovered binary cross entropy ^^^^
Take derivative w/ respect to $p$ and solve for 0.
Poisson
Random sample of the number of textbooks that a person buys for a semester at uni.
Poisson Dist is parameterized by $\lambda$ . PDF:
Now compose Likelihood function:
Log and take derivative at 0 to get the MLE estimator of $\hat\lambda_{MLE} =\frac{ \sum x_i}{ n}$
MAP Estimation
MLE is a special case of maximum a posteriori estimation (MAP) that assumes a uniform prior distribution of the parameters.
Includes Prior knowledge!
Bayes Rule:
Where $P(\Theta)$ can be a uniform distribution and thus ignored and $P(D)$ is considered a constant and thus ignored.
Or equivalently:
Where the first part is the posterior and the second part is the likelihood prior.
 When is MAP same as MLE?
 When there is no prior or when the prior is uniform.
Expectation Maximization
Lambda Loss Paper uses this….
A way to calculate MLE/MAP
EM is usually useful when it’s hard to calculate the gradient of the loglikelihood. In cases I’m familiar with, this is due to the presence of latent variables per datapoint. You see it in Gaussian mixture models, hidden Markov models, and other problems. In order to avoid calculating the gradient, you iteratively:
 take the expected complete data loglikelihood
 maximize it
This is guaranteed to converge to a local optimum.
Time Series Analysis
 Auto Correlation: Univariate analysis (denoted by an ACF plot) which shows the correlation between a variable and it’s lag (some period prior). This state violates the Statistical independence of observations assumption of Linear Regression, so when this condition exists and may call for more advanced modeling than simple Linear Regression (see AR Modeling).
 Running correlation analysis on the same variable (not two). So $(y_i, y_{i1}), (y_{I1}, y_{I2})$
 How strongly are todays temperatures related to yesterdays?
 Running correlation analysis on the same variable (not two). So $(y_i, y_{i1}), (y_{I1}, y_{I2})$
 Auto Regressive: a term used to describe the process/model (aka Autoregressive model) when one value may be affected by prior values. Current value is based on the immediately preceding value. Therefore an AR model assumes some degree of auto correlation. Like Linear regression assumptions require the absense of AC. An AR model doesn’t have traditional features. It has autocorrelated features. The feature are lagged responses (not predictors).
 AR models, meaning regression models that use the past as input features, are no different from regression with timeunrelated features. What we are doing is just using previous yvalues as input for a given point. We can manually construct such features directly.
 When we add a 2lag feature (meaning grosses from two weeks ago & last week in our case), the model is called AR2 (AR for AutoRegressive). When we use up to three, it is called AR3 and so on. Let’s do an AR2.
ARIMA: Auto Regressive Integrated Moving Average
Generative Models
 GMM
 VAE
 GAN
Discriminative Models
Everything Else!
 Logistic regression
 Support Vector Machines
 Maximumentropy Markov models
 Conditional random fields
 Neural networks
ML
 Auto Correlation: where outputs depend on past inputs.
Class Imbalance
Good to look at recall
 Down/Up Sample: balance class weights, however, you throw away a lot of data
 Weighted Loss: Weight the loss function by the inverse class frequency count
 Robust Modeling: Use a model that isn’t sensitive to class imbalance:
 Treat as anomaly detection problem: Isolation Forrest
 Boosting: sequential models will adapt to imbalance
Feature Engineering
BOW (Vocabularies)
When you have a variable list of features (like words in a sentence, or number of products a user liked), then a common pattern to is transform this into a fixed length feature vector across the whole vocabulary $V$.
Buckets (Continuous Features)
Continuous features don’t need extra treatment, however, signal can be gained by knowing their properties. These tricks can help you stay on a linear model :)
Looking at the distribution of features, we can define relevant buckets for continuous variables, such as counters or timestamps. For instance, consider the number of products a user has seen. Seeing four products definitely reflects more engagement than seeing only one product. However, seeing a hundred and four products is not that different from seeing a hundred products, hence the need to bucketize continuous variables.
Interesting case study from scikit: https://scikitlearn.org/stable/auto_examples/preprocessing/plot_discretization.html#sphxglrautoexamplespreprocessingplotdiscretizationpy
In 1D the linear regression can’t fit the data, but in 10 dimensions it can! This is because the inflexible linear model can’t fit the nonlinear curve in 1d. However, if we let it fit in multiple dimensions, we can gain some expressiveness.
For example, consider raw data that represents the year a house was built. Instead of representing that year as a scalar numeric column, we could split the year into the following four buckets:
Date Range  Represented as… 

< 1960  [1, 0, 0, 0] 
>= 1960 but < 1980  [0, 1, 0, 0] 
>= 1980 but < 2000  [0, 0, 1, 0] 
>= 2000  [0, 0, 0, 1] 
Why would you want to split a number—a perfectly valid input to your model—into a categorical value? Well, notice that the categorization splits a single input number into a fourelement vector. Therefore, the model now can learn four individual weights rather than just one; four weights creates a richer model than one weight. More importantly, bucketizing enables the model to clearly distinguish between different year categories since only one of the elements is set (1) and the other three elements are cleared (0). For example, when we just use a single number (a year) as input, a linear model can only learn a linear relationship. So, bucketing provides the model with additional flexibility that the model can use to learn.
Feature Crosses
Moreover, some variables make more sense when crossing them together to analyze the cooccurrence effect, such as the number of banner a user has seen on a particular device. The “fatigue” effect is heavily dependent on the screen used, and consequently on the device type the user is on. See factorization machines.
Latitude Longitude
Latitude and Longitude as numerical features alone don’t give much discriminative power. An interesting technique that I like is doing bucketed feature crosses. So first you discretize by some type of bucketing to to make lat and long ranges, then you do feature crosses on the ranges to make geo squares!
How do you handle cases where a point on the map is on the boundary of one of your grid squares? Reminds me of the time encoding task….
Embeddings
Learnt dense representation of sparse (categorical) data.
Anytime you have a categorical feature, you can probably learn an embedding
Transformations
Cartesian Space: Timeofday
Encoding the numerical Timeofday as categories from 023 ruins the continuity from 12 am to 1am. Instead consider time a vector in polar space and then convert to cartesian space. This transformation will give you two dimensions in cartesian space.
The naive way to do this is to bucketize into parts of day or to make make a continuous feature out of the hours. But there is a more sophisticated was using trigonometry.
The problem of using hour of day is that it ruins the circular nature of time by adding a discontinuity between 12PM and 1AM. Instead, consider time as an angle w/ a unit vector and then covert that to 2d cartesian space.
The discontinuity is bad b/c we are saying there is a relative importance between 2am12pm and not 12pm1am when there really is!
Technically, this means adding two new features x,y in cartesian space. Now hour 23 and 0 are right next to each other.
import numpy as np
df['hr_sin'] = np.sin(df.hr*(2.*np.pi/24))
df['hr_cos'] = np.cos(df.hr*(2.*np.pi/24))
df['mnth_sin'] = np.sin((df.mnth1)*(2.*np.pi/12))
df['mnth_cos'] = np.cos((df.mnth1)*(2.*np.pi/12))
Dimensionality Reduction
 Feature Selection
 SVD
 PCA
 Random Projections
KMeans Clustering
Clustering method which tries to minimize the incluster SSE. Parameterized by $K$ for the number of clusters/centroids. Very sensitive to the initial location of the centroids.
 Initiation of centroids (random or heuristic – very important)
 Assignment of data points to 1NN centroids
 Update centroids location (center of cluster/mean of cluster points)
 Repeat #2 and terminate upon convergence
Figure: KMeans run for 4 iterations
Linear Regression
R2
Where SEE is the mean squared loss of the optimized model and SST is the MSE loss of the average of the data. 0 is worst, 1 is best.
You cannot blindly trust the R2 score or other summary statisitcs as a measure of fit goodness. Particularly, you cannot ignore the assumptions of a regression, which are:
Assumptions
(Most of these assumptions are in place to ensure the validity of the coefficient weights. For example if you ignored these assumptions and took some meaning from the coefficients then you will be misled.)
 Linear relationship between predictor and response and
 no multicollinearity
 Predictors/features are correlated with each other. Gives wrong coefficient estimates, however, does not effect the response (so ignore it:) )
 No autocorrelation (statistical independence of the errors). Autocorrelation occurs when the residuals are not independent from each other.
 If your errors are not independent, then that means your observations are autocorrelated, then there is multicollinearity which violates the second assumption.
 Bias: Your “best fit line” will likely be way off because it will be pulled away from the “true line” by the effect of the lagged errors.
 Inconsistency: Given the above, your sample estimators are unlikely to converge to the population parameters.
 Inefficiency: While it is theoretically possible, your residuals are unlikely to be homoskedastic if they are autocorrelated. Thus, your confidence intervals and you hypothesis tests will be unreliable.
 Normality of the residuals/error distribution
 Homoscedasticity (stationary/constant variance) of the residuals/error distribution
Least Squares
Closed form/analytical solution to the MSE/squared residuals/SSE error/loss function of linear regression.
Problems:
 Requires all data to do a matrix inversion: not computationally tractable in some cases
 The the MSE loss is sensitive to outliers
 Assumes a Gaussian conditional distribution (continuous), whereas binary target vectors clearly have a distribution that is far from Gaussian (more like binomial), so it’s not fitted for classification, only regression.
Basis Function/Kernel
I think these are called kernels in Scikit, not sure the diff…
Transforms data points. Allows a linear combination in $w$ to learn nonlinear relationships.
Consider the simple linear regression:
Now consider the basis form:
Where $\phi: R^d \rarr R^m$ (this could be nonlinear)
Then we can view the above mentioned simple liner regression just a selection case where the basis function is the identify function $\phi(x)=x$
Example Basis functions:
 Polynomial (degree 2): $\phi(x) = (1, x(1), x(2), x(1)^2 x(2)^2, x(1)x(2))$ for $X \in R^2 $
 Sigmoid:
 Radial: Tranform to polar coordinates. Popular for SVM
 Fourier: DSP
 Wavelet: useful in image processing. Combination of Fourier and RBF.
For example, you cannot learn the XOR function using a linear model w/ an identity basis function. However, you can w/ a polynomial basis function, b/c the $x_1 * x_2$ term is linearly separable in $w$.
Great example here in the scikit docs: https://scikitlearn.org/stable/modules/linear_model.html#polynomialregressionextendinglinearmodelswithbasisfunctions
Figure: Illustration of the mapping φ. On the left a set of samples in the input space, on the right the same samples in the feature space where the polynomial kernel K(x,y) (for some values of the parameters c and d) is the inner product. The hyperplane learned in feature space by an SVM is an ellipse in the input space.
Classification
Discriminate Function: Assigns sample point to a class.
 Binary Classification: $2$ classes
 Multiclass Classification: $K$ classes
 Onevsall: train $K1$ binary classifiers. However, you will see this leads to ambitious decision surfaces
 Onevsone: train $K(K1)/2$ combinations of binary classifiers for $K$. Still has ambitious dec. surfaces
 Multinomial: (this is the obvious choice I would code up before I even know the two other options existed). train 1 classifier w/ matrix of weights for all classes. The output is the dot product of the input $x$ w/ each set of weights for the respective class. So if you have 6 classes, you will get 6 scaler dot product values and the
argmax
corresponds to the winning class. What’s the difference with this and softmax?
Logistic Regression
Where $\sigma$ is an activation function. You can also consider x as a basis function $\phi(x)$ as in Regressions.
The Logistic Regression(classification) coefficients are usually estimated using MLE. Unlike linear regression with normally distributed residuals, it is not possible to find a closedform expression for the coefficient values that maximize the likelihood function, so that an iterative process must be used instead; for example Newton’s method or GD. This process begins with a tentative solution, revises it slightly to see if it can be improved, and repeats this revision until no more improvement is made, at which point the process is said to have converged.
Binary Cross Entropy Loss
Binary xent can be recovered by maximizing the likelihood of a binomial distribution.
Regularization
Regularization is any process to help prevent overfitting in a model.
 Weight decay
 Bagging (Dropout)
 Early Stopping
Ocam’s Razor: Among competing hypothesis, the simplest is the best. In parametric models, simplicity can be considered either:
 Less parameters: L1 Lasso
 Smaller parameters: L2 Ridge
L2 regularization assumes a prior belief that weights should be small and normally distributed around zero. Interesting!
This is an interesting visualization from ISL which shows a red unregularized convex loss function in two dimensions w/ parameters $\beta_1$ and $\beta_2$, where the optimal value is $\hat{\beta}$. It also shows the two constrained regions of the L1 and L2 norm, which are centered around zero and small in magnitude. Thus we can see the optimal regularized optimum is not $\hat{\beta}$ but rather the intersection of the two regions. Notice that this makes both losses have smaller magnitude and in the L1 case makes the solution space sparse as $\beta_1=0$!
So in other words, to summarize, the absolute value’s going to be a constraint region that has sharp corners. In high dimensions, you have edges and corners. And along an edge or a corner, if you hit there, you get a 0. So this is, geometrically, why you get sparsity in the Lasso.
Ensembling
There is no universal model that preforms best on all data. Each model has tradeoffs, advantages/disadvantages against others (see no free lunch principle). In practice, it turns out that instead choosing the best model out of a set of models, if we combine many various, the results are better — often much better — and at little extra effort.
Creating such model ensembles is now standard. In the simplest technique, called
 bagging, we simply generate radom variates of the training set by revamping, learn an overfit classifier on each, and combine the results by voting. This works because it greatly reduces variance while only slightly increasing bias.
 boosting, training examples have weights, and thse are varied so that each new weak classifier can focus on the examples the previous ones tended to get wrong.
 stacking, the outputs of individual classifiers come the inputs of a “higherlevel” larger that figure out how best to combine them.
Tree Models (non parametric)
Decision Tree
Recursively Splits the decision space. Tries to partition the data in such a way to make PURE leafs – homogeneous groups.
 Good for nonlinearlyseperable data, usually next choice after linear regression
 Prone to overfitting (high variance / low bias): Use bagging to alleviate: see random forrest
 Interpretable: visualize decision surface.
 lowerdimensional data is better / high dimensional will result in deep tree (read overfitted)
 Greedy Algorithm: not guaranteed to find the optimal solution.
Entropy
Purity/disorder measurement of set:
where $n$ is the class id.
Information Gain
Information Gain is the average entropy of the children subtracted from the the entropy of the parent :
Where $p$ is the ratio of items in set A.
Algorithm
 For the given branch (root if first iteration)
 Find (exhaustively) feature/column w/ most potential information gain $f_i$
 For each column, For each value, calculate info gain @value, and return largest.
 Split the branch among the target classes with respect to $f_i$
 For each split (2): Repeat Step #1, unless leaf!
 Find (exhaustively) feature/column w/ most potential information gain $f_i$
Entropy Example
Where $p_i$ is the probability of class $i$. In the binary classification case, we can rollout the summation to:
Where p and q are the respective categories.
 What is the entropy of a homogenous set?
 $1 \log_21 = 0$
 What is the entropy of a set w/ a 50/50 mix?
 $(.5\log_2.5)  (.5\log_2.5) = 1$
X  Y  Class 

Green Circle  
Ping Cross  
Green Circle 
Figure: 16 greens and 14 ping = 30 total
What is the potential entropy in the distribution above?
Very high entropy. This is a good column feature/column to split on.
Pruning
Basic Dtrees massively overfit, so pruning helps simplify the tree in order to generalize better
Bagging
 Bootstrap Sampling: Random sample with replacement in order to lower variance (overfitting)
 RF: Averages the results of dtrees overfit on bootstrap samples.
The theory bagging, in a nutshell, is that you have multiple models that you blend together to reduce variance and make your predictions more stable. That’s how a random forest works, it is a combination of n_estimators
decision tree models that use majority voting (in the case of Random Forest Classifier) or straight averaging (in the case of Random Forest Regressor). Random Forests are called Bagging MetaEstimators. Bagging reduces variance by introducing randomness into selection of the best feature to use at a particular split. Bagging estimators work best with when you have very deep levels on the decision trees. The randomness prevents overfitting the model on the training data.
In bagging, we use many overfitted classifiers (low bias but high variance) and do a bootstrap to reduce the variance.
Bagging is short for Bootstrap Aggregation. It uses several versions of the same model trained on slightly different samples of the training data to reduce variance without any noticeable effect on bias. Bagging could be computationally intensive esp. in terms of memory.
The intuition behind bagging is that averaging a set of observations reduces the variance. Given $Z_n$ observations w/ variance $\sigma^2$, the variance of the mean $Z$ is given by $\frac{\sigma^2}{n}$. Hence a natural way to reduce the variance and hence increasing the predicting accuracy of a model is to take many samples of the training set and average the resulting predictions.
Examples: Random Forrest
Boosting
 Lowers bias while maintaining variance.
In boosting, we allow many weak classifiers (high bias with low variance) to learn form their mistakes sequentially with the aim that they can correct their high bias problem while maintaining the lowvariance property.
Examples: AdaBoost, GradientBoost
AdaBoost
Boosting relies on training several (simple, usually decision stumps) models successively each trying to learn from the errors of the models preceding it. Boosting differes from other algorithms in that in addition to it gives weights to training examples (as opposed to linear models which apply weights on features). So in essence we can weight the scarcer observations more heavily than the more populous ones. Boosting decreases bias and hardly affects variance (unless you are very sloppy). Depending on your n_estimators
paramenter you are adding another innerloop to your training step, so the the price of AdaBoost is an exensive jump in computational time and memory.
Kernels
Making data linearly separable by transforming it
https://www.quora.com/Whatis“kernelmethods”
SVM
Maximum Margin Classifier. Tries to find 2 points, supports, which are the furthest away from the decision boundary. Trying to maximize the width of the street. Mathematical guarantee to find global maxima.
Bias/Variance Tradeoff
If you model is overfitting, then it is highvariance: learning noise. If it is underfitting, it is too biased and not capturing the signal.
Often this is quantified by looking at the Learning Curves:
How to remove variance (overfitting):
 Get more data
 Regularization
 Reduce (count/magnitude) of Parameters
 Bagging: Use less flexible model (eg linear)
 Early Stopping
 Also see data augmentation and batch norm*
How to remove bias (underfitting):

Get more data

More flexible model/parameters (eg Boosting)
Evaluation
 Precision: ability of the classifier not to label as positive a sample that is negative $\frac{TP}{ TP + FP}$
 Recall: ability of the classifier to find all the positive samples. (Min false negatives) $\frac{TP}{TP+FN}$
 PrecisionRecall is a useful measure of success of prediction when the classes are very imbalanced.
 PR Curve: P&R at different thresholds cutoff thresholds in the binary classifier. The precisionrecall curve shows the tradeoff between precision and recall for different threshold. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate. High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall). precisionrecall curves are appropriate for imbalanced datasets
 High Recall (Optimistic System): A system with high recall but low precision returns many results, but most of its predicted labels are incorrect when compared to the training labels
 High Precision (Pessimistic System): A system with high precision but low recall is just the opposite, returning very few results, but most of its predicted labels are correct when compared to the training labels.
ROC
ROC curves are appropriate when the observations are balanced between each class
AUC
An evaluation metric that considers all possible classification thresholds.
The Area Under the ROC curve is the probability that a classifier will be more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive
Questions
 Linear Regression Assumptions:
 Linearity
 Constant variance of residual (stationarity)
 Normal distribution of variance with mean = 0
 What is stationary signal?:
 one which has a constant mean and variance
 Write out a function to calculate the AUC of an ROC curve
 The integral of the curve
Optimization
0th Order Black Box
Only $f(x)$ can be evaluated
1st Order Gradient Based
$\nabla{f(x)}$ Can be evaluated
Vanilla GD has fixed shared learning rates across parameters/dimensions. Later innovations brought momentum. The latest innovations are in adaptive and individual learning rates across dimensions.
 Learning Rate: scales the gradient. Therefore the higher the learning rate, the more confidence you have in the quality/accuracy of your gradient. The higher the learning rate the faster you can possibly converge.
 Batch Size: The was an interesting analogy I heard about getting advice from a drunk person for directions. If he was very drunk you would only take little confidence in his directions, you’d go and ask the next person. However, if it was a sober person, you would have higher confidence in their direction and put more weight on it. The size of your batch is like how drunk the person is. A large batch is a sober person and small batch is a drunk person: $\abs{B}\propto \lambda$ . However, as your batches get smaller, the gradients have a lot more noise, which some may say will help not getting stuck in local minima.
Figure: chart from fastAI showing Loss correlation with LR and Batch size. Largest batch is blue which shows the smoothest curve and highest (optimal) learning rate. Purple is smallest batch size and you can see it has the lowest optimal learning rate.
Vanilla
(Vanilla) Gradient Descent is very sensitive to step size. In illconditioned setting, it’s very useful to have an adaptive step size: in regions of large variability of the gradient we would like to take small step sizes, while on regions with small variability we would like to take large step sizes.
Momentum
W/o Momentum  W/ Momentum 

Momentum [5] is a method that helps accelerate SGD in the relevant direction and dampens oscillations as can be seen in Image 3. It does this by adding a fraction $\gamma$ of the update vector of the past time step to the current update vector:
where $v_t$ can be seen as a exponentially decaying average of past gradients.
Adagrad
Adagrad is an adaptive learning rate method originally proposed by [Duchi et al]. Adapts the learning rate to the parameters, performing smaller updates (i.e. low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e. high learning rates) for parameters associated with infrequent features. For this reason, it is wellsuited for dealing with sparse data.
Independent parameter learning rates.
Adadelta/RMSprop
Fixes the diminishing gradient problem w/ Adagrad.
Adam
In addition to storing an exponentially decaying average of past squared gradients $v_t$ like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients $m_t$, similar to momentum.
2nd Order
$\nabla^2{f(x)}$ Can be evaluated
Kfac, Newton, GaussNewton, QuasiNewton, (L)BFGS
Newton’s Method
Newton’s method can be used to find a minimum or maximum of a function $f(x)$: .
The derivative is zero at a minimum or maximum, so local minima and maxima can be found by applying Newton’s method to the derivative. The iteration becomes:
LGBFS: batch mode only. Unlike minibatch SGD, getting LBFGS to work on minibatches is more tricky and an active area of research.
Neural Networks
Function Approximtors.
Multilayer perceptron is a misnomer. It should be multilayer logistic regression!
B/C of the nonlinear activations, we don’t have a solve the nonconvex problem iteratively.
Deep Learning
 Batch Norm: Normalization at inner layers (helps overfitting and convergence time – higher learning rates). When training Deep Neural Networks (DNNs), it is common practice to normalize or whiten features by shifting them to have zero mean and scaling them to have unit variance. It works by internally maintaining an exponential moving average of the batchwise mean and variance of the data seen during training. The main effect of batch normalization is that it helps with gradient propagation—much like residual connections—and thus allows for deeper networks. Typically used after a convolutional or densely connected layer.
 How does Batch Normalization Help Optimization? (NeurIPS 2018): Argues that ICS (shift) has nothing to do w/ BN, but rather, it helps smooth the gradients and loss surface. https://twitter.com/aerinykim/status/1070188133810139136
 The NIPS loss surface paper form Washington that was on Arxive for a while shows how batch norm (and skip connections) smooth out the loss surface.
 Dropout: Regularization technique. Dropout is one of the most effective and most commonly used regularization techniques for neural networks, developed by Hinton and his students at the University of Toronto. Dropout, applied to a layer, consists of randomly “dropping out” (i.e. set to zero) a number of output features of the layer during training. The “dropout rate” is the fraction of the features that are being zeroedout; it is usually set between 0.2 and 0.5. At test time, no units are dropped out, and instead the layer’s output values are scaled down by a factor equal to the dropout rate, so as to balance for the fact that more units are active than at training time.
 Ensembling Analogy: Dropout can been seen as an ensembling technique using bagging.
 CNNs: signal processing technique that is more parameter efficient than a FC.
 Activations: Introduce nonlinearities, otherwise layers are pointless as can be combined into 1 linear operation
 Gradient Accumulation: If you cannot fit large batch sizes on your GPU, you can just average gradients over runs. Then apply them.
Going Deep
In theory, the deeper you go the more expressive power you can achieve, however there are tradeoffs. Diminishing Gradients: Backpropagated errors become very small after a few layers, which makes learning ineffective.
 Skipconnections: The general idea is to reduce the distance between the signal coming from the networks loss and each individual layer. The way this is done is by adding a residual/direct connection between every layer and its predecessors. That way, the gradient can flow from each layer to its predecessors directly.
 This can be solved by pretraining, e.g. starting with initial weights that are close to the final solution.
 Dilated Convolutions
CNNs
CNNs are motivated by 3 factors:
 Locality: neighbor relationships in spatial data (images, audio, sequences) are ignored by FFNs (see: local receptive fields below)
 Invariance: FFNs are not translation and scale invariant. They will memorize orientation and size. To mitigate this you would have to heavily regularize or do data augmentation of all possible combinations of orientation translations and size scales. (See subsampling below)
 Parameter Scale: Fully Connected Networks don’t scale well especially to image data: e.g.
[200 x 200 x 3]
image, would lead to neurons that have200x200x3 = 120,000
weights. Multiple layers of this would lead to overfitting due to too many parameters. A CNN layer on the same input volume would only need to learn10x5x5x3 = 750
parameters! (Assuming 10[5 x 5]
filters over 3 channels ). (See Weight sharing below)
These limitations are remedied in CNNs by:
 Local receptive fields
 Weight sharing:
 subsampling: Pooling
Parameters:
 Input volume size: $W$
 Receptive field size: $F$
 Stride: $S$
 Zero padding: $P$
 FIlter count: $K$
The figure above is a great example of the advantage that weight sharing in CNNs give. 6
[5 x 5]
filters on a [32 x 32 x 3]
image give a give a [28 x 28 x 6]
activation map with only 5 x 5 x 3 x 6 = 450
parameters. Compared to 14M
parameters if we flattened the image and did it w/ a fully connected layer!
Parameters
 Number of
Filters
$f$  Kernel Size of
Filter
$s$
Settings:
 Stride: how to move the window
 Padding:
Receptive Field: receptive field grows linearly w/ CNN layers. Dilated Convs let us grow receptive field exponentially w/ depth (more efficient in terms of params).
Each “A” sees two inputs. Each “B” sees two “As” which have a common input so each “B” has a receptive field of 3. Each B is exposed to exactly 3 inputs. In other words, receptive field grows linearly w/ layers.
Such 1D convnets can be competitive with RNNs on certain sequenceprocessing
problems, usually at a considerably cheaper computational cost. Recently, 1D conv nets, typically used with dilated kernels, have been used with great success for audio
generation and machine translation. In addition to these specific successes, it has long
been known that small 1D conv nets can offer a fast alternative to RNNs for simple tasks
such as text classification and timeseries forecasting.
Causal
Like LSTMs, causal convolutions can model sequences with longterm dependencies. This is achieved in two ways: stacking convolutional layers (with padding, every convolutional layer preserves the shape of the input), and dilation: insertion of gaps into the convolutional filters (otherwise known as atrous convolutions). convolutional filters applied to the sequence in a lefttoright fashion, emitting a representation at each step. They are causal in that the their output at time t
is conditional on input up to t1
: this is necessary to ensure that they do not have access to the elements of the sequence we are trying to predict
Spotlight CNN Sequential model: https://maciejkula.github.io/spotlight/sequence/representations.html#spotlight.sequence.representations.CNNNet
The convolution is causal because future states are never part of the convolution’s receptive field; this is achieved by leftpadding the sequence.
Dilation: setting this to a number greater than 1 inserts gaps into the convolutional layers, increasing their receptive field without increasing the number of parameters. Stacked dilated convolutions enable networks to have very large receptive fields with just a few layers, while preserving the input resolution throughout the network as well as computational efficiency.
Dilated convolutions AKA atrous convolutions AKA convolutions with holes are another method of increasing the receptive field without angering the gradient gods. When we looked at stacking layers so far, we saw that the receptive field grows linearly with depth. Dilated convolutions let us grow the receptive field exponentially with depth.
Here we have a 3X3 filter applied with a dilation of 1,2 and 3. With a dilation of 1 we have a standard convolution. With a dilation of 2 we apply the same 3X3 filter but use every second pixel
Oord, Aaron van den, et al. “Wavenet: A generative model for raw audio.” arXiv preprint arXiv:1609.03499 (2016).
Separable
P262 of the Keras book
Capsules
CNNs are translation invariant, however, we want them go equivariant!
One of the goals of subsampling/pooling is to introduce invariance to the model. Meaning it tries to make the neural activities invariant to small viewpoint changes.
Input > Conv > relu > pooling
Subsampling or pooling in convents looses the spatial relationships in the images. for example a face isn’t an arbitrary combination of eyes + nose + ears + mouth, it’s a specific arrangement of the pieces.
Pooling throws away information about the precise position of the entity within the region. Pooling reports the activity of the most active neurons and then forgets where they are, which gives you a small amount of translational variance.
Capsules are also robust against adversarial attacks b/c they learn the distribution of each class. This can be used to check for adversarial input from a different distribution.
Image Classification w/ Capsules
1 capsule/class in the top layer. so most examples only have 10 capsules (eg mnist) in first layer, but ImageNet would have 1000. it is at least O(n) where n is the number of classes. routing by aggreement causes it to be slow to train b/c of the inner loop.
RNNs
Stateful layers:
 RNN
 GRU
 LSTM
https://medium.com/@TalPerry/convolutionalmethodsfortextd5260fd5675f
Feed characters into layers of RNNs and get final state at end that represents the sequence.
LSTMs (vanishing gradients)
With a basic RNN cell, we see a massive drop in performance when it comes to long sequences and the network needs to remember patterns which have occurred way at the beginning to infer things correctly at a current time step. And this is because of exploding and vanishing gradients.
LSTMs, which can remember information from the way past and also selectively forget stuff that is not required.
Residual/Skip Connections
“How do I make my network very deep without losing the gradient signal?”
The general idea is to reduce the distance between the signal coming from the networks loss and each individual layer. The way this is done is by adding a residual/direct connection between every layer and its predecessors. That way, the gradient can flow from each layer to its predecessors directly.
Transfer Learning
Learning in one domain and transferring it to another.
Domain Transfer/Adaptation
One application of this is learning from a simulation .
Applied ML
Recommender Systems
Representation Learning. Most popular Rec Sys these days can be considered Representation Learning approaches. They all learn some dense representation of items and then uses that data structure to make predictions. This has a nice auxiliary product of similarity tables, however, it is also not directly modeling the recommendation task.
 Representation Learning
 ItemItem (Lindon. AMZN). Learns a symmetric itemitem similarity table.
 MF: Latent Factor Models. Learns User and Item embeddings. No concept of time.
 ALS MF
 Pairwise $P(i > j \given U)$ Ranking objectives
 BPR
 Next Item Models: $P(i_{t+1} \given U, C)$. Better objective more aligned w/ most rec tasks. Weak concept of time.
####
IR
 TwoPhase System: First stage is candidate generation from full corpus: high precision/fast response. The second stage is a high recall ranker which is more complex and uses more data sources.
 Personalized Search: The First stage can be a recommender system or a search engine which is personalized. The second stage can also use personalization give a user and context.
LTR
supervised task to rerank a list for optimal order. Data can be relevance judgements or log data. Log data gives you implicit ranking preference from clickstreams. With clickstream data you can either model clicks or ranking.
Architectures:
 RankNet: siamese network w/ AUC objective $P(i>j)$
 LambdaRank: directly optimizes NDCG objective
 YouTube: models user video watch time as ranker (not LTR per se)
Unbiased Bandit Data
Approaches that use log data, such as RecSys and LTR have to deal w/ a bias issues. This data, also called Bandit Data, is generated by a system. Compare this to the classical supervised learning approach w/ groundtruth labels, log data is biased by the system from which it is made.
Object2Vec
Distributed Representation/Representation Learning are popular across all fields currently. It all started w/ word2vec where items are clustered in an embedding space with respect to how similar they are in some $\text{context}$. These dense representations are then commonly used for downstream tasks like features for classification.
 Uber learns sentence embeddings using doc2vec where a sentence is from a chat message. Not sure what is the context here. And how to use it of unobserved sentences….
Word2Vec
Given a context, surrounding words, predict the target word. Error propagates back down to the embeddings.
 CBow: given context, predict target $P( T \given C)$
 Skipgram: given target, predict context $P(C\given T)$
 Glove
Figure: Skipgram architecture.
The figure above w/ the context “the cat sits on the” and the target “mat” can be generalized into many general use cases. For example Uber Eats learns embeddings where the context is the search query and the target is the item the user converted. For example: “Dan Dan Noodles” and {Item A}
Candidate Sampling
Softmax over 5000 classes is slow.
So if you have 5000 classes to evaluate the softmax layer you need to do 5000 forward passes through your function $f$ fro the different values of $x_i$. Computational Complexity: $O(\abs{V})$
This has motivated work into approximate softmax methods often called Candidate Sampling:
 Hierarchical Softmax
 Sampled Softmax
 Noice Contrastive Estimate (NCE)
Text Classification
Given a sentence model the probability of a it’s label $P(y\given X)$ where x is a variable length list of words.
Nice guide here: https://developers.google.com/machinelearning/guides/textclassification/
Because bagofwords isn’t an orderpreserving tokenization method (the tokens gen
erated are understood as a set, not a sequence, and the general structure of the sen
tences is lost), it tends to be used in shallow languageprocessing models rather than
in deeplearning models. Extracting ngrams is a form of feature engineering, and
deep learning does away with this kind of rigid, brittle approach, replacing it with hier
archical feature learning.
BOW
Classic approach which encodes the sentences/observations into fixed length feature vector length $\abs{V}$ . The encoding can be simple indicators for a word or counts or more popularly tfidf counts.
Once you have your fixed length feature, you can use any type of classifier.
Sequential
Instead of a fixed length feature vector w/ frequency information, we learn/use an embedding for each word/token.
 NGram: groupings of sequential words of Len N
 BOW: Fixed length feature vector of Ngram counts over $V$.
 Independence between BOW features is assumed, however, that is mitigated a bit by the fact the BOW features may infect be NGrams which will encode some dependence (sequence info)
 FeedForward:
 Ngrams: fixed length feature vector of ngrams over $V$ (very high dimensional)
 BOW: fixed length feature vector of averaged ngram embeddings (low dimensional)
 RNN: feed embeddings to RNN layers
 CNN: Convolve over ngram sequence of embeddings
Figure: A sequential Feedforward MLP(DNN). Note for the 3D embedding layer: Although the diagram is ambiguous, the embeddings for each activated movie are either added or averaged.
One strategy to combine the speed and lightness of convnets with the ordersensitivity of RNNs is to use a 1D convnet as a preprocessing step before an RNN (see figure 6.30). This is especially beneficial when you’re dealing with sequences that are so long they can’t realistically be processed with RNNs, such as sequences with thousands of steps. The convnet will turn the long input sequence into much shorter (downsampled) sequences of higherlevel features. This sequence of extracted features then becomes the input to the RNN part of the network.
NLP
 Named Entity Recognition
 Coreference
 Part of Speech Tagging
Anomaly Detection
Isolation Forests: Unsupervised. Overfit a a bunch of trees (forest) and then the average path length to a point is called the anomaly score. Put in a threshold and you have a classifier. The random partitioning, on average, creates shorter paths for anomalies/outliers, which is intuitive b/c it should be hard to separate inliers.
Permalink: machinelearningglossary
Tags: