Machine Learning Glossary

Alex Egg,

Information Theory

Shannon Entropy

Information entropy is the average rate at which information is produced by a stochastic source of data:

$$H = -\sum_i p_i \log_2 p_i$$

Where $p_i$ is the probability of class $i$. In the binary classification case, we can unroll the summation to:

$$H = -p \log_2 p - q \log_2 q$$

Where $p$ and $q$ are the probabilities of the two respective categories.
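The entropy formula can be sketched in a few lines of Python:

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum(p_i * log2(p_i)), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Fair coin: maximum entropy of 1 bit
print(entropy([0.5, 0.5]))  # 1.0
# Biased coin: less uncertainty, so lower entropy
print(entropy([0.9, 0.1]))  # ≈ 0.469
```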


Figure: Distribution of entropy for binary random variable (coin). For a fair coin p=.5, there is maximum entropy.

Recursive DFS tree traversal

def traverse(node, visited):
    visited.add(node)
    for child in node.children - visited:
        traverse(child, visited)
    return  # leaf: no unvisited children

traverse(root, set())

Iterative BFS tree traversal (BFS uses a queue, so it is naturally iterative, not recursive):

visited = set()
q = [root]
while q:
    node = q.pop(0)
    visited.add(node)
    for child in node.children - visited:
        q.append(child)

Linear Algebra


Norms

Measurement of the magnitude of a vector. Typically used for quantifying model parameter complexity, where complexity is viewed as either vector magnitude or sparseness.

Dot Product

Cosine: normalized dot product (angle between two vectors)


Distance measure between two vectors $A$ and $B$
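A minimal sketch of the dot product and its normalized form, cosine similarity, in plain Python:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    # normalized dot product: cos(theta) between a and b
    return dot(a, b) / (norm(a) * norm(b))

print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (orthogonal)
print(cosine_similarity([1, 2], [2, 4]))  # ≈ 1.0 (parallel)
```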



Bayes Rule

Example: There is a cancer test that is 75% accurate (TP rate/sensitivity):

The prior probability of having cancer in general is $P(Y=1)=.01$

The probability of a positive test $P(X=1)$:

What is the posterior probability $P(Y\given X)$ that you have cancer if you get a positive result on the test?
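Working the numbers in Python. Only sensitivity is given above, so as an assumption for this sketch we also take the test to be 75% specific, i.e. a 25% false-positive rate:

```python
def posterior(sensitivity, prior, false_positive_rate):
    # Bayes rule: P(Y=1 | X=1) = P(X=1 | Y=1) P(Y=1) / P(X=1)
    p_x = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_x

# 75% sensitive test, 1% prior, assumed 25% false-positive rate
print(posterior(0.75, 0.01, 0.25))  # ≈ 0.029: a positive test still leaves cancer unlikely
```

The low prior dominates: even a positive result only raises the probability to about 3%.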

Expectation Notation

For an unfair or weighted coin, the two outcomes are not equally likely. If we assign numbers to the outcomes — say, 1 for heads, 0 for tails — then we have created the mathematical object known as a random variable.

The expectation of a random variable is a number that attempts to capture the center of that random variable’s distribution. It can be interpreted as the long-run average of many independent samples from the given distribution. More precisely, it is defined as the probability-weighted sum of all possible values in the random variable’s support.

Which equals 3.5 for a die: $1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + \dots + 6 \cdot \frac{1}{6}$

Variance: Whereas expectation provides a measure of centrality, the variance of a random variable quantifies the spread of that random variable’s distribution. The variance is the average value of the squared difference between the random variable and its expectation:

$$\text{Var}(X) = E[(X - E[X])^2]$$
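Both definitions can be checked numerically for a fair die:

```python
def expectation(values, probs):
    # probability-weighted sum over the support
    return sum(v * p for v, p in zip(values, probs))

def variance(values, probs):
    # average squared difference from the expectation
    mu = expectation(values, probs)
    return sum(p * (v - mu) ** 2 for v, p in zip(values, probs))

faces = [1, 2, 3, 4, 5, 6]
uniform = [1 / 6] * 6
print(expectation(faces, uniform))  # ≈ 3.5
print(variance(faces, uniform))     # ≈ 2.9167 (35/12)
```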



Experiment Design


Figure: A normal/t distribution for the null hypothesis where $\mu=0$, w/ the 2 tails highlighted: the regions in which the sample $\mu$ must lie for the p-value to be below 0.05. You can see the sample mean lies between -2 and -1, and the goal of the test is to see if this is “weird” or not. It is weird if it lies in a tail.

Statistical Tests




MLE Maximum Likelihood Estimation

Method to fit data to a distribution. Goodfellow (2016) gushes over it.

When you do MLE with a Gaussian likelihood, you recover MSE. When you do MLE with a Bernoulli likelihood, you recover binary cross entropy. This means that implicitly, when you use MSE you are doing regression (estimation), and when you use CE you are doing classification.

MLE is a special case of Maximum A-posteriori Estimation (MAP) where the prior is uniform.


Given some measurements (e.g. weights of rabbits), get the normal distribution likelihood function, which is parameterized by $\mu$ and $\sigma$, and then maximize the likelihood function to find the estimates of the parameters that best fit the data.

MLE boils down to:


Given some Bernoulli samples (e.g. free throws), try to fit the Bernoulli distribution, which is parameterized by $p$, to the data. Get the PMF of the Bernoulli distribution: $f(x_i \given p) = p^{x_i} (1-p)^{1-x_i}$.

The likelihood function $L(p\given x)$ is the product over each data point given the parameter:

Take the log for convenience down the road:

And you have recovered binary cross entropy!

Take the derivative w/ respect to $p$ and set it equal to 0.
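A quick numerical check that the closed-form MLE $\hat{p} = \bar{x}$ really does maximize the log-likelihood (the free-throw data here is made up):

```python
import math

def bernoulli_log_likelihood(p, xs):
    # log of prod( p^x_i * (1-p)^(1-x_i) )
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in xs)

xs = [1, 0, 1, 1, 0, 1, 1, 0]  # hypothetical made/missed free throws
p_hat = sum(xs) / len(xs)      # closed-form MLE: the sample mean
print(p_hat)                   # 0.625
# the log-likelihood at p_hat beats nearby values of p
print(bernoulli_log_likelihood(p_hat, xs) > bernoulli_log_likelihood(0.5, xs))   # True
print(bernoulli_log_likelihood(p_hat, xs) > bernoulli_log_likelihood(0.75, xs))  # True
```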


Random sample of the number of textbooks that a person buys for a semester at uni.

The Poisson distribution is parameterized by $\lambda$. PMF:

Now compose the likelihood function:

Take the log, differentiate, and set the derivative to 0 to get the MLE estimator $\hat\lambda_{MLE} =\frac{ \sum x_i}{ n}$
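The same recipe can be sketched for the Poisson case (the textbook counts are made up):

```python
import math

def poisson_log_likelihood(lam, xs):
    # log of prod( lam^x_i * e^-lam / x_i! )
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in xs)

xs = [2, 0, 3, 1, 2, 4]      # hypothetical textbooks bought per student
lam_hat = sum(xs) / len(xs)  # closed-form MLE: the sample mean
print(lam_hat)               # 2.0
# the log-likelihood at lam_hat beats nearby values
print(poisson_log_likelihood(lam_hat, xs) > poisson_log_likelihood(1.5, xs))  # True
print(poisson_log_likelihood(lam_hat, xs) > poisson_log_likelihood(2.5, xs))  # True
```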

MAP Estimation

MLE is a special case of maximum a posteriori estimation (MAP) that assumes a uniform prior distribution of the parameters.

Includes Prior knowledge!

Bayes Rule:

Where $P(\Theta)$ can be a uniform distribution and thus ignored and $P(D)$ is considered a constant and thus ignored.

Or equivalently:

Where the left-hand side is the posterior and the right-hand side is the likelihood times the prior.

Expectation Maximization

Lambda Loss Paper uses this….

A way to calculate MLE/MAP

EM is usually useful when it’s hard to calculate the gradient of the log-likelihood. In cases I’m familiar with, this is due to the presence of latent variables per datapoint. You see it in Gaussian mixture models, hidden Markov models, and other problems. In order to avoid calculating the gradient, you iteratively:

  1. E-step: compute the expected values of the latent variables under the current parameters
  2. M-step: re-estimate the parameters to maximize the likelihood given those expectations

This is guaranteed to converge to a local optimum.

Time Series Analysis

ARIMA: Auto Regressive Integrated Moving Average

Autocorrelation (shown in an ACF plot) is the actual correlation between a data point and its lag (some period prior). When this condition exists, it may call for more advanced modeling than simple Linear Regression.

Autoregression is a term used to describe the process/model (aka Autoregressive model) when one value may be affected by prior values.

AR models, meaning regression models that use the past as input features, are no different from regression with time-unrelated features. What we are doing is just using previous y-values as input for a given point. We can manually construct such features directly.

When we add a 2-lag feature (meaning grosses from two weeks ago & last week in our case), the model is called AR-2 (AR for AutoRegressive). When we use up to three, it is called AR-3 and so on. Let’s do an AR-2.


Class Imbalance

Good to look at recall

Feature Engineering

BOW (Vocabularies)

When you have a variable list of features (like words in a sentence, or the number of products a user liked), then a common pattern is to transform this into a fixed length feature vector across the whole vocabulary $V$.

Buckets (Continuous Features)

Continuous features don’t need extra treatment, however, signal can be gained by knowing their properties. These tricks can help you stay on a linear model :)

Looking at the distribution of features, we can define relevant buckets for continuous variables, such as counters or timestamps. For instance, consider the number of products a user has seen. Seeing four products definitely reflects more engagement than seeing only one product. However, seeing a hundred and four products is not that different from seeing a hundred products, hence the need to bucketize continuous variables.

Interesting case study from scikit:

In 1D the linear regression can’t fit the data, but in 10 dimensions it can! This is because the inflexible linear model can’t fit the non-linear curve in 1d. However, if we let it fit in multiple dimensions, we can gain some expressiveness.

For example, consider raw data that represents the year a house was built. Instead of representing that year as a scalar numeric column, we could split the year into the following four buckets:


Date Range Represented as…
< 1960 [1, 0, 0, 0]
>= 1960 but < 1980 [0, 1, 0, 0]
>= 1980 but < 2000 [0, 0, 1, 0]
>= 2000 [0, 0, 0, 1]

Why would you want to split a number—a perfectly valid input to your model—into a categorical value? Well, notice that the categorization splits a single input number into a four-element vector. Therefore, the model now can learn four individual weights rather than just one; four weights creates a richer model than one weight. More importantly, bucketizing enables the model to clearly distinguish between different year categories since only one of the elements is set (1) and the other three elements are cleared (0). For example, when we just use a single number (a year) as input, a linear model can only learn a linear relationship. So, bucketing provides the model with additional flexibility that the model can use to learn.
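The bucketing in the table above can be sketched as:

```python
def bucketize_year(year):
    """One-hot bucket for a build year, using the four ranges above."""
    buckets = [year < 1960,
               1960 <= year < 1980,
               1980 <= year < 2000,
               year >= 2000]
    return [int(b) for b in buckets]

print(bucketize_year(1955))  # [1, 0, 0, 0]
print(bucketize_year(1994))  # [0, 0, 1, 0]
print(bucketize_year(2010))  # [0, 0, 0, 1]
```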

Feature Crosses

Moreover, some variables make more sense when crossed together to analyze the co-occurrence effect, such as the number of banners a user has seen on a particular device. The “fatigue” effect is heavily dependent on the screen used, and consequently on the type of device the user is on. See factorization machines.


Embeddings

Learnt dense representation of sparse (categorical) data.

Anytime you have a categorical feature, you can probably learn an embedding

KMeans Clustering

Clustering method which tries to minimize the in-cluster SSE. Parameterized by $K$ for the number of clusters/centroids. Very sensitive to the initial location of the centroids.

  1. Initialization of centroids (random or heuristic – very important)
  2. Assignment of data points to 1NN centroids
  3. Update centroids location (center of cluster/mean of cluster points)
  4. Repeat #2 and terminate upon convergence


Figure: KMeans run for 4 iterations
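A bare-bones sketch of the four steps above on 1-D data (random initialization, so results depend on the seed):

```python
import random

def kmeans(points, k, iters=10, seed=0):
    """Plain k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: random init
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:               # step 2: assign to 1-NN centroid
            i = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[i].append(p)
        for i, c in enumerate(clusters):  # step 3: recompute cluster means
            if c:
                centroids[i] = sum(c) / len(c)
    return sorted(centroids)           # step 4: repeat until convergence

print(kmeans([1.0, 1.1, 0.9, 10.0, 10.2, 9.8], k=2))  # ≈ [1.0, 10.0]
```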

Linear Regression


$$R^2 = 1 - \frac{SSE}{SST}$$

Where SSE is the mean squared loss of the optimized model and SST is the MSE loss of the average of the data. 0 is worst, 1 is best.
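R² can be computed directly from the SSE and SST definitions:

```python
def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model's error
    sst = sum((t - mean_y) ** 2 for t in y_true)             # baseline: predict the mean
    return 1 - sse / sst

y = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y, y))                     # 1.0 (perfect fit)
print(r_squared(y, [2.5, 2.5, 2.5, 2.5]))  # 0.0 (no better than the mean)
```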

You cannot blindly trust the R2 score or other summary statistics as a measure of goodness of fit. In particular, you cannot ignore the assumptions of a regression, which are:

(Most of these assumptions are in place to ensure the validity of the coefficient weights. For example if you ignored these assumptions and took some meaning from the coefficients then you will be misled.)

Least Squares

Closed form/analytical solution to the MSE/squared residuals/SSE error/loss function of linear regression.
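For simple (one-feature) linear regression, the closed-form solution reduces to the familiar slope/intercept formulas:

```python
def least_squares(xs, ys):
    # slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]          # exactly y = 2x + 1
slope, intercept = least_squares(xs, ys)
print(slope, intercept)    # 2.0 1.0
```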


Basis Function/Kernel

I think these are called kernels in Scikit, not sure the diff…

Transforms data points. Allows a linear combination in $w$ to learn non-linear relationships.

Consider the simple linear regression:

Now consider the basis form:

Where $\phi: \mathbb{R}^d \rightarrow \mathbb{R}^m$ (this could be non-linear)

Then we can view the above-mentioned simple linear regression as just a special case where the basis function is the identity function $\phi(x)=x$

Example Basis functions:

For example, you cannot learn the XOR function using a linear model w/ an identity basis function. However, you can w/ a polynomial basis function, b/c with the $x_1 x_2$ interaction term the data becomes linearly separable in $w$.
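A tiny demonstration: XOR becomes linearly separable once the basis adds the $x_1 x_2$ interaction term (the weights here are hand-picked, not learned):

```python
# XOR truth table: label 1 iff exactly one input is 1
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def phi(x1, x2):
    # polynomial basis: append the interaction term x1*x2
    return (x1, x2, x1 * x2)

# linear weights in the expanded space: score = w . phi(x) + b
w, b = (1, 1, -2), 0
for (x1, x2), label in data:
    score = sum(wi * fi for wi, fi in zip(w, phi(x1, x2))) + b
    print((x1, x2), score, label)  # score > 0 exactly for the XOR-positive points
```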

Great example here in the scikit docs:

Figure: Illustration of the mapping φ. On the left a set of samples in the input space, on the right the same samples in the feature space where the polynomial kernel K(x,y) (for some values of the parameters c and d) is the inner product. The hyperplane learned in feature space by an SVM is an ellipse in the input space.


Discriminant Function: Assigns a sample point to a class.

Logistic Regression

Where $\sigma$ is an activation function. You can also consider x as a basis function $\phi(x)$ as in Regressions.

The regression coefficients are usually estimated using maximum likelihood estimation.[26] Unlike linear regression with normally distributed residuals, it is not possible to find a closed-form expression for the coefficient values that maximize the likelihood function, so that an iterative process must be used instead; for example Newton’s method. This process begins with a tentative solution, revises it slightly to see if it can be improved, and repeats this revision until no more improvement is made, at which point the process is said to have converged.[26]

Binary Cross Entropy Loss

Binary x-ent can be recovered by maximizing the likelihood of a binomial distribution.


Occam’s Razor: Among competing hypotheses, the simplest is the best. In parametric models, simplicity can be considered either:

L2 regularization assumes a prior belief that weights should be small and normally distributed around zero. Interesting!

This is an interesting visualization from ISL which shows a red unregularized convex loss function in two dimensions w/ parameters $\beta_1$ and $\beta_2$, where the optimal value is $\hat{\beta}$. It also shows the two constrained regions of the L1 and L2 norm, which are centered around zero and small in magnitude. Thus we can see the optimal regularized optimum is not $\hat{\beta}$ but rather the intersection of the two regions. Notice that this makes both losses have smaller magnitude and in the L1 case makes the solution space sparse as $\beta_1=0$!

So in other words, to summarize, the absolute value’s going to be a constraint region that has sharp corners. In high dimensions, you have edges and corners. And along an edge or a corner, if you hit there, you get a 0. So this is, geometrically, why you get sparsity in the Lasso.


There is no universal model that performs best on all data. Each model has tradeoffs, advantages/disadvantages against others (see the no free lunch principle). In practice, it turns out that instead of choosing the best model out of a set of models, if we combine many varied models, the results are better — often much better — and at little extra effort.

Creating such model ensembles is now standard. In the simplest technique, called bagging, we generate random variations of the training set by resampling, learn a model on each, and combine the results by voting.

Tree Models (non parametric)

Decision Tree

Recursively splits the decision space. Tries to partition the data in such a way as to make PURE leaves – homogeneous groups.


How do you measure purity/disorder? Entropy:

$$H = -\sum_i p_i \log_2 p_i$$

Where $p_i$ is the probability of class $i$. In the binary classification case, we can unroll the summation to:

$$H = -p \log_2 p - q \log_2 q$$

Where $p$ and $q$ are the probabilities of the two respective categories.

X Y Class
    Green Circle
    Pink Cross
    Green Circle


Figure: 16 greens and 14 pinks = 30 total

What is the potential entropy in the distribution above?

Very high entropy. This is a good feature/column to split on.

Information Gain

Information Gain is the weighted average entropy of the children subtracted from the entropy of the parent
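A sketch of that definition (the labels and split here are made up):

```python
import math

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(parent, children):
    # parent entropy minus the size-weighted average entropy of the children
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = ["green"] * 8 + ["pink"] * 8
print(information_gain(parent, [["green"] * 8, ["pink"] * 8]))   # 1.0: a perfectly pure split
print(information_gain(parent, [parent[::2], parent[1::2]]))     # 0.0: a useless split
```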


  1. For the given branch (root if first iteration)
    1. Find (exhaustively) feature(column) w/ most potential information gain $f_i$
      1. For each column, For each value, calculate info gain, and return largest.
    2. Split the branch among the target classes with respect to $f_i$
    3. For each split (2): Repeat Step #1, unless leaf!

Basic D-trees massively overfit, so pruning helps simplify the tree in order to generalize better


The theory of bagging, in a nutshell, is that you have multiple models that you blend together to reduce variance and make your predictions more stable. That’s how a random forest works: it is a combination of n_estimators decision tree models that use majority voting (in the case of Random Forest Classifier) or straight averaging (in the case of Random Forest Regressor). Random Forests are called Bagging Meta-Estimators. Random forests further reduce variance by introducing randomness into the selection of the best feature to use at a particular split. Bagging estimators work best when you have very deep decision trees. The randomness prevents overfitting the model on the training data.

In bagging, we use many overfitted classifiers (low bias but high variance) and do a bootstrap to reduce the variance.

Bagging is short for Bootstrap Aggregation. It uses several versions of the same model trained on slightly different samples of the training data to reduce variance without any noticeable effect on bias. Bagging could be computationally intensive esp. in terms of memory.

The intuition behind bagging is that averaging a set of observations reduces the variance. Given $n$ observations $Z_1, \dots, Z_n$, each w/ variance $\sigma^2$, the variance of the mean $\bar{Z}$ is given by $\frac{\sigma^2}{n}$. Hence a natural way to reduce the variance, and thereby increase the prediction accuracy of a model, is to take many samples of the training set and average the resulting predictions.
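The $\frac{\sigma^2}{n}$ claim is easy to check by simulation:

```python
import random

random.seed(0)

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# a single noisy "prediction" vs the average of n = 25 of them
single = [random.gauss(0, 1) for _ in range(10000)]
averaged = [sum(random.gauss(0, 1) for _ in range(25)) / 25 for _ in range(10000)]
print(var(single))    # ≈ 1.0  (sigma^2)
print(var(averaged))  # ≈ 0.04 (sigma^2 / n, with n = 25)
```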

Examples: Random Forest


In boosting, we allow many weak classifiers (high bias with low variance) to learn from their mistakes sequentially, with the aim that they can correct their high bias problem while maintaining the low-variance property.

Examples: AdaBoost, GradientBoost


Boosting relies on training several (simple, usually decision stump) models successively, each trying to learn from the errors of the models preceding it. Boosting differs from other algorithms in that it gives weights to training examples (as opposed to linear models, which apply weights to features). So in essence we can weight the scarcer observations more heavily than the more populous ones. Boosting decreases bias and hardly affects variance (unless you are very sloppy). Depending on your n_estimators parameter, you are adding another inner loop to your training step, so the price of AdaBoost is an expensive jump in computational time and memory.


Making data linearly separable by transforming it: “kernel methods”


Maximum Margin Classifier. Finds the support vectors: the points closest to the decision boundary, and tries to maximize the width of the “street” between the classes. B/c the problem is convex, there is a mathematical guarantee of finding the global optimum.

Bias/Variance Tradeoff

If your model is overfitting, then it is high-variance: learning noise. If it is underfitting, it is too biased and not capturing the signal.

Often this is quantified by looking at the Learning Curves:


How to remove variance (overfitting):

How to remove bias (underfitting):

bias/variance visual analogy




ROC curves are appropriate when the observations are balanced between the classes


An evaluation metric that considers all possible classification thresholds.

The Area Under the ROC curve is the probability that a classifier will be more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive
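That probabilistic interpretation gives a direct (if $O(n^2)$) way to compute AUC, by counting pairwise wins (the scores below are made up):

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pairs = [(p, n) for p in pos_scores for n in neg_scores]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

pos = [0.9, 0.8, 0.4]   # classifier scores for true positives
neg = [0.7, 0.3, 0.2]   # classifier scores for true negatives
print(auc(pos, neg))    # ≈ 0.889: 8 of the 9 positive/negative pairs are ranked correctly
```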



0th Order Black Box

Only $f(x)$ can be evaluated

1st Order Gradient Based

$\nabla{f(x)}$ Can be evaluated

Vanilla GD has fixed shared learning rates across dimensions. Later innovations brought momentum. The latest innovations are in adaptive and individual learning rates across dimensions.


Figure: chart from fastAI showing Loss correlation with LR and Batch size. Largest batch is blue which shows the smoothest curve and highest (optimal) learning rate. Purple is smallest batch size and you can see it has the lowest optimal learning rate.

(Vanilla) Gradient Descent is very sensitive to step size. In ill-conditioned settings, it’s very useful to have an adaptive step size: in regions of large variability of the gradient we would like to take small step sizes, while in regions of small variability we would like to take large step sizes. If the function evaluated at the new point is smaller than the current function value (minus a term proportional to the norm of the gradient), then the current LR is maintained. Otherwise, it’s reduced to deal w/ the high-variance loss surface.

Figure: Gradient Descent w/ global adaptive LR (Backtracking line search)

2nd Order

$\nabla^2{f(x)}$ Can be evaluated

Kfac, Newton, Gauss-Newton, Quasi-Newton, (L)BFGS

Newton’s Method

Newton’s method can be used to find a minimum or maximum of a function $f(x)$.

The derivative is zero at a minimum or maximum, so local minima and maxima can be found by applying Newton’s method to the derivative. The iteration becomes:
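A sketch of that iteration, applied to a simple quadratic (for which a single Newton step lands exactly on the minimum):

```python
def newton_minimize(df, d2f, x0, iters=10):
    """Newton's method on f'(x): x <- x - f'(x) / f''(x)."""
    x = x0
    for _ in range(iters):
        x = x - df(x) / d2f(x)
    return x

# minimize f(x) = (x - 3)^2 + 1, so f'(x) = 2(x - 3) and f''(x) = 2
x_min = newton_minimize(lambda x: 2 * (x - 3), lambda x: 2, x0=0.0)
print(x_min)  # 3.0
```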

Neural Networks

Function Approximators.

Multi-layer perceptron is a misnomer. It should be multi-layer logistic regression!

B/c of the non-linear activations, there is no closed-form solution and we have to solve the non-convex problem iteratively.

Deep Learning

In theory, the deeper you go the more expressive power you can achieve, however there are tradeoffs:


CNNs are motivated by 3 factors:

  1. Locality: neighbor relationships in spatial data (images, audio, sequences) are ignored by FFNs (see: local receptive fields below)
  2. Invariance: FFNs are not translation and scale invariant. They will memorize orientation and size. To mitigate this you would have to heavily regularize or do data augmentation of all possible combinations of orientation translations and size scales. (See subsampling)
  3. Parameter Scale: Fully Connected Networks don’t scale well especially to image data: e.g. 200x200x3 image, would lead to neurons that have 200x200x3 = 120,000 weights. Multiple layers of this would lead to overfitting due to too many parameters. A CNN layer on the same input volume would only need to learn 10x5x5x3 = 750 parameters! (Assuming 10 5x5 filters over 3 channels ). (See Weight sharing below)

These limitations are remedied in CNNs by:

  1. Local receptive fields
  2. Weight sharing
  3. subsampling



The figure above is a great example of the advantage that weight sharing in CNNs gives. Six 5x5 filters on a 32x32x3 image give a 28x28x6 activation map with only 5x5x3x6=450 parameters, compared to ~14M parameters if we flattened the image and did it w/ a fully connected layer!



Each “A” sees two inputs. Each “B” sees two “As” which have a common input so each “B” has a receptive field of 3. Each B is exposed to exactly 3 inputs. In other words, receptive field grows linearly w/ layers.

Such 1D convnets can be competitive with RNNs on certain sequence-processing problems, usually at a considerably cheaper computational cost. Recently, 1D convnets, typically used with dilated kernels, have been used with great success for audio generation and machine translation. In addition to these specific successes, it has long been known that small 1D convnets can offer a fast alternative to RNNs for simple tasks such as text classification and timeseries forecasting.


Like LSTMs, causal convolutions can model sequences with long-term dependencies. This is achieved in two ways: stacking convolutional layers (with padding, every convolutional layer preserves the shape of the input), and dilation: insertion of gaps into the convolutional filters (otherwise known as atrous convolutions). The filters are applied to the sequence in a left-to-right fashion, emitting a representation at each step. They are causal in that their output at time t is conditional on input up to t-1: this is necessary to ensure that they do not have access to the elements of the sequence we are trying to predict.

Spotlight CNN Sequential model:

The convolution is causal because future states are never part of the convolution’s receptive field; this is achieved by left-padding the sequence.

Dilation: setting this to a number greater than 1 inserts gaps into the convolutional layers, increasing their receptive field without increasing the number of parameters. Stacked dilated convolutions enable networks to have very large receptive fields with just a few layers, while preserving the input resolution throughout the network as well as computational efficiency.

Dilated convolutions AKA atrous convolutions AKA convolutions with holes are another method of increasing the receptive field without angering the gradient gods. When we looked at stacking layers so far, we saw that the receptive field grows linearly with depth. Dilated convolutions let us grow the receptive field exponentially with depth.

Here we have a 3X3 filter applied with a dilation of 1,2 and 3. With a dilation of 1 we have a standard convolution. With a dilation of 2 we apply the same 3X3 filter but use every second pixel

Oord, Aaron van den, et al. “Wavenet: A generative model for raw audio.” arXiv preprint arXiv:1609.03499 (2016).


P262 of the Keras book


CNNs are translation invariant; however, we want them to be equivariant!

One of the goals of subsampling/pooling is to introduce invariance to the model. Meaning it tries to make the neural activities invariant to small viewpoint changes.

Input > Conv > relu > pooling

Subsampling or pooling in convnets loses the spatial relationships in the images. For example, a face isn’t an arbitrary combination of eyes + nose + ears + mouth, it’s a specific arrangement of the pieces.

Pooling throws away information about the precise position of the entity within the region. Pooling reports the activity of the most active neurons and then forgets where they are, which gives you a small amount of translational invariance.

Capsules are also robust against adversarial attacks b/c they learn the distribution of each class. This can be used to check for adversarial input from a different distribution.

Image Classification w/ Capsules

1 capsule per class in the top layer, so most examples only have 10 capsules (e.g. MNIST), but ImageNet would have 1000. It is at least O(n) where n is the number of classes. Routing-by-agreement causes it to be slow to train b/c of the inner loop.


Feed characters into layers of RNNs and get final state at end that represents the sequence.

With a basic RNN cell, we see a massive drop in performance when it comes to long sequences and the network needs to remember patterns which have occurred way at the beginning to infer things correctly at a current time step. And this is because of exploding and vanishing gradients.

LSTMs can remember information from far in the past and also selectively forget stuff that is not required.

Residual/Skip Connections

“How do I make my network very deep without losing the gradient signal?”

The general idea is to reduce the distance between the signal coming from the network’s loss and each individual layer. The way this is done is by adding a residual/direct connection between every layer and its predecessors. That way, the gradient can flow from each layer to its predecessors directly.

Applied ML

Recommender Systems



Supervised task to rerank a list for optimal order. Data can be relevance judgements or log data. Log data gives you implicit ranking preference from clickstreams. With clickstream data you can either model clicks or rankings.



Distributed Representation/Representation Learning are popular across all fields currently. It all started w/ word2vec where items are clustered in an embedding space with respect to how similar they are in some $\text{context}$. These dense representations are then commonly used for downstream tasks like features for classification.


Given a context (the surrounding words), predict the target word (CBOW); skip-gram inverts this and predicts the context from the target word. Error propagates back down to the embeddings.

Figure: Skip-gram architecture

Candidate Sampling

Softmax over 5000 classes is slow.

So if you have 5000 classes, to evaluate the softmax layer you need to do 5000 forward passes through your function $f$ for the different values of $x_i$. Computational complexity: $O(\abs{V})$

This has motivated work into approximate softmax methods often called Candidate Sampling:

Text Classification

Given a sentence, model the probability of its label $P(y\given X)$, where $X$ is a variable-length list of words.

Nice guide here:

Because bag-of-words isn’t an order-preserving tokenization method (the tokens generated are understood as a set, not a sequence, and the general structure of the sentences is lost), it tends to be used in shallow language-processing models rather than in deep-learning models. Extracting n-grams is a form of feature engineering, and deep learning does away with this kind of rigid, brittle approach, replacing it with hierarchical feature learning.


Classic approach which encodes the sentences/observations into a fixed length feature vector of length $\abs{V}$. The encoding can be simple indicators for a word, or counts, or more popularly tf-idf counts.

Once you have your fixed length feature, you can use any type of classifier.


Instead of a fixed length feature vector w/ frequency information, we learn/use an embedding for each word/token.


Figure: A sequential Feed-forward MLP(DNN). Note for the 3D embedding layer: Although the diagram is ambiguous, the embeddings for each activated movie are either added or averaged.

One strategy to combine the speed and lightness of convnets with the order-sensitivity of RNNs is to use a 1D convnet as a preprocessing step before an RNN (see figure 6.30). This is especially beneficial when you’re dealing with sequences that are so long they can’t realistically be processed with RNNs, such as sequences with thousands of steps. The convnet will turn the long input sequence into much shorter (downsampled) sequences of higher-level features. This sequence of extracted features then becomes the input to the RNN part of the network.


Anomaly Detection

Isolation Forests: Unsupervised. Overfit a bunch of trees (a forest), and then the average path length to a point is called the anomaly score. Put in a threshold and you have a classifier. The random partitioning, on average, creates shorter paths for anomalies/outliers, which is intuitive b/c it should be hard to separate inliers.



Last edited by Alex Egg, 2019-01-15 17:29:57