Machine Learning Glossary

Alex Egg


Information Theory

Shannon Entropy

Information entropy is the average rate at which information is produced by a stochastic source of data:

$H = -\sum_i p_i \log_2 p_i$

Where $p_i$ is the probability of class $i$. In the binary classification case, we can expand the summation to:

$H = -p \log_2 p - q \log_2 q$

Where $p$ and $q = 1 - p$ are the probabilities of the two categories.


Figure: Distribution of entropy for binary random variable (coin). For a fair coin p=.5, there is maximum entropy.
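As a quick check of the curve above, a short numpy sketch (base-2 logs, so entropy is in bits):

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli(p) random variable."""
    terms = [x * np.log2(x) for x in (p, 1.0 - p) if x > 0]  # 0*log(0) := 0
    return -sum(terms)

print(binary_entropy(0.5))  # 1.0: a fair coin carries the most information
print(binary_entropy(0.9))  # ~0.47: a biased coin is more predictable
```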

Recursive DFS tree traversal

def traverse(node, visited):
    visited.add(node)
    for child in node.children - visited:
        traverse(child, visited)

traverse(root, set())

Iterative BFS tree traversal (BFS works off a queue, so it is naturally iterative rather than recursive):

def traverse(root):
    visited = set()
    q = [root]
    while q:
        node = q.pop(0)  # FIFO queue
        visited.add(node)
        for child in node.children - visited:
            q.append(child)

traverse(root)

Linear Algebra


Norm

Measurement of the magnitude of a vector. Typically used for quantifying model parameter complexity, where complexity is viewed as either vector magnitude or sparseness.


Distance measure between two vectors $A$ and $B$

Dot Product = $\sum_i A_i B_i$

Cosine: normalized dot product (angle between two vectors)
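A minimal numpy sketch of both quantities (the vectors here are made up):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # b = 2a, so the angle between them is 0

dot = np.dot(a, b)  # sum_i A_i * B_i
cos = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # normalized dot product
print(dot, cos)  # 28.0, ~1.0 (parallel vectors)
```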


Dot product is an $O(n)$ scalar operation. Can a vectorized operation (BLAS etc.) run faster? Is it faster in just wall time, or is it actually less complex? The same question applies to matrix multiply, which is $O(n^3)$.



Bayes Rule

Example: There is a cancer test that is 75% accurate (TP rate/sensitivity):

The prior probability of having cancer in general is $P(Y=1)=.01$

The probability of a positive test is $P(X=1)$:

What is the posterior probability $P(Y \mid X)$ that you have cancer if you get a positive result on the test?
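A sketch of the arithmetic. The problem only states the sensitivity and the prior, so the specificity of 0.75 below is an assumption added to make $P(X=1)$ computable:

```python
sens = 0.75   # P(X=1 | Y=1): given
prior = 0.01  # P(Y=1): given
spec = 0.75   # P(X=0 | Y=0): an ASSUMPTION, the problem doesn't state it

# total probability of a positive test
p_pos = sens * prior + (1 - spec) * (1 - prior)

# Bayes rule: P(Y=1 | X=1) = P(X=1 | Y=1) P(Y=1) / P(X=1)
posterior = sens * prior / p_pos
print(posterior)  # ~0.029: the low prior dominates the 75% sensitivity
```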

Expectation Notation

For an unfair or weighted coin, the two outcomes are not equally likely. If we assign numbers to the outcomes — say, 1 for heads, 0 for tails — then we have created the mathematical object known as a random variable.

The expectation of a random variable is a number that attempts to capture the center of that random variable’s distribution. It can be interpreted as the long-run average of many independent samples from the given distribution. More precisely, it is defined as the probability-weighted sum of all possible values in the random variable’s support.

Which equals 3.5 for a die: $1\cdot\frac{1}{6} + 2\cdot\frac{1}{6} + \dots + 6\cdot\frac{1}{6}$

Variance: Whereas expectation provides a measure of centrality, the variance of a random variable quantifies the spread of that random variable’s distribution. The variance is the average value of the squared difference between the random variable and its expectation,
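Both quantities for a fair die, as a quick numpy check:

```python
import numpy as np

values = np.arange(1, 7)   # faces of a fair die
probs = np.full(6, 1 / 6)  # uniform distribution

expectation = np.sum(values * probs)                   # probability-weighted sum
variance = np.sum((values - expectation) ** 2 * probs)
print(expectation, variance)  # 3.5, ~2.917
```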



Experiment Design


Figure: A normal/t distribution for the null hypothesis where $\mu=0$ w/ 2 tails highlighted, the regions where the sample $\mu$ must lie for the p-value to be below 0.05. You can see the sample mean lies between -2 and -1, and the goal of the test is to see if this is “weird” or not. It is weird if it is in the tail.

Statistical Tests




MLE Maximum Likelihood Estimation

Method to fit data to a distribution. Goodfellow (2016) gushes over it.

When you try MLE on a gaussian you get MSE. When you try MLE on a Binomial you get Binary X-Entropy. This means that implicitly when you use MSE, you are doing regression (continuous estimation) and when you use X-Ent, you are doing (binary) classification.

MLE is a special case of Maximum A-posteriori Estimation (MAP) where the prior is uniform.


Given some measurements (i.e. weights of rabbits), get the normal-distribution likelihood function, which is parameterized by $\mu$ and $\sigma$, and then maximize the likelihood function to find the parameter estimates that fit the data.

MLE works out to:



Given some Bernoulli samples (i.e. free throws), try to fit the Bernoulli distribution, which is parameterized by $p$, to the data. Get the PMF of the Bernoulli distribution:

The Likelihood function $L(p\given x)$ is the product of each data point given the parameter:

Take the log for convenience down the road:

And you have recovered binary cross entropy.

Take derivative w/ respect to $p$ and solve for 0.
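A sketch with made-up free-throw data, checking that the closed-form estimate $\hat{p} = \frac{\sum x_i}{n}$ agrees with a brute-force maximization of the log-likelihood:

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # made-up free throws: 1 = make

def log_likelihood(p):
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

grid = np.linspace(0.01, 0.99, 999)
p_grid = grid[np.argmax([log_likelihood(p) for p in grid])]
p_closed = x.mean()  # setting the derivative to zero gives p = sum(x_i) / n

print(p_closed, p_grid)  # both ~0.75
```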


Random sample of the number of textbooks that a person buys for a semester at uni.

The Poisson distribution is parameterized by $\lambda$. PMF:

Now compose Likelihood function:

Log and take derivative at 0 to get the MLE estimator of $\hat\lambda_{MLE} =\frac{ \sum x_i}{ n}$
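Same check for the Poisson case, with made-up textbook counts (the $\log x_i!$ term is dropped since it doesn't depend on $\lambda$):

```python
import numpy as np

x = np.array([2, 0, 3, 1, 1, 2])  # made-up textbook counts

def log_likelihood(lam):
    # log of prod_i lam^x_i e^{-lam} / x_i!; the log(x_i!) term is constant in lam
    return np.sum(x * np.log(lam) - lam)

grid = np.linspace(0.1, 5.0, 4901)
lam_grid = grid[np.argmax([log_likelihood(l) for l in grid])]
lam_closed = x.mean()  # the closed-form MLE: sum(x_i) / n

print(lam_closed, lam_grid)  # both ~1.5
```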

MAP Estimation

MLE is a special case of maximum a posteriori estimation (MAP) that assumes a uniform prior distribution of the parameters.

Includes Prior knowledge!

Bayes Rule:

Where $P(\Theta)$ can be a uniform distribution and thus ignored and $P(D)$ is considered a constant and thus ignored.

Or equivalently:

Where the left side is the posterior and the right side is the likelihood times the prior.

Expectation Maximization

Lambda Loss Paper uses this….

A way to calculate MLE/MAP

EM is usually useful when it’s hard to calculate the gradient of the log-likelihood. In cases I’m familiar with, this is due to the presence of latent variables per datapoint. You see it in Gaussian mixture models, hidden Markov models, and other problems. In order to avoid calculating the gradient, you iteratively:

  1. E-step: compute the expected values of the latent variables under the current parameter estimates
  2. M-step: re-estimate the parameters by maximizing the expected log-likelihood from the E-step

This is guaranteed to converge to a local optimum.

Time Series Analysis

Classical TS modeling is generative: one tries to model the process that generates the signal. The newer package Prophet is a discriminative model which tries to fit a curve to the data.

Autocorrelation Function Plot

The autocorrelation function is a measure of the correlation between observations of a time series that are separated by k time units (yt and yt–k).

Use the autocorrelation function and the partial autocorrelation functions together to identify ARIMA models. Examine the spikes at each lag to determine whether they are significant. A significant spike will extend beyond the significance limits, which indicates that the correlation for that lag doesn’t equal zero.

The data should be stationary before you interpret the autocorrelation plot. A stationary time series has a mean, variance, and autocorrelation function that are essentially constant through time.
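A minimal numpy sketch of the lag-k autocorrelation (the toy series is made up; statsmodels' `acf` computes the same quantity with more care):

```python
import numpy as np

def autocorr(y, k):
    """Correlation between y_t and y_{t-k}."""
    y = y - y.mean()
    return np.sum(y[:-k] * y[k:]) / np.sum(y * y)

y = np.tile([1.0, -1.0], 50)  # a perfectly periodic series with period 2
r1, r2 = autocorr(y, 1), autocorr(y, 2)
print(r1, r2)  # strong negative spike at lag 1, strong positive spike at lag 2
```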


ARIMA: Auto Regressive Integrated Moving Average

Dilated Causal Conv Net

Generative Models

Discriminative Models

Everything Else!


Class Imbalance

Under class imbalance, accuracy is misleading, so it is good to look at recall (and precision).

Feature Engineering

BOW (Vocabularies)

When you have a variable-length list of features (like words in a sentence, or the products a user liked), a common pattern is to transform it into a fixed-length feature vector across the whole vocabulary $V$.

Buckets (Continuous Features)

Continuous features don’t need extra treatment, however, signal can be gained by knowing their properties. These tricks can help you stay on a linear model :)

Looking at the distribution of features, we can define relevant buckets for continuous variables, such as counters or timestamps. For instance, consider the number of products a user has seen. Seeing four products definitely reflects more engagement than seeing only one product. However, seeing a hundred and four products is not that different from seeing a hundred products, hence the need to bucketize continuous variables.

Interesting case study from scikit:

In 1D the linear regression can’t fit the data, but in 10 dimensions it can! This is because the inflexible linear model can’t fit the non-linear curve in 1d. However, if we let it fit in multiple dimensions, we can gain some expressiveness.

For example, consider raw data that represents the year a house was built. Instead of representing that year as a scalar numeric column, we could split the year into the following four buckets:


Date Range Represented as…
< 1960 [1, 0, 0, 0]
>= 1960 but < 1980 [0, 1, 0, 0]
>= 1980 but < 2000 [0, 0, 1, 0]
>= 2000 [0, 0, 0, 1]

Why would you want to split a number—a perfectly valid input to your model—into a categorical value? Well, notice that the categorization splits a single input number into a four-element vector. Therefore, the model now can learn four individual weights rather than just one; four weights create a richer model than one weight. More importantly, bucketizing enables the model to clearly distinguish between different year categories since only one of the elements is set (1) and the other three elements are cleared (0). For example, when we just use a single number (a year) as input, a linear model can only learn a linear relationship. So, bucketing provides the model with additional flexibility that the model can use to learn.
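The bucketing above can be sketched in a few lines (the bucket boundaries come from the table; `np.searchsorted` does the binning):

```python
import numpy as np

boundaries = [1960, 1980, 2000]

def bucketize(year):
    """One-hot encode a year into the four date-range buckets from the table."""
    idx = np.searchsorted(boundaries, year, side='right')  # which bucket year falls in
    one_hot = np.zeros(len(boundaries) + 1, dtype=int)
    one_hot[idx] = 1
    return one_hot

print(bucketize(1955))  # [1 0 0 0]
print(bucketize(1994))  # [0 0 1 0]
```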

Feature Crosses

Moreover, some variables make more sense when crossed together to analyze the co-occurrence effect, such as the number of banners a user has seen on a particular device. The “fatigue” effect is heavily dependent on the screen used, and consequently on the device type the user is on. See factorization machines.

Latitude Longitude

Latitude and longitude as numerical features alone don’t give much discriminative power. An interesting technique that I like is doing bucketed feature crosses. So first you discretize by some type of bucketing to make lat and long ranges, then you do feature crosses on the ranges to make geo squares!

How do you handle cases where a point on the map is on the boundary of one of your grid squares? Reminds me of the time encoding task….


Embeddings

Learnt dense representation of sparse (categorical) data.

Anytime you have a categorical feature, you can probably learn an embedding


Cartesian Space: Time-of-day

Encoding the numerical time-of-day as categories from 0-23 ruins the continuity from 11 PM to 12 AM. Instead consider time as a vector in polar space and then convert to cartesian space. This transformation will give you two dimensions in cartesian space.

The naive way to do this is to bucketize into parts of day or to make a continuous feature out of the hours. But there is a more sophisticated way using trigonometry.

The problem with using hour of day directly is that it ruins the circular nature of time by adding a discontinuity between hour 23 and hour 0. Instead, consider time as an angle on the unit circle and then convert that to 2d cartesian space.

The discontinuity is bad b/c we are telling the model that 11 PM and 12 AM are maximally far apart, when they are actually adjacent.

Technically, this means adding two new features x,y in cartesian space. Now hour 23 and hour 0 are right next to each other.

import numpy as np

df['hr_sin'] = np.sin(df.hr*(2.*np.pi/24))
df['hr_cos'] = np.cos(df.hr*(2.*np.pi/24))
df['mnth_sin'] = np.sin((df.mnth-1)*(2.*np.pi/12))
df['mnth_cos'] = np.cos((df.mnth-1)*(2.*np.pi/12))

Dimensionality Reduction

KMeans Clustering

Clustering method which tries to minimize the in-cluster SSE. Parameterized by $K$ for the number of clusters/centroids. Very sensitive to the initial location of the centroids.

  1. Initialization of centroids (random or heuristic; very important)
  2. Assignment of data points to 1NN centroids
  3. Update centroids location (center of cluster/mean of cluster points)
  4. Repeat #2 and terminate upon convergence


Figure: KMeans run for 4 iterations
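The four steps above as a minimal numpy sketch (the toy data and the fixed iteration count are mine; a real implementation would test for convergence):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # 1. init from the data
    for _ in range(n_iter):
        # 2. assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # 3. move each centroid to the mean of its cluster
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels  # 4. fixed iteration count stands in for a convergence test

# two well-separated blobs of 5 points each
X = np.arange(10.0).reshape(5, 2) / 10.0
X = np.concatenate([X, X + 10.0])
centroids, labels = kmeans(X, 2)
print(centroids)  # one centroid near [0.4, 0.5], the other near [10.4, 10.5]
```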

Linear Regression


$R^2 = 1 - \frac{SSE}{SST}$

Where SSE is the squared-error loss of the optimized model and SST is the squared-error loss of predicting the mean of the data. 0 is worst, 1 is best.

You cannot blindly trust the $R^2$ score or other summary statistics as a measure of goodness of fit. In particular, you cannot ignore the assumptions of a regression, which are:


(Most of these assumptions are in place to ensure the validity of the coefficient weights. For example if you ignored these assumptions and took some meaning from the coefficients then you will be misled.)

  1. Linear relationship between predictor and response and
  2. no multicollinearity
    • Predictors/features are correlated with each other. This gives wrong coefficient estimates; however, it does not affect the response (so ignore it :) )
  3. No auto-correlation (statistical independence of the errors). Autocorrelation occurs when the residuals are not independent from each other.
    1. If your errors are not independent, then your observations are auto-correlated, which introduces multi-collinearity and violates the second assumption.
    2. Bias: Your “best fit line” will likely be way off because it will be pulled away from the “true line” by the effect of the lagged errors.
    3. Inconsistency: Given the above, your sample estimators are unlikely to converge to the population parameters.
    4. Inefficiency: While it is theoretically possible, your residuals are unlikely to be homoskedastic if they are autocorrelated. Thus, your confidence intervals and your hypothesis tests will be unreliable.
  4. Normality of the residuals/error distribution
  5. Homoscedasticity (stationary/constant variance) of the residuals/error distribution

Least Squares

Closed-form/analytical solution to the MSE/squared-residuals/SSE loss function of linear regression:

$\hat{w} = (X^TX)^{-1}X^Ty$

where $X \in \mathbb{R}^{m \times p}$ and $X^TX \in \mathbb{R}^{p \times p}$: $[p \times m][m \times p] = [p \times p]$. Matrix inversion is expensive, cubic in the dimension of the square matrix, so the cost grows as $p^3$ where $p$ is the number of parameters. You are incentivized to keep the param count down (below 100ish…?).
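A quick check of the closed form against synthetic data (the data and noise level are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # m=100 samples, p=3 parameters
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.01, size=100)

# normal equations: w = (X^T X)^{-1} X^T y
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(w_hat)  # ~[1, -2, 0.5]
```

In practice `np.linalg.lstsq(X, y, rcond=None)` is preferred: it avoids forming and inverting $X^TX$ explicitly.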

Basis Function/Kernel

Linear models, such as logistic regression and linear regression, are appealing because they can be fit efficiently and reliably, either in closed form or with convex optimization. Linear models also have the obvious defect that the model capacity is limited to linear functions, so the model cannot understand the interaction between any two input variables.

Transforms data points. Allows a linear combination in $w$ to learn non-linear relationships.

Consider the simple linear regression:

Now consider the basis form:

Where $\phi: \mathbb{R}^d \rightarrow \mathbb{R}^m$ (this could be non-linear)

Then we can view the above-mentioned simple linear regression as just the special case where the basis function is the identity function $\phi(x)=x$.

Example Basis functions:

For example, you cannot learn the XOR function using a linear model w/ an identity basis function. However, you can w/ a polynomial basis function, b/c the model stays linear in $w$ while the $x_1 x_2$ term makes the classes separable.
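A sketch of this: logistic regression fit by plain gradient descent, once with identity features and once with the extra $x_1 x_2$ term (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])  # XOR truth table

def fit_predict(phi):
    """Logistic regression trained by full-batch gradient descent on features phi."""
    F = np.column_stack([np.ones(len(phi)), phi])  # add a bias column
    w = np.zeros(F.shape[1])
    for _ in range(5000):
        p = 1 / (1 + np.exp(-F @ w))
        w -= 0.5 * F.T @ (p - y) / len(y)  # gradient of the negative log-likelihood
    return (1 / (1 + np.exp(-F @ w)) > 0.5).astype(int)

identity = fit_predict(X)  # identity basis: no linear separator exists
poly = fit_predict(np.column_stack([X, X[:, 0] * X[:, 1]]))  # adds the x1*x2 term
print(identity, poly)  # [0 0 0 0] [0 1 1 0]
```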

Great example here in the scikit docs:

Figure: Illustration of the mapping φ. On the left a set of samples in the input space, on the right the same samples in the feature space where the polynomial kernel K(x,y) (for some values of the parameters c and d) is the inner product. The hyperplane learned in feature space by an SVM is an ellipse in the input space.

Deep Learning: Instead of choosing a $\phi$ by hand, the strategy of DL is to learn $\phi$.


Discriminant Function: Assigns a sample point to a class.

Label Smoothing/Soft Targets

Keep in mind though that you might be limiting yourself to dogs and cat types in your training set: The model might not generalize as well to a Shiba Inu as the model that is only trained with two classes.

yep, by introducing more fine grain labels, you force the neural net to learn factors of variation

bonus points: if you use “soft targets”, you’ll get even better generalization as your neural net will learn to place different breeds near each other on the manifold

yep, think of it as smearing the probability mass across different dog breeds (instead of 1-hot encoding)

See, I always get a bump in accuracy when I use label smoothing… but I get a hit in macro averaged F1 and macro averaged PR AUC. (See knowledge distillation)

essentially take the 1-hot vector (0,0,1,0), and make the targets soft, or smooth out the labels (they are the same thing): (.1,.1,.9,.1).

The network learns a specific achievable value for each class, which prevents overfitting (also makes sure that in cross entropy the network also tries to decrease the score for the wrong labels, rather than just increase the score for the right labels)

generally you don’t make the smoothed labels sum to 1.

The inception paper (rethinking) introduced this idea.
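A sketch of the Inception-style formulation, where the smoothed target mixes the 1-hot vector with a uniform distribution (note this variant does sum to 1, unlike the (.1, .1, .9, .1) example above):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Mix the 1-hot target with a uniform distribution over the k classes."""
    k = one_hot.shape[-1]
    return one_hot * (1 - eps) + eps / k

t = np.array([0.0, 0.0, 1.0, 0.0])
s = smooth_labels(t)
print(s)  # [0.025 0.025 0.925 0.025] -- still sums to 1
```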

Logistic Regression

Where $\sigma$ is an activation function. You can also pass $x$ through a basis function $\phi(x)$ as in regression.

The Logistic Regression(classification) coefficients are usually estimated using MLE. Unlike linear regression with normally distributed residuals, it is not possible to find a closed-form expression for the coefficient values that maximize the likelihood function, so that an iterative process must be used instead; for example Newton’s method or GD. This process begins with a tentative solution, revises it slightly to see if it can be improved, and repeats this revision until no more improvement is made, at which point the process is said to have converged.

Binary Cross Entropy Loss

Binary x-ent can be recovered by maximizing the likelihood of a binomial distribution.
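The resulting loss, as a small numpy sketch (the clipping epsilon is a standard numerical-stability trick, not part of the math):

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Mean negative log-likelihood of labels y under predicted probabilities p."""
    p = np.clip(p, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.1, 0.8])
loss = binary_cross_entropy(y, p)
print(loss)  # ~0.145
```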


Regularization

Regularization is any process that helps prevent overfitting in a model.

Weight Decay

Occam’s Razor: Among competing hypotheses, the simplest is the best. In parametric models, simplicity can be considered either:

L2 regularization assumes a prior belief that weights should be small and normally distributed around zero. Interesting!

This is an interesting visualization from ISL which shows a red unregularized convex loss function in two dimensions w/ parameters $\beta_1$ and $\beta_2$, where the optimal value is $\hat{\beta}$. It also shows the two constrained regions of the L1 and L2 norm, which are centered around zero and small in magnitude. Thus we can see the optimal regularized optimum is not $\hat{\beta}$ but rather the intersection of the two regions. Notice that this makes both losses have smaller magnitude and in the L1 case makes the solution space sparse as $\beta_1=0$!

So in other words, to summarize, the absolute value’s going to be a constraint region that has sharp corners. In high dimensions, you have edges and corners. And along an edge or a corner, if you hit there, you get a 0. So this is, geometrically, why you get sparsity in the Lasso.

Interesting L1 visualization:


There is no universal model that performs best on all data. Each model has tradeoffs, advantages/disadvantages against others (see the no-free-lunch principle). In practice, it turns out that instead of choosing the single best model out of a set of models, if we combine many different ones, the results are better, often much better, and at little extra effort.

Creating such model ensembles is now standard. In the simplest technique, called bagging, several versions of the same model are trained on bootstrap samples of the data and their predictions averaged.

Tree Models (non parametric)

Decision Tree

Recursively splits the decision space, trying to partition the data into PURE leaves: homogeneous groups.


Purity/disorder measurement of set:

where $n$ is the class id.

Information Gain

Information Gain is the average entropy of the children subtracted from the entropy of the parent:

Where $p$ is the ratio of items in set A.


  1. For the given branch (root if first iteration)
    1. Find (exhaustively) the feature/column w/ the most potential information gain $f_i$
      1. For each column, for each value, calculate the info gain @ that value, and return the largest.
    2. Split the branch with respect to $f_i$
    3. For each split from (2): repeat step #1, unless it is a leaf!
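The entropy/information-gain bookkeeping behind step 1, sketched in numpy (the toy labels are made up):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, splits):
    """Entropy of the parent minus the size-weighted average entropy of the children."""
    n = len(parent)
    children = sum(len(s) / n * entropy(s) for s in splits)
    return entropy(parent) - children

parent = np.array([0, 0, 1, 1])
g_perfect = information_gain(parent, [parent[:2], parent[2:]])     # pure children
g_useless = information_gain(parent, [parent[::2], parent[1::2]])  # mixed children
print(g_perfect, g_useless)  # 1.0, 0.0
```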
Entropy Example

Where $p_i$ is the probability of class $i$. In the binary classification case, we can expand the summation to:

Where $p$ and $q$ are the probabilities of the two categories.

X Y Class
    Green Circle
    Pink Cross
    Green Circle


Figure: 16 greens and 14 pinks = 30 total

What is the entropy of the distribution above?

$H = -\frac{16}{30}\log_2\frac{16}{30} - \frac{14}{30}\log_2\frac{14}{30} \approx 0.997$

Very high entropy. This is a good feature/column to split on.


Basic D-trees massively overfit, so pruning helps simplify the tree in order to generalize better


The theory of bagging, in a nutshell, is that you have multiple models that you blend together to reduce variance and make your predictions more stable. That’s how a random forest works: it is a combination of n_estimators decision tree models that use majority voting (in the case of Random Forest Classifier) or straight averaging (in the case of Random Forest Regressor). Random Forests are called Bagging Meta-Estimators. Random forests further reduce variance by introducing randomness into the selection of the best feature to use at a particular split. Bagging estimators work best when you have very deep decision trees. The randomness prevents overfitting the model on the training data.

In bagging, we use many overfitted classifiers (low bias but high variance) and do a bootstrap to reduce the variance.

Bagging is short for Bootstrap Aggregation. It uses several versions of the same model trained on slightly different samples of the training data to reduce variance without any noticeable effect on bias. Bagging could be computationally intensive esp. in terms of memory.

The intuition behind bagging is that averaging a set of observations reduces the variance. Given $n$ observations $Z_1, \dots, Z_n$, each w/ variance $\sigma^2$, the variance of the mean $\bar{Z}$ is $\frac{\sigma^2}{n}$. Hence a natural way to reduce the variance, and thereby increase the predictive accuracy of a model, is to take many samples of the training set and average the resulting predictions.
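A quick simulation of the $\frac{\sigma^2}{n}$ claim (Uniform(0,1) data, whose variance is 1/12, is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1 / 12  # variance of a single Uniform(0, 1) draw
n = 25

# simulate many sample means of n draws each
means = rng.random((10000, n)).mean(axis=1)
print(means.var(), sigma2 / n)  # both ~0.0033
```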

Examples: Random Forest


In boosting, we allow many weak classifiers (high bias with low variance) to learn from their mistakes sequentially, with the aim that they can correct their high-bias problem while maintaining the low-variance property.

Examples: AdaBoost, GradientBoost


Boosting relies on training several (simple, usually decision-stump) models successively, each trying to learn from the errors of the models preceding it. Boosting differs from other algorithms in that it also assigns weights to training examples (as opposed to linear models, which apply weights to features). So in essence we can weight the scarcer observations more heavily than the more populous ones. Boosting decreases bias and hardly affects variance (unless you are very sloppy). Depending on your n_estimators parameter, you are adding another inner loop to your training step, so the price of AdaBoost is an expensive jump in computational time and memory.


Making data linearly separable by transforming it: “kernel methods”


Maximum margin classifier. Finds the support vectors, the points closest to the decision boundary, and tries to maximize the width of the street between the classes. The objective is convex, so there is a mathematical guarantee of finding the global optimum.

Bias/Variance Tradeoff

If your model is overfitting, then it is high-variance: learning noise. If it is underfitting, it is too biased and not capturing the signal.

Often this is quantified by looking at the Learning Curves:


How to remove variance (overfitting):

How to remove bias (underfitting):

bias/variance visual analogy




ROC curves are appropriate when the observations are balanced between each class


An evaluation metric that considers all possible classification thresholds.

The Area Under the ROC curve is the probability that a classifier will rank a randomly chosen positive example above a randomly chosen negative one, i.e. be more confident that the positive example is positive than that the negative example is.



0th Order Black Box

Only $f(x)$ can be evaluated

1st Order Gradient Based

$\nabla{f(x)}$ Can be evaluated

Vanilla GD has fixed shared learning rates across parameters/dimensions. Later innovations brought momentum. The latest innovations are in adaptive and individual learning rates across dimensions.


Figure: chart from fastAI showing Loss correlation with LR and Batch size. Largest batch is blue which shows the smoothest curve and highest (optimal) learning rate. Purple is smallest batch size and you can see it has the lowest optimal learning rate. — the larger the batch the less noise in the gradient(loss)


(Vanilla) Gradient Descent is very sensitive to step size. In ill-conditioned setting, it’s very useful to have an adaptive step size: in regions of large variability of the gradient we would like to take small step sizes, while on regions with small variability we would like to take large step sizes.


Figure: SGD without momentum vs. SGD with momentum.

Momentum [5] is a method that helps accelerate SGD in the relevant direction and dampens oscillations as can be seen in Image 3. It does this by adding a fraction $\gamma$ of the update vector of the past time step to the current update vector:

where $v_t$ can be seen as an exponentially decaying average of past gradients.
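The update on a 1-d quadratic, as a sketch (the hyperparameter values are arbitrary):

```python
# minimize f(x) = x^2, whose gradient is 2x
x, v = 5.0, 0.0
lr, gamma = 0.1, 0.9
for _ in range(200):
    grad = 2 * x
    v = gamma * v + lr * grad  # decaying accumulation of past gradients
    x = x - v                  # step along the accumulated direction
print(x)  # ~0
```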


Adagrad is an adaptive learning rate method originally proposed by Duchi et al. It adapts the learning rate to the parameters, performing smaller updates (i.e. low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e. high learning rates) for parameters associated with infrequent features. For this reason, it is well-suited for dealing with sparse data.

Independent parameter learning rates.

# Adagrad update
cache += dx**2  # vector initialized to 0 for every parameter
x -= learning_rate * dx / (np.sqrt(cache) + 1e-7)

The cache continues to grow each step which helps to dampen the gradients. However, it will eventually become so large that it will dominate the denominator and cause learning to stop.

The fix for this is to decay the cache, so the system can keep learning: a la RMSProp.


Fixes the vanishing learning rate problem of Adagrad.

In practice, on deep nets Adagrad stops too early and RMSProp wins out.
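The RMSProp update, sketched on the same kind of 1-d quadratic (the decay and learning rate values are arbitrary):

```python
import numpy as np

# minimize f(x) = x^2; the cache now decays, so the step size never fully vanishes
x, cache = 5.0, 0.0
lr, decay, eps = 0.05, 0.9, 1e-7
for _ in range(500):
    dx = 2 * x
    cache = decay * cache + (1 - decay) * dx ** 2
    x -= lr * dx / (np.sqrt(cache) + eps)
print(x)  # near 0 (oscillates in a band on the order of lr)
```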


RMSProp w/ Momentum

In addition to storing an exponentially decaying average of past squared gradients $v_t$ like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients $m_t$, similar to momentum.
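A sketch of the Adam update with its bias-correction terms (the hyperparameters below are the commonly cited defaults, applied to a toy 1-d quadratic):

```python
import numpy as np

# minimize f(x) = x^2 with Adam: momentum (m) plus RMSProp-style scaling (v)
x, m, v = 5.0, 0.0, 0.0
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 1001):
    g = 2 * x
    m = b1 * m + (1 - b1) * g       # decaying average of gradients
    v = b2 * v + (1 - b2) * g ** 2  # decaying average of squared gradients
    m_hat = m / (1 - b1 ** t)       # bias correction for the zero init
    v_hat = v / (1 - b2 ** t)
    x -= lr * m_hat / (np.sqrt(v_hat) + eps)
print(x)  # near 0
```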

Learning Rate Decay

All of these methods have Learning Rate as a hyperparameter. This should always be decayed (step, exponentially)

2nd Order

$\nabla^2{f(x)}$ Can be evaluated

Kfac, Newton, Gauss-Newton, Quasi-Newton, (L)BFGS

Newton’s Method

Newton’s method can be used to find a minimum or maximum of a function $f(x)$.

The derivative is zero at a minimum or maximum, so local minima and maxima can be found by applying Newton’s method to the derivative. The iteration becomes:

$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}$
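A sketch of that iteration on a toy function (the function and starting point are made up):

```python
def newton_minimize(df, d2f, x, n_iter=50):
    """Newton's method applied to f': x <- x - f'(x) / f''(x)."""
    for _ in range(n_iter):
        x = x - df(x) / d2f(x)
    return x

# f(x) = x^4 - 3x^2 has local minima at x = +/- sqrt(3/2)
df = lambda x: 4 * x ** 3 - 6 * x    # f'
d2f = lambda x: 12 * x ** 2 - 6      # f''
x_min = newton_minimize(df, d2f, 2.0)
print(x_min)  # ~1.2247
```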

L-BFGS: batch mode only. Unlike mini-batch SGD, getting L-BFGS to work on mini-batches is tricky and an active area of research.

Neural Networks

Function approximators.

Multi-layer perceptron is a misnomer. It should be multi-layer logistic regression!

B/C of the non-linear activations, we have to solve the non-convex problem iteratively.

Deep Learning

Layers of logistic regressions, separated by non-linearities. Each layer can learn a representation. Layer weights are learnt via back-propagation algorithm.

The non-linearities make it a composition of functions: $f(x)=f_3(f_2(f_1(x)))$ as opposed to linearities which would reduce to 1 function.

Going Deep

In theory, the deeper you go the more expressive power you can achieve; however, there are tradeoffs. Vanishing gradients: back-propagated errors become very small after a few layers, which makes learning ineffective.


CNNs are motivated by 3 factors:

  1. Locality: neighbor relationships in spatial data (images, audio, sequences) are ignored by FFNs (see: local receptive fields below)
  2. Invariance: FFNs are not translation and scale invariant. They will memorize orientation and size. To mitigate this you would have to heavily regularize or do data augmentation of all possible combinations of orientation translations and size scales. (See subsampling below)
  3. Parameter Scale: Fully Connected Networks don’t scale well especially to image data: e.g. [200 x 200 x 3] image, would lead to neurons that have 200x200x3 = 120,000 weights. Multiple layers of this would lead to overfitting due to too many parameters. A CNN layer on the same input volume would only need to learn 10x5x5x3 = 750 parameters! (Assuming 10 [5 x 5] filters over 3 channels ). (See Weight sharing below)

These limitations are remedied in CNNs by:

  1. Local receptive fields
  2. Weight sharing:
  3. subsampling: Pooling



The figure above is a great example of the advantage that weight sharing in CNNs gives. 6 [5 x 5] filters on a [32 x 32 x 3] image give a [28 x 28 x 6] activation map with only 5 x 5 x 3 x 6 = 450 parameters. Compare that to 14M parameters if we flattened the image and did it w/ a fully connected layer!



Receptive Field: receptive field grows linearly w/ CNN layers. Dilated Convs let us grow receptive field exponentially w/ depth (more efficient in terms of params).

Each “A” sees two inputs. Each “B” sees two “As” which have a common input so each “B” has a receptive field of 3. Each B is exposed to exactly 3 inputs. In other words, receptive field grows linearly w/ layers.

Such 1D convnets can be competitive with RNNs on certain sequence-processing problems, usually at a considerably cheaper computational cost. Recently, 1D conv nets, typically used with dilated kernels, have been used with great success for audio generation and machine translation. In addition to these specific successes, it has long been known that small 1D conv nets can offer a fast alternative to RNNs for simple tasks such as text classification and timeseries forecasting.


Like LSTMs, causal convolutions can model sequences with long-term dependencies. This is achieved in two ways: stacking convolutional layers (with padding, every convolutional layer preserves the shape of the input), and dilation: insertion of gaps into the convolutional filters (otherwise known as atrous convolutions). The convolutional filters are applied to the sequence in a left-to-right fashion, emitting a representation at each step. They are causal in that their output at time t is conditional on input up to t-1: this is necessary to ensure that they do not have access to the elements of the sequence we are trying to predict.

Spotlight CNN Sequential model:

The convolution is causal because future states are never part of the convolution’s receptive field; this is achieved by left-padding the sequence.

Dilation: setting this to a number greater than 1 inserts gaps into the convolutional layers, increasing their receptive field without increasing the number of parameters. Stacked dilated convolutions enable networks to have very large receptive fields with just a few layers, while preserving the input resolution throughout the network as well as computational efficiency.

Dilated convolutions AKA atrous convolutions AKA convolutions with holes are another method of increasing the receptive field without angering the gradient gods. When we looked at stacking layers so far, we saw that the receptive field grows linearly with depth. Dilated convolutions let us grow the receptive field exponentially with depth.

Here we have a 3X3 filter applied with a dilation of 1,2 and 3. With a dilation of 1 we have a standard convolution. With a dilation of 2 we apply the same 3X3 filter but use every second pixel

Oord, Aaron van den, et al. “Wavenet: A generative model for raw audio.” arXiv preprint arXiv:1609.03499 (2016).


P262 of the Keras book


CNNs are translation invariant; however, we want them to be equivariant!

One of the goals of subsampling/pooling is to introduce invariance to the model. Meaning it tries to make the neural activities invariant to small viewpoint changes.

Input > Conv > relu > pooling

Subsampling or pooling in convnets loses the spatial relationships in the images. For example, a face isn’t an arbitrary combination of eyes + nose + ears + mouth; it’s a specific arrangement of the pieces.

Pooling throws away information about the precise position of the entity within the region. Pooling reports the activity of the most active neurons and then forgets where they are, which gives you a small amount of translational invariance.

Capsules are also robust against adversarial attacks b/c they learn the distribution of each class. This can be used to check for adversarial input from a different distribution.

Image Classification w/ Capsules

1 capsule per class in the top layer, so most examples only have 10 capsules in that layer (eg MNIST), but ImageNet would have 1000; it is at least O(n) where n is the number of classes. Routing by agreement causes it to be slow to train b/c of the inner loop.


Stateful layers:

Feed characters into layers of RNNs and get final state at end that represents the sequence.

LSTMs (vanishing gradients)

With a basic RNN cell, we see a massive drop in performance when it comes to long sequences and the network needs to remember patterns which have occurred way at the beginning to infer things correctly at a current time step. And this is because of exploding and vanishing gradients.

LSTMs can remember information from the distant past and also selectively forget stuff that is not required.

Residual/Skip Connections

“How do I make my network very deep without losing the gradient signal?”

The general idea is to reduce the distance between the signal coming from the network's loss and each individual layer. The way this is done is by adding a residual/direct connection between every layer and its predecessors. That way, the gradient can flow from each layer to its predecessors directly.

Dilated Causal Convs can possibly replace LSTMs

Causal Convs: layers of Convs that don’t look ahead.

Transfer Learning

Learning in one domain and transferring it to another.

Domain Transfer/Adaptation

One application of this is learning from a simulation.

Applied ML

Recommender Systems

Representation Learning. Most popular Rec Sys these days can be considered Representation Learning approaches. They all learn some dense representation of items and then use that data structure to make predictions. This has a nice auxiliary product of similarity tables; however, it also does not directly model the recommendation task.
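A minimal sketch of the idea, assuming we already have learned user/item embeddings (the vectors below are made-up values, not from any real model): recommendation scores are dot products, and the same item vectors double as a similarity table.

```python
# Recommendation via learned representations: score(user, item) = dot product
# of their dense embeddings. The item vectors here are hypothetical values.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

item_vecs = {"A": [0.9, 0.1], "B": [0.8, 0.2], "C": [0.1, 0.9]}
user_vec = [1.0, 0.0]

scores = {item: dot(user_vec, v) for item, v in item_vecs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['A', 'B', 'C'] -- items closest to the user vector rank first
```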

Cold Users

Next-item models, or models where a user representation isn't just an ID, are more robust to the cold-user problem.

Cold Items

Most (all) latent-factor (LF) architectures are vulnerable to this, and the only real and popular way to deal with it is content filtering or hand-made features.




Supervised task to rerank a list for optimal order. Data can be relevance judgements or log data. Log data gives you implicit ranking preferences from clickstreams. With clickstream data you can either model clicks or rankings.


Ranking Metrics


As you can see, the denominator discounts the relevance more and more as the rank goes down.
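The discounting behavior can be sketched with the standard DCG formula, $\sum_i rel_i / \log_2(i+1)$ for 1-indexed ranks (a minimal sketch of the metric, not a full NDCG implementation):

```python
import math

# DCG sketch: each relevance judgement is divided by log2(rank + 1),
# so items further down the list contribute less to the score.

def dcg(relevances):
    # ranks are 0-indexed here, hence the log2(rank + 2) discount
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

best = dcg([3, 2, 0])    # most relevant item ranked first
worst = dcg([0, 2, 3])   # same judgements, reversed order
print(best > worst)      # True -- the discount rewards putting relevance up top
```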

Unbiased Bandit Data

Approaches that use log data, such as RecSys and LTR, have to deal w/ a bias issue. This data, also called Bandit Data, is generated by a system. Compared to the classical supervised learning approach w/ ground-truth labels, log data is biased by the system from which it was made.


Distributed Representation/Representation Learning are popular across all fields currently. It all started w/ word2vec where items are clustered in an embedding space with respect to how similar they are in some $\text{context}$. These dense representations are then commonly used for downstream tasks like features for classification.


Given a context (the surrounding words), predict the target word; the error propagates back down to the embeddings. (Predicting the target from its context is the CBOW formulation; skip-gram inverts this and predicts the context words from the target.)

Figure: Skip-gram architecture.

The figure above w/ the context “the cat sits on the” and the target “mat” can be generalized into many general use cases. For example Uber Eats learns embeddings where the context is the search query and the target is the item the user converted. For example: “Dan Dan Noodles” and {Item A}
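Generating the (context, target) training pairs from raw text is simple; a sketch with a hypothetical helper and a window of one word on each side:

```python
# Build (context, target) training pairs for a CBOW-style model with a
# symmetric window of `window` words on each side of the target.

def context_target_pairs(tokens, window=1):
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((tuple(context), target))
    return pairs

sent = "the cat sits on the mat".split()
print(context_target_pairs(sent)[-1])  # (('the',), 'mat')
```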

Candidate Sampling

Softmax over 5000 classes is slow.

So if you have 5000 classes, to evaluate the softmax layer you need to do 5000 forward passes through your function $f$ for the different values of $x_i$. Computational complexity: $O(\abs{V})$
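The bottleneck is visible in a plain softmax implementation: the normalizer has to sum over every class score, so one prediction costs $O(\abs{V})$ no matter which class you care about.

```python
import math

# Full softmax: the normalizer z sums over every class score, which is the
# O(|V|) bottleneck that candidate-sampling methods try to avoid.

def softmax(logits):
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)                          # touches all |V| scores
    return [e / z for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(sum(probs))  # 1.0 (up to float error)
```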

This has motivated work into approximate softmax methods often called Candidate Sampling:

Text Classification

Given a sentence, model the probability of its label $P(y\given X)$ where $X$ is a variable-length list of words.

Nice guide here:

Because bag-of-words isn’t an order-preserving tokenization method (the tokens generated are understood as a set, not a sequence, and the general structure of the sentences is lost), it tends to be used in shallow language-processing models rather than in deep-learning models. Extracting n-grams is a form of feature engineering, and deep learning does away with this kind of rigid, brittle approach, replacing it with hierarchical feature learning.


Classic approach which encodes the sentences/observations into a fixed-length feature vector of length $\abs{V}$. The encoding can be simple indicators for a word, or counts, or more popularly tf-idf counts.

Once you have your fixed length feature, you can use any type of classifier.
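A minimal sketch of the count variant (helper names are made up; in practice you would use something like scikit-learn's vectorizers): build a vocabulary, then map each document to a vector of length $\abs{V}$.

```python
# Bag-of-words sketch: each document becomes a fixed-length count vector
# of length |V|, which any off-the-shelf classifier can consume.

def build_vocab(docs):
    return sorted({w for doc in docs for w in doc.split()})

def bow_vector(doc, vocab):
    words = doc.split()
    return [words.count(term) for term in vocab]

docs = ["the cat sat", "the dog sat"]
vocab = build_vocab(docs)
print(vocab)                       # ['cat', 'dog', 'sat', 'the']
print(bow_vector(docs[0], vocab))  # [1, 0, 1, 1] -- word order is gone
```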


Instead of a fixed length feature vector w/ frequency information, we learn/use an embedding for each word/token.


Figure: A sequential Feed-forward MLP(DNN). Note for the 3D embedding layer: Although the diagram is ambiguous, the embeddings for each activated movie are either added or averaged.

One strategy to combine the speed and lightness of convnets with the order-sensitivity of RNNs is to use a 1D convnet as a preprocessing step before an RNN (see figure 6.30). This is especially beneficial when you’re dealing with sequences that are so long they can’t realistically be processed with RNNs, such as sequences with thousands of steps. The convnet will turn the long input sequence into much shorter (downsampled) sequences of higher-level features. This sequence of extracted features then becomes the input to the RNN part of the network.


Anomaly Detection

Isolation Forests: Unsupervised. Overfit a bunch of trees (a forest) and then the average path length to a point is called the anomaly score. Put in a threshold and you have a classifier. The random partitioning, on average, creates shorter paths for anomalies/outliers, which is intuitive b/c it should be hard to separate inliers.
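The isolation intuition can be demonstrated with a toy 1-D version (a sketch of the idea only, not the real algorithm or scikit-learn's `IsolationForest`): random splits strand the outlier on its own side in very few steps, while inliers take many splits to separate from their cluster.

```python
import random

# Toy 1-D isolation tree: recursively split the data at a random point and
# follow the side containing `point`, counting splits until it is alone.

def isolation_path_length(point, data, depth=0):
    if len(data) <= 1 or depth > 20:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    split = random.uniform(lo, hi)
    same_side = [x for x in data if (x < split) == (point < split)]
    return isolation_path_length(point, same_side, depth + 1)

def avg_path(point, data, trials=200):
    # average over many random trees, like a forest's anomaly score
    return sum(isolation_path_length(point, data) for _ in range(trials)) / trials

random.seed(0)
data = [1.0, 1.1, 0.9, 1.05, 0.95, 10.0]   # 10.0 is the outlier
print(avg_path(10.0, data) < avg_path(1.0, data))  # True: outlier isolated in fewer splits
```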



Last edited by Alex Egg, 2019-06-20 12:16:56