Transfer learning is when you take a model that has been trained for one problem and then retrofit it for another problem.
Imagine you have a model with a deep CNN architecture trained to classify images. The last layer of your architecture is a softmax that converts class scores into a confidence distribution. The layer feeding that classifier typically produces a 4096D feature vector (Fig 1) describing the image. The theory is that if you have a good model that generalises well across images, then you can piggyback off it and train a model for a very specific use case.
Figure 1: Deep CNN with 4096D Image Feature Vector
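To make the softmax step concrete, here is a minimal sketch in Python with NumPy. The three scores are made-up values for illustration, not outputs of a real network.

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores into a confidence distribution."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical raw scores for three classes.
scores = np.array([2.0, 1.0, 0.1])
confidences = softmax(scores)

print(confidences)           # non-negative values that sum to 1
print(confidences.argmax())  # index of the most confident class
```

The highest score always maps to the highest confidence, so the softmax changes how the output is read, not which class wins.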
We can visualise the generalisation power of a deep CNN architecture by using a dimensionality reduction technique, such as t-SNE, to project the feature vectors onto a 2D scatter plot.
Figure 2: Demonstration of the generalisation power of a Deep CNN architecture trained on ImageNet. See the clear clustering of categories. Live demo: http://sharknado.eggie5.com/tsne
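The idea of projecting high-dimensional feature vectors down to 2D can be sketched with NumPy alone. Note that this sketch uses PCA, a simpler linear technique, rather than t-SNE (t-SNE itself is available as `sklearn.manifold.TSNE` in scikit-learn). The feature vectors are synthetic stand-ins, and the 64 dimensions and 3 categories are assumptions chosen to keep the example small.

```python
import numpy as np

def project_to_2d(features):
    """Project (n_samples, n_dims) features onto their top two principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)

# Synthetic stand-in for CNN feature vectors:
# 3 hypothetical categories, 50 samples each, 64 dimensions.
centers = rng.normal(scale=5.0, size=(3, 64))
labels = np.repeat(np.arange(3), 50)
features = centers[labels] + rng.normal(size=(150, 64))

points = project_to_2d(features)
print(points.shape)  # (150, 2)
```

Each row of `points` can then be drawn on a scatter plot and coloured by its label to look for clusters.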
To continue the example of Women’s Clothing presented in Figure 1 & 2, if we pass our training set of Women’s Clothing images through the pre-trained Deep CNN we can collect all the outputs as a new training set. Now we can use our new dataset to train a simple classifier to predict the Women’s Clothing class labels.
For example, ImageNet does not include any of these Women’s Clothing labels we’re training on here. However, the kinds of information that make it possible for ImageNet to differentiate among 1,000 classes are also useful for distinguishing other objects. By using this pre-trained network, we are using that information as input to the final classification layer that distinguishes our clothing classes.
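The two steps described above (collect the base model's outputs, then train a simple classifier on them) can be sketched as follows. This is only an illustration under stated assumptions: the "pre-trained CNN" is replaced by a fixed random projection with a ReLU, the "images" are synthetic vectors, and the names `extract_features` and `train_softmax_classifier` are hypothetical. With a real network, only `extract_features` would change.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained CNN: a fixed projection followed by a ReLU.
# These weights are never updated, mirroring a frozen base model.
FROZEN_WEIGHTS = rng.normal(size=(32, 16)) / np.sqrt(32)

def extract_features(images):
    """Map raw inputs to L2-normalised feature vectors using the frozen base model."""
    activations = np.maximum(images @ FROZEN_WEIGHTS, 0.0)
    norms = np.linalg.norm(activations, axis=1, keepdims=True)
    return activations / (norms + 1e-12)

def train_softmax_classifier(features, labels, n_classes, lr=1.0, epochs=500):
    """Fit a linear softmax classifier on the features with gradient descent."""
    n_samples, n_dims = features.shape
    weights = np.zeros((n_dims, n_classes))
    bias = np.zeros(n_classes)
    one_hot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = features @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - one_hot) / n_samples  # gradient of the mean cross-entropy
        weights -= lr * (features.T @ grad)
        bias -= lr * grad.sum(axis=0)
    return weights, bias

# Synthetic "images": 3 hypothetical clothing classes, 40 samples each.
class_means = rng.normal(scale=3.0, size=(3, 32))
labels = np.repeat(np.arange(3), 40)
images = class_means[labels] + rng.normal(size=(120, 32))

# Step 1: pass the training set through the frozen base model.
features = extract_features(images)
# Step 2: train a simple classifier on the collected outputs.
weights, bias = train_softmax_classifier(features, labels, n_classes=3)

predictions = (features @ weights + bias).argmax(axis=1)
accuracy = (predictions == labels).mean()
print(f"training accuracy: {accuracy:.2f}")
```

Because the base model stays frozen, only the small final classifier is trained, which is what makes this approach cheap compared with training the whole network.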
Generalisation power of Deep CNN architectures
The generalisation power of deep CNN architectures was studied in depth in the DeCAF paper. Here are some highlights from their research:
They open with comments about the difficulties of training deep CNN architectures, mostly related to overfitting due to a lack of data:
“With limited training data, however, fully-supervised deep architectures with the representational capacity of (Krizhevsky et al., 2012) will generally dramatically overfit the training data. In fact, many conventional visual recognition challenges have tasks with few training examples; e.g., when a user is defining a category “on-the-fly” using specific examples, or for fine-grained recognition challenges (Welinder et al., 2010), attributes (Bourdev et al., 2011), and/or domain adaptation (Saenko et al., 2010).”
They then state that a lack of data may not be such a hurdle if you can piggyback off a pre-trained model:
“[Deep CNN architectures,] where representations are learned on a set of related problems but applied to new tasks which have too few training examples to learn a full deep representation.”
They then formalise the solution as Transfer Learning and introduce an analogy to human learning:
“We investigate the “supervised pre-training” approach proven successful in computer vision and multimedia settings using a concept-bank paradigm (Kennedy & Hauptmann, 2006; Li et al., 2010; Torresani et al., 2010) by learning the features on large-scale data in a supervised setting, then transferring them to different tasks with different labels.”
“This also aligns with the philosophy of supervised transfer: one may view the trained model as an analog to the prior knowledge a human obtains from previous visual experiences, which helps in learning new tasks more efficiently.”
They then show how feature vectors from the second-to-last layer of deep CNN architectures trained on ImageNet can be used as input to a new classifier trained on different class labels. This has two advantages: it allows you to skip training a deep network from scratch, and it allows you to leverage powerful pre-trained models.
Figure 3: Panels (c) and (d) from the DeCAF paper show the clustering of ImageNet categories in 2 dimensions, where (c) uses features from the first pooling layer and (d) uses features from the penultimate layer. The clearly defined clusters in (d) show how the first layers learn low-level features, whereas the later layers learn semantic, higher-level features.
They then go on to describe the flexibility of the transfer method and how the base model’s training data doesn’t need to be closely related to the new training data. For example, DeCAF performed well on scene recognition (SUN397) even though the original architecture was trained on an unrelated dataset (ImageNet), which shows its generalisation power.
They also highlight again that the new training data can be very sparse if the base model generalises well. For example, in some cases categories with only 1 training example still achieve useful results:
“Our one-shot learning results (e.g., 33.0% for SVM) suggest that with sufficiently strong representations like DeCAF, useful models of visual categories can often be learned from just a single positive example.”
Figure 4: Eval scores for DeCAF Transfer Learning model
Figure 4 presents evaluation results for two different types of models trained on Caltech-101 images, with feature vectors extracted from 3 different parts of the CNN architecture. The SVM and logistic regression trained on features from the penultimate layer of the deep CNN performed on par with each other.
Figure 5 also shows a very interesting result: the transfer learning model can make meaningful inferences with only 1 example per class, whereas standard ML models without transfer learning can’t produce meaningful results until they have at least 10 examples per class:
Figure 5: Number of examples per class needed to get inference results. Transfer learning models can start with only 1 example per class.
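To illustrate why a strong feature space makes learning from a single example per class plausible, here is a small sketch. It is not the DeCAF setup: the paper reports SVM and logistic regression results, whereas this sketch swaps in a nearest-neighbour rule with cosine similarity, and the feature vectors are synthetic stand-ins in which each class already forms a tight cluster. The function name `cosine_nearest_class` and all sizes are assumptions for illustration.

```python
import numpy as np

def cosine_nearest_class(query, exemplars):
    """Return the index of the exemplar with the highest cosine similarity to the query."""
    q = query / np.linalg.norm(query)
    e = exemplars / np.linalg.norm(exemplars, axis=1, keepdims=True)
    return int(np.argmax(e @ q))

rng = np.random.default_rng(1)

# Synthetic stand-in for a good feature space: 5 hypothetical classes in 64-D.
class_centers = rng.normal(size=(5, 64))

# One labelled example per class (the "one shot"); row i belongs to class i.
exemplars = class_centers + 0.3 * rng.normal(size=(5, 64))

# Unseen test points: 20 per class.
test_labels = np.repeat(np.arange(5), 20)
test_features = class_centers[test_labels] + 0.3 * rng.normal(size=(100, 64))

predictions = np.array([cosine_nearest_class(f, exemplars) for f in test_features])
accuracy = (predictions == test_labels).mean()
print(f"one-shot accuracy on synthetic features: {accuracy:.2f}")
```

The accuracy here says nothing about real images. It only shows that when the features already separate the classes, a single labelled example per class is enough to place a new point.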
Case Study: Image Sentiment Analysis
Campos et al. compare an image sentiment model based on transfer learning from AlexNet to a custom deep CNN architecture trained specifically for sentiment analysis. The results surprisingly show that the general-purpose AlexNet model outperforms the sentiment-specific model:
“Surprisingly, fine-tuning a net that was originally trained for object recognition reported higher accuracy in visual sentiment prediction than a CNN that was specifically trained for that task.”
One possible explanation for this is the importance of high-level representations, such as visual semantics, gained through transfer learning from models trained on ImageNet. On the other hand, it could reflect the importance of a large number of convolutional layers, which the study doesn’t explore.
You et al., Robust Image Sentiment Analysis Using Progressively Trained and Domain Transferred Deep Networks, AAAI, Jan 2015
Campos, Salvador, Diving Deep into Sentiment: Understanding Fine-tuned CNNs for Visual Sentiment Prediction, Aug 2015