Zhang, Chiyuan, et al. “Understanding deep learning requires rethinking generalization.” arXiv preprint arXiv:1611.03530 (2016).
After discussion at our reading group, these are the key points:
- If the DCNN is learning some deep semantic representation, then why can it still fit the training data when the labels are randomised? It seems like it is just memorising the dataset b/c it has a large number of parameters.
- If DCNNs are memorising the dataset, how do they then generalise at all? The authors explore regularisation as a possible answer, but it turns out a regularised model still fits the random labels.
In this paper, the authors look at generalization in DNNs.
There is currently no real viable theory that explains why deep nets generalize well. We do know their generalization ability is not based on limiting the number of parameters (as in traditional ML). It’s common to use more parameters than data points, even on very small datasets. For example, MNIST has 60,000 training examples, but most MNIST classifier models have millions of parameters: a fully connected hidden layer with 1,000 inputs and 1,000 outputs has one million weights in that layer alone.
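To make the $p \gg n$ point concrete, here is a quick parameter count for a hypothetical two-hidden-layer MLP on MNIST (the layer sizes are illustrative, not a model from the paper):

```python
# Parameter count of a small fully connected MNIST classifier.
# Layer sizes are illustrative, not taken from the paper.
layer_sizes = [784, 1000, 1000, 10]  # input -> hidden -> hidden -> logits

params = 0
for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
    params += fan_in * fan_out + fan_out  # weights + biases

n_train = 60_000  # MNIST training set size
print(f"parameters: {params:,}")               # 1,796,010
print(f"p / n ratio: {params / n_train:.1f}")  # ~29.9
```

Even this modest network has roughly 30 parameters per training example, so counting parameters clearly cannot be the mechanism that controls generalization.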
They find that if you take a dataset on which a DNN generalised well and retrain the network with completely random labels (instead of the original true labels), the training error still drops to zero: the network fits the noise perfectly. They also highlight an important discrepancy with traditional ML theory, which assumes $p \ll n$ (where $p$ is dimensionality/parameters and $n$ is the number of observations), whereas in DL applications the number of parameters typically vastly outnumbers the training examples, $p \gg n$; for example, most popular DCNNs have over 8M parameters. This observation leads the authors to surmise that DNNs have the capacity to simply memorise the dataset. You may say that is an obvious argument: we all know these DNNs are very flexible models that will learn the noise and overfit, and that’s why we use regularisation. However, the authors go on to note that regularisation doesn’t play the traditional role in DL that we think it does.
One would expect that if one randomized the training labels, learning wouldn’t happen, or it would slow down and never converge. However, the results in the paper show otherwise. This has a few profound implications:
- The effective capacity of neural networks is sufficient for memorizing the entire data set.
- Even optimization on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels.
- Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged.
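The last point is easy to see in code: randomizing labels touches only $y$, so the inputs and all of their statistics are untouched. A minimal numpy sketch (not the paper’s pipeline; the paper samples new labels uniformly at random, while permuting the existing labels is used here so the label marginals are preserved exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))     # stand-in "images"
y = rng.integers(0, 10, size=100)  # original labels

y_random = rng.permutation(y)      # randomize by permuting the labels

# The inputs are untouched, and the label histogram is identical;
# only the pairing between x and y has been destroyed.
assert np.array_equal(np.bincount(y, minlength=10),
                      np.bincount(y_random, minlength=10))
print("inputs unchanged, label marginals preserved")
```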
Figure 1: Shows convergence on the training set for random labels. An interesting observation is that random pixels in (a) converge more quickly than the original pixels! Maybe b/c random pixels are better separated than the original distributions.
In the traditional ML school of thought, regularisation is motivated by Occam’s razor, which tells us simpler is better, and simpler in ML means fewer parameters or parameters of smaller magnitude. Regularisation in DL is manifest in multiple ways, including weight decay and dropout. However, even w/ regularisation the DL model still fits the random labels, which says regularization isn’t what controls the generalization error and it is not working as expected in the traditional sense:
In contrast with classical convex empirical risk minimization, where explicit regularization is necessary to rule out trivial solutions, we found that regularization plays a rather different role in deep learning. It appears to be more of a tuning parameter that often helps improve the final test error of a model, but the absence of all regularization does not necessarily imply poor generalization error. As reported by Krizhevsky et al. (2012), L2-regularization (weight decay) sometimes even helps optimization, illustrating its poorly understood nature in deep learning.
Figure 1: You can see that regularisation helps the test accuracy. However, the more important point is that the model still generalises w/o it. So what does regularisation do?
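As a reminder of what explicit regularisation does in the traditional sense, here is a minimal sketch of weight decay (L2 penalty) on least squares, pure numpy and illustrative only: the penalty shrinks the weight norm, but as the paper shows, turning this knob is not what prevents memorisation in deep nets.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
y = rng.normal(size=50)

def fit(lam, lr=0.01, steps=2000):
    """Gradient descent on 0.5*||Xw - y||^2 + 0.5*lam*||w||^2."""
    w = np.zeros(10)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) + lam * w  # data gradient + decay term
        w -= lr * grad
    return w

w_plain = fit(lam=0.0)
w_decay = fit(lam=10.0)

# The penalty's only visible effect here: a smaller-norm solution.
assert np.linalg.norm(w_decay) < np.linalg.norm(w_plain)
print(np.linalg.norm(w_plain), np.linalg.norm(w_decay))
```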
I think the main message of the paper is not to show “that generalization is bad for a problem with random labels”, but rather that training accuracy can be just as good on a problem with randomized labels. While it is interesting and thought-provoking, I also think it is not hugely surprising or groundbreaking. I could be wrong, but it seems the authors did not fully debunk one candidate explanation (if they did, please definitely comment and let me know!): that implicit regularization by SGD plays an important role in the selection of generalizable models over their counterparts. Namely, at the same cost, models with smaller norm are easier to “reach” under SGD training, and that’s why current SGD-based training produces generalizable models even when they could also memorize.
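That candidate explanation can be illustrated in a linear setting (which the paper itself analyses for SGD): on an underdetermined least-squares problem with $p \gg n$, gradient descent initialised at zero converges to the minimum-ℓ2-norm solution among the infinitely many solutions that fit the data perfectly. A minimal numpy sketch, assuming plain full-batch gradient descent rather than SGD:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20  # far fewer examples than parameters (p >> n)
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Gradient descent on 0.5*||Xw - y||^2, initialised at zero.
# Every update X.T @ (...) lies in the row space of X, so the iterate
# never leaves that subspace and lands on the min-norm interpolant.
w = np.zeros(d)
for _ in range(20000):
    w -= 0.01 * X.T @ (X @ w - y)

w_min_norm = np.linalg.pinv(X) @ y  # minimum-norm interpolating solution

assert np.allclose(X @ w, y, atol=1e-6)       # fits the data exactly...
assert np.allclose(w, w_min_norm, atol=1e-6)  # ...and it is the min-norm fit
print("gradient descent found the minimum-norm interpolant")
```

So among all the models that “memorize” this toy dataset, the optimizer implicitly picks out the smallest-norm one without any explicit regularizer in the loss.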
Figure 2: Goes on to show how early stopping is a necessary technique to generalise well. It also shows that the learning curves are much smoother w/ batch-norm models, i.e. Inception.
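Early stopping in that sense is just patience-based monitoring of the validation loss. A minimal sketch (a hypothetical helper, not code from the paper):

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch at which training stops: the first epoch where
    the validation loss has not improved for `patience` epochs."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0  # new best: reset the patience counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1  # patience never exhausted: ran all epochs

# Validation loss bottoms out at epoch 2, then starts rising:
print(early_stop_epoch([1.0, 0.8, 0.7, 0.75, 0.9, 0.95, 1.1]))  # -> 5
```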
In summary regarding regularization in DCNNs:
In summary, our observations on both explicit and implicit regularizers are consistently suggesting that regularizers, when properly tuned, could help to improve the generalization performance. However, it is unlikely that the regularizers are the fundamental reason for generalization, as the networks continue to perform well after all the regularizers are removed.