In this paper, the authors look at generalization in DNNs. They find that if you take a dataset on which a DNN generalised well and then retrain the network with completely random labels (instead of the original true labels), the training error still goes to zero, even though there is nothing meaningful left to learn and the test error can be no better than chance. They also highlight an important discrepancy with traditional ML theory, which typically assumes $p \ll n$ (where $p$ is the dimensionality/number of parameters and $n$ is the number of observations): in DL applications the number of parameters typically vastly outnumbers the training examples, $p \gg n$; for example, in most popular DCNNs there are over 8M parameters. This observation leads the authors to surmise that DNNs have the capacity to simply memorise the dataset. You may say that is an obvious argument: we all know these DNNs are very flexible models that will learn the noise and overfit, and that’s why we use regularisation. However, the authors go on to note that regularisation doesn’t play the traditional role in DL that we think it does.
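To make the memorisation point concrete, here is a minimal sketch of the $p \gg n$ situation. It is my own toy example and not the paper's experiment: a plain linear model stands in for the DNN, the inputs are Gaussian noise, and the names `make_data`, `fit`, `accuracy` and `run_experiment` are all made up for illustration. With $p = 200$ parameters and $n = 20$ examples, gradient descent on the squared loss fits labels that are pure coin flips.

```python
import random


def make_data(n, p, rng):
    """Gaussian inputs with +/-1 labels drawn independently of the inputs."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [rng.choice([-1.0, 1.0]) for _ in range(n)]
    return X, y


def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def fit(X, y, lr=0.05, steps=200):
    """Full-batch gradient descent on the mean squared error, starting at w = 0."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(steps):
        residuals = [predict(w, x) - t for x, t in zip(X, y)]
        for j in range(p):
            grad_j = sum(r * x[j] for r, x in zip(residuals, X)) / n
            w[j] -= lr * grad_j
    return w


def accuracy(w, X, y):
    """Fraction of examples whose predicted sign matches the label."""
    hits = sum((predict(w, x) > 0) == (t > 0) for x, t in zip(X, y))
    return hits / len(y)


def run_experiment(seed=0, n=20, p=200, n_test=200):
    rng = random.Random(seed)
    X_train, y_train = make_data(n, p, rng)
    X_test, y_test = make_data(n_test, p, rng)  # fresh random labels
    w = fit(X_train, y_train)
    return accuracy(w, X_train, y_train), accuracy(w, X_test, y_test)


train_acc, test_acc = run_experiment()
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The training accuracy comes out perfect because with more parameters than examples the model can interpolate any labelling, while the test accuracy hovers around chance because the labels carry no signal. The paper's claim is the much stronger one that real DNNs trained with SGD on real image datasets behave the same way.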
In the traditional ML school of thought, regularisation is motivated by Occam’s razor, which tells us simpler is better, where simpler in ML means fewer parameters $p$ or parameters of smaller magnitude. Regularisation in DL shows up in multiple forms, including dropout. However, even with regularisation, the DL model still fits the random labels, which suggests that regularisation is not what drives the generalization error and is not working as expected in the traditional sense:
“In contrast with classical convex empirical risk minimization, where explicit regularization is necessary to rule out trivial solutions, we found that regularization plays a rather different role in deep learning. It appears to be more of a tuning parameter that often helps improve the final test error of a model, but the absence of all regularization does not necessarily imply poor generalization error. As reported by Krizhevsky et al. (2012), L2-regularization (weight decay) sometimes even helps optimization, illustrating its poorly understood nature in deep learning.”
I think the main message of the paper is not to show “that generalization is bad for a problem with random labels”, but rather that training accuracy can be just as good for a problem with randomized labels. While this is interesting and thought-provoking, I also think it is not super surprising or groundbreaking. I could be wrong, but it seems the authors did not fully debunk one explanatory theory (if they did, please comment and let me know!): that implicit regularization by SGD plays an important role in selecting generalizable models over their counterparts. Namely, at the same cost, models with smaller norm are easier to “reach” under SGD training, and that’s why current SGD-based training produces generalizable models even though the same networks can also memorize.
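The “smaller norm is easier to reach” idea can be illustrated in the simplest setting I know of, underdetermined linear least squares. This is a sketch under strong assumptions: full-batch gradient descent rather than SGD, a linear model rather than a DNN, and hypothetical helper names (`gradient_descent`, `compare_inits`). Every weight vector that fits the training data has the same zero cost, but gradient descent started from zero only ever moves within the span of the training inputs, so it lands on a smaller-norm solution than the same procedure started from a random point.

```python
import math
import random


def gradient_descent(X, y, w0, lr=0.05, steps=300):
    """Full-batch gradient descent on the mean squared error from init w0."""
    n, p = len(X), len(w0)
    w = list(w0)
    for _ in range(steps):
        residuals = [sum(wj * xj for wj, xj in zip(w, x)) - t
                     for x, t in zip(X, y)]
        for j in range(p):
            w[j] -= lr * sum(r * x[j] for r, x in zip(residuals, X)) / n
    return w


def norm(w):
    return math.sqrt(sum(v * v for v in w))


def max_abs_residual(X, y, w):
    """Largest absolute training error; close to 0 means the data is interpolated."""
    return max(abs(sum(wj * xj for wj, xj in zip(w, x)) - t)
               for x, t in zip(X, y))


def compare_inits(seed=0, n=20, p=200):
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    w_zero = gradient_descent(X, y, [0.0] * p)
    w_rand = gradient_descent(X, y, [rng.gauss(0.0, 1.0) for _ in range(p)])
    return {
        "zero_init": (norm(w_zero), max_abs_residual(X, y, w_zero)),
        "random_init": (norm(w_rand), max_abs_residual(X, y, w_rand)),
    }


for name, (w_norm, resid) in compare_inits().items():
    print(f"{name}: norm = {w_norm:.3f}, max training residual = {resid:.1e}")
```

Both runs fit the training set essentially exactly, yet the zero-initialised one ends up with a much smaller norm. Whether an analogous bias explains generalization in deep networks is exactly the open question raised above; this toy case only shows what the claim would mean.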