Li, Hao, et al. “Visualizing the Loss Landscape of Neural Nets.” arXiv preprint arXiv:1712.09913 (2017).
First of all, I don’t understand Filter Normalization, the method they use to visualize the loss functions, and I didn’t take the effort to understand it. I’m more interested in making inferences from the resulting plots.
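That said, the basic recipe seems simple enough to sketch from a skim: each filter of a random direction gets rescaled to match the norm of the corresponding filter in the trained weights, so the plot isn’t distorted by scale-invariance in the network. A rough numpy sketch of my understanding (my own reconstruction, not the authors’ code):

```python
import numpy as np

def filter_normalize(direction, weights):
    """Rescale each filter of a random direction so its norm matches
    the norm of the corresponding filter in the trained weights.
    Both arrays have shape (num_filters, ...)."""
    normed = np.empty_like(direction)
    for i in range(direction.shape[0]):
        d, w = direction[i], weights[i]
        normed[i] = d / (np.linalg.norm(d) + 1e-10) * np.linalg.norm(w)
    return normed

# toy conv layer weights: (filters, channels, height, width)
weights = np.random.randn(8, 3, 3, 3)
direction = np.random.randn(*weights.shape)  # random direction, same shape
d_norm = filter_normalize(direction, weights)
```

After normalization, each filter slice of `d_norm` has the same norm as the matching filter in `weights`, which is the whole point of the trick.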
TLDR: Smaller (128) is better (than 8192).
They came up w/ a clever way to visualize loss functions. They compare loss surfaces of convolutional networks trained with small vs large batches & weight decay vs no weight decay.
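Their visualization boils down to evaluating the loss on a 2D slice through the trained weights along two random (filter-normalized) directions: L(a, b) = loss(theta + a·delta + b·eta). A toy numpy sketch of that idea, with made-up names and a quadratic stand-in loss (not the authors’ code):

```python
import numpy as np

def loss_surface(loss_fn, theta, delta, eta, span=1.0, steps=5):
    """Evaluate loss on a 2D slice around the minimizer theta:
    surface[i, j] = loss(theta + alphas[i]*delta + betas[j]*eta)."""
    alphas = np.linspace(-span, span, steps)
    betas = np.linspace(-span, span, steps)
    surface = np.empty((steps, steps))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            surface[i, j] = loss_fn(theta + a * delta + b * eta)
    return surface

# toy example: a quadratic "loss" whose minimizer is theta* = 0
theta = np.zeros(10)
delta, eta = np.random.randn(10), np.random.randn(10)
grid = loss_surface(lambda w: float(np.sum(w ** 2)), theta, delta, eta)
```

The resulting grid is what gets rendered as the contour/surface plots in the paper; the center of the grid sits exactly at the trained minimizer.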
Using the filter-normalized plots in Figures 4 and 5, we can make side-by-side comparisons between minimizers, and we see that now sharpness correlates well with generalization error. Large batches produced visually sharper minima (although not dramatically so) with higher test error. Interestingly, the Adam optimizer attained larger test error than SGD, and,
as predicted, the corresponding minima are visually sharper. Results of a similar experiment using ResNet-56 are presented in the Appendix (Figure 12).
This goes against conventional wisdom, or at least textbook theory, which typically states that the larger your batches, the smoother and more accurate your gradients. However, the authors go on to say that the small-batch argument is moot if you turn on weight decay; in that case, large batches are just as good as small:
However, this sharpness balance can be flipped by simply turning on weight decay…. This time, the large batch minimizer is considerably flatter than the sharp small batch minimizer.
The authors go on to say that, in general, a smaller batch size around 128 tends to generalize well across most cases.
Skip-connections to the rescue
TLDR: skip-connections help smooth the loss surface, especially for networks deeper than 20 layers
The authors then go on to describe how chaotic loss surfaces are a product of network depth and how skip-connections (a la ResNet) help mitigate that. They take the award-winning VGG architecture from its original 16/19 layers to 56 and finally to 110 layers and show the chaotic surface. They then add skip-connections and show how the surface becomes much cleaner.
Interestingly, the effect of skip connections seems to be most important for deep networks. For the more shallow networks (ResNet-20 and ResNet-20-noshort), the effect of skip connections is fairly unnoticeable. However residual connections prevent the explosion of non-convexity that occurs when
networks get deep.
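The intuition behind why skip-connections help is easy to see even in a toy setting: a residual layer computes y = x + F(x), so the identity path carries the signal through no matter how badly behaved F is, while a plain deep stack can destroy it. A tiny numpy illustration (mine, not from the paper; the "layers" here are just linear maps + ReLU standing in for convolutions):

```python
import numpy as np

def layer(x, w):
    # one toy "layer": linear map + ReLU (a stand-in for conv + activation)
    return np.maximum(0.0, w @ x)

def plain_forward(x, weights):
    # plain deep stack: each layer's output feeds the next
    for w in weights:
        x = layer(x, w)
    return x

def residual_forward(x, weights):
    # skip-connections a la ResNet: y = x + F(x), the input bypasses
    # each layer and is added back
    for w in weights:
        x = x + layer(x, w)
    return x

# degenerate case: 56 layers of all-zero weights. The plain net
# collapses the signal to zero; the residual net passes it through.
x = np.arange(4.0)
zeros = [np.zeros((4, 4)) for _ in range(56)]
print(plain_forward(x, zeros))     # all zeros
print(residual_forward(x, zeros))  # x unchanged: [0. 1. 2. 3.]
```

Obviously trained networks aren’t all-zero, but the same identity path is what keeps gradients (and the loss surface) well-behaved as depth grows.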
They then go on to show that wide networks, i.e. networks w/ a relatively large number of convolutional filters, also have geometrically better loss surfaces:
I wish they did analysis on Inception. Not sure why they would skip it considering it’s one of the most popular CNN architectures.
I wish the authors made more concrete conclusions, but what I could infer was:
- correlation between batch size and convexity: a larger batch size (8192) is worse, while a smaller one (128) is generally better. Who can train with a batch size of 8192 anyways?! haha
- deeper than 20 layers and the loss surface becomes chaotic, UNLESS you use skip-connections a la ResNet architectures
- correlation between width and convexity: the more convolutional filters you have, the better the loss surface