Paper Review: Visualizing the Loss Landscape of Neural Nets

Alex Egg,

Li, Hao, et al. “Visualizing the Loss Landscape of Neural Nets.” arXiv preprint arXiv:1712.09913 (2017).

First of all, I don't fully understand the method they use to visualize the loss functions, filter normalization, and I didn't put much effort into understanding it; I'm more interested in the inferences that can be drawn from the resulting plots.
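That said, my rough understanding is that they sample a random Gaussian direction with the same shape as the trained weights, then rescale each filter of that direction to have the same norm as the corresponding filter of the weights, so the plots aren't distorted by scale differences between layers. Below is a minimal PyTorch sketch of that idea; it's my own paraphrase, not the authors' code, and the handling of biases/BatchNorm parameters is a guess.

```python
import torch

def filter_normalized_direction(model):
    """Random direction d where each filter d_ij is rescaled to the norm of
    the corresponding filter theta_ij of the trained weights (my reading of
    the paper's filter normalization, not the authors' code)."""
    direction = []
    for p in model.parameters():
        d = torch.randn_like(p)
        if p.dim() > 1:  # conv / fully-connected weights: rescale per output filter
            for d_filt, p_filt in zip(d, p.detach()):
                d_filt.mul_(p_filt.norm() / (d_filt.norm() + 1e-10))
        else:            # biases, BatchNorm params: zeroed here (an assumption)
            d.zero_()
        direction.append(d)
    return direction
```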

Batch Sizes

TLDR: Smaller (128) is better than larger (8192).

They came up w/ a clever way to visualize loss functions and use it to compare the loss surfaces of convolutional networks trained with small vs large batches and with vs without weight decay.

[Figure from p. 7 of the paper: filter-normalized loss plots comparing small-batch and large-batch minimizers (Figures 4 and 5)]

Using the filter-normalized plots in Figures 4 and 5, we can make side-by-side comparisons between minimizers, and we see that now sharpness correlates well with generalization error. Large batches produced visually sharper minima (although not dramatically so) with higher test error. Interestingly, the Adam optimizer attained larger test error than SGD, and, as predicted, the corresponding minima are visually sharper. Results of a similar experiment using ResNet-56 are presented in the Appendix (Figure 12).

This goes against conventional wisdom, or at least textbook theory, which typically states that the larger your batches, the smoother and more accurate your gradient estimates. However, the authors go on to say that the small-batch argument is moot if you turn on weight decay; in that case, large batches are just as good as small ones:

However, this sharpness balance can be flipped by simply turning on weight decay…. This time, the large batch minimizer is considerably flatter than the sharp small batch minimizer.
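In practice, "turning on weight decay" is just the weight_decay argument to the optimizer; the model and hyperparameter values below are placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in model, not one of the paper's networks
# L2 weight decay is enabled via the weight_decay argument (values are placeholders)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
```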

The authors also note that, in general, a smaller batch size of around 128 tends to generalize well in most cases.
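For concreteness, the "sharpness" being compared is just how quickly the loss rises as you step away from the trained weights along a filter-normalized direction. Here is a hedged sketch of that 1D slice; the function and argument names are mine, and it evaluates the loss on a single batch for simplicity.

```python
import torch

def loss_along_direction(model, direction, loss_fn, inputs, targets, alphas):
    """Evaluate L(theta + alpha * d) for each alpha: a 1D slice of the loss
    around the trained weights theta, used to eyeball how sharp the minimum is."""
    base = [p.detach().clone() for p in model.parameters()]
    losses = []
    with torch.no_grad():
        for alpha in alphas:
            for p, p0, d in zip(model.parameters(), base, direction):
                p.copy_(p0 + alpha * d)
            losses.append(loss_fn(model(inputs), targets).item())
        for p, p0 in zip(model.parameters(), base):  # restore theta
            p.copy_(p0)
    return losses
```

A curve that stays flat around alpha = 0 is the "flat minimum" the paper associates with lower generalization error; one that shoots up quickly is a "sharp" one.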

Skip-connections to the rescue

TLDR: skip-connections help smooth the loss, especially for networks past 20 layers

The authors then go on to describe how chaotic loss surfaces (and thus poor gradients) are a product of network depth, and how skip-connections (a la ResNet) help mitigate that. They take the award-winning VGG architecture from its original 16/19 layers to 56 and finally to 110 layers and show how chaotic the surface becomes. They then add skip-connections and show how much cleaner the surface is.

[Figure from p. 9 of the paper: loss surfaces of deep networks with and without skip connections]

Interestingly, the effect of skip connections seems to be most important for deep networks. For the more shallow networks (ResNet-20 and ResNet-20-noshort), the effect of skip connections is fairly unnoticeable. However residual connections prevent the explosion of non-convexity that occurs when networks get deep.
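For reference, a skip-connection just adds a block's input back onto its output, so the signal (and its gradient) has an identity path around the convolutions. A minimal PyTorch sketch of a ResNet-style block (simplified: same channel count in and out, no downsampling):

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """ResNet-style block: output = relu(conv path + identity skip)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # the skip-connection: add the input back
```

Dropping the `+ x` term is essentially what the "noshort" variants in the quote above do.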

Wide networks

They then go on to show that wide networks (networks w/ a relatively large number of convolutional filters per layer) also have geometrically better-behaved loss surfaces:

[Figure from p. 10 of the paper: loss surfaces of networks with increasing width]
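"Wide" here just means more filters per layer at the same depth. A toy sketch of what widening a small conv stack looks like; the `widen_factor` knob and channel counts are illustrative, not the paper's exact architectures.

```python
import torch.nn as nn

def conv_stack(widen_factor=1):
    """Same depth, more filters per layer: 'width' is the per-layer channel
    count scaled by widen_factor (illustrative only, not the paper's models)."""
    base_channels = [16, 32, 64]  # e.g. ResNet-style stages
    channels = [c * widen_factor for c in base_channels]
    layers, in_ch = [], 3
    for out_ch in channels:
        layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
        in_ch = out_ch
    return nn.Sequential(*layers)

narrow = conv_stack(widen_factor=1)  # fewer filters per layer
wide = conv_stack(widen_factor=4)    # same depth, 4x the filters
```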

Conclusions

I wish they had done this analysis on Inception. I'm not sure why they would skip it, considering it's one of the most popular CNN architectures.

I wish the authors had made more concrete conclusions, but what I could infer was:

- Smaller batch sizes (around 128) generalize better than very large ones (8192), unless weight decay is turned on, in which case large batches are competitive.
- Skip-connections keep the loss surface smooth, and they matter most once networks get deep (past roughly 20 layers).
- Wider networks (more convolutional filters per layer) have smoother, better-behaved loss surfaces.
