resolving some neural network mysteries

Link post

Here are some things about neural networks that I used to find puzzling but now feel that I have adequate explanations for. The theory behind these answers didn’t start to be understood until well after the correct things to do were found by chance or blind imitation of brains.

Why is good optimization possible?

Neural networks typically deal with “non-convex” optimization problems. Traditionally, using gradient descent for those was considered impractical, because it would rapidly get stuck in local minima. That was part of the motivation for evolutionary approaches.

Why, then, are neural networks trainable by gradient descent? Because if you add enough extra dimensions, non-convex problems become convex. Empirically, with massive overparameterization, the energy landscape tends to have many saddle points but few local minima. Showing theoretical convergence guarantees for overparameterized networks is a recent and ongoing research topic; see eg this.

As I previously noted, this is why sparse networks from iterative magnitude pruning have good performance, but sparse networks generally can’t be trained from scratch as well as dense networks.

This also explains some “thresholds” of neural network performance vs size: when overparameterization is proportional to problem non-convexity, good training becomes possible and performance improves significantly.

Why is generalization possible?

Adding enough free variables can turn non-convex problems into convex ones. Why didn’t people just do that in the past, then? Because far before you get to that point, the extra free variables led to overfitting that reduced test performance. People tried simple regularization like neural networks use, and that was completely inadequate.

Overparameterized neural networks can learn random data. Why do networks with fairly simple regularization tend to generalize?

Distance in neural network latent spaces being meaningful is basically the main useful thing about neural networks. Another phrasing of the above question is: Why is distance in latent spaces meaningful for latent space points not in the training set?

A few years back, some people noticed that neural network activation functions have spectral bias. With the types of activation functions used, low-frequency relationships are fit more quickly than high-frequency ones. That causes latent space relationships to be preferentially fit in such a way that point distances are related to point similarity. This can then be tuned by simple regularization: if you have spectral bias and balance learning rate vs regularization globally, you can control the frequency range learned.

An obvious way to test this theory of neural network generalization is to find some activation functions with relatively low spectral bias, and see how they perform. This paper tries a “hat” activation function, and finds that loss on the training set goes down much faster but test accuracy is much worse. This paper does some relevant tests on spectral bias.

It’s known that there is no universal best activation function. The optimal choice varies with:

different problem types
regularization settings
layer depth

I think using different activation functions for different depths is a semi-common technique at large AI labs now. This spectral bias framework can explain variations in relative performance of activation functions as spectral bias matching.

Why not mixed activation functions?

There are some reasons to think mixing activation functions in the same layer would be better:

Neural networks have many equivalent permutations of their variables. By mixing different activation functions in the same later, fewer permutations would be equivalent, which increases expressive power.
Using multiple activation functions could allow data to be fit more naturally, reducing the amount of inefficiently approximating functions with other functions.
Brains have mixed activation functions—per synapse, which is like having an activation function per weight.

Yet, empirically, mixing activation functions in the same layer tends to give slightly worse performance, and optimizing an activation mix for one situation doesn’t give good results for other situations. Why? My working theory is as follows:

Training involves shifting of data representation between neurons. When neurons have different activation functions, they’re less compatible so such shifting is harder.
Overfitting is based on the lowest spectral bias among the activation functions, so even a fraction of neurons having activation functions with low spectral bias is bad.

The greater expressivity of mixed activation functions then partially cancels out those disadvantages, and the result is performance that’s generally slightly worse.