I am affiliated with ARC and played a major role in the MLP stuff.
The particular infinite sum discussed in this post is used for approximating MLPs with just one hidden layer, so things like vanishing gradients can’t matter.
We are now working on deeper MLPs, and there the vanishing-gradients story does seem relevant. We don't fully understand every detail yet, but I'll mouth off anyway.
On the one hand, there are hyperparameter choices where gradients explode. It turns out that in this regime, matching sampling is easy: exponentially large gradients mean that the average is a sum of contributions from exponentially many tiny, more-or-less independent regions of input space, so it's exponentially small, and you can beat sampling by just answering zero until your compute budget is itself exponential. Even then you might not have the budget to go through all of input space, but we have ideas about what you could do.
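To make the "just answer zero" point concrete, here's a toy numerical sketch (mine, not ARC's estimator; the numbers mu and sigma are made up): if the per-input values have a mean mu that is tiny relative to their spread sigma, a Monte Carlo estimate from n samples has typical error about sigma/sqrt(n), which only beats the constant guess of zero (error |mu|) once n is on the order of (sigma/mu)^2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the exploding-gradient regime: per-input values that are
# roughly independent and order-1 in magnitude, with a mean that is tiny
# relative to their spread. (We simulate such values directly; mu and sigma
# are arbitrary illustrative numbers.)
sigma = 1.0   # typical per-input magnitude
mu = 1e-6     # true average over inputs (exponentially small in practice)
values = rng.normal(loc=mu, scale=sigma, size=10_000_000)

for n in (10**2, 10**4, 10**6):
    mc_estimate = values[:n].mean()
    print(f"n={n:>9,}:  |MC error| = {abs(mc_estimate - mu):.1e},  "
          f"|'answer zero' error| = {mu:.1e}")
# The Monte Carlo error ~ sigma/sqrt(n) only drops below |mu| once n is on
# the order of (sigma/mu)^2 = 1e12 samples here.
```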
On the other hand, there are hyperparameter regimes where gradients vanish. Here, too, things seem to work out: since the inputs to neurons in later layers concentrate, you can do very well by approximating the later non-linearities by their linearizations around the concentration point.
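Here's a minimal sketch of that idea (my own toy, with made-up widths, depths, and weight scales; not ARC's algorithm): once the later non-linearities are replaced by their linearizations around the concentration point, everything after the first layer is affine, the fluctuations average to zero, and the mean output is just the mean first-layer activation pushed deterministically through the rest of the network.

```python
import numpy as np

rng = np.random.default_rng(0)
phi = np.tanh

# Toy deep MLP in a vanishing-gradient regime: later-layer weights are scaled
# down so that preactivations in layers >= 2 concentrate over the input x.
d_in, width, depth = 50, 200, 5
W = [rng.normal(0, 1 / np.sqrt(d_in), (width, d_in))]        # layer 1
W += [rng.normal(0, 0.5 / np.sqrt(width), (width, width))    # layers 2..depth
      for _ in range(depth - 1)]
b = [rng.normal(0, 0.1, width) for _ in range(depth)]
w_out = rng.normal(0, 1 / np.sqrt(width), width)

def forward(X):
    A = phi(X @ W[0].T + b[0])
    for Wl, bl in zip(W[1:], b[1:]):
        A = phi(A @ Wl.T + bl)
    return A @ w_out

# Ground truth: brute-force Monte Carlo average over inputs x ~ N(0, I).
X = rng.normal(size=(100_000, d_in))
true_mean = forward(X).mean()

# Linearized estimate: only the first layer is treated exactly (its mean
# activation m is itself estimated by sampling here; in the shallow setting
# this is where the series from the post would go). Layers >= 2 are
# linearized around their concentration points, so their fluctuations have
# zero mean and drop out of the average over inputs.
m = phi(X[:20_000] @ W[0].T + b[0]).mean(axis=0)
for Wl, bl in zip(W[1:], b[1:]):
    m = phi(Wl @ m + bl)
approx_mean = w_out @ m

print(f"brute-force mean: {true_mean:+.4f}   linearized estimate: {approx_mean:+.4f}")
```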
Then there’s the so-called ‘critical hyperparameters’ right in the middle. It turns out that for MLPs the gradients vanish (albeit more slowly) even at criticality. Back-of-the-envelope, it seems to go to zero just fast enough.
There are other convergence-related questions (in particular, a lot of these series are so-called 'asymptotic series', which officially never converge), and at smaller epsilons / higher compute budgets those issues come into play as well, but I don't think they're connected to gradients.
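To illustrate what an asymptotic series is (a textbook example, Euler's divergent series, not our expansion): its partial sums approximate the target integral better and better up to an optimal truncation order of roughly 1/x, after which adding terms makes things strictly worse, so there is a floor on the achievable error no matter how much you compute.

```python
import numpy as np
from scipy.integrate import quad

# Euler's classic divergent asymptotic series (textbook example, unrelated to
# the series in the post):  sum_n (-1)^n n! x^n  "="  int_0^inf e^{-t}/(1+xt) dt.
x = 0.1
true_value, _ = quad(lambda t: np.exp(-t) / (1 + x * t), 0, np.inf)

partial_sum, term = 0.0, 1.0          # term holds (-1)^n * n! at step n
for n in range(25):
    partial_sum += term * x**n
    if n in (2, 5, 10, 15, 20, 24):
        print(f"truncated at N={n:2d}:  |error| = {abs(partial_sum - true_value):.1e}")
    term *= -(n + 1)                  # (-1)^n n!  ->  (-1)^(n+1) (n+1)!
# The error shrinks until N ~ 1/x = 10 and then grows without bound: early
# truncation gives good approximations even though the series never converges.
```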
I am affiliated with ARC and played a major role in the MLP stuff.
I'm loosely familiar with Greg Yang's work, and very familiar with the 'Neural Network Gaussian Process' canon. It's definitely relevant, especially as an intuition pump, but it tends to answer a different question. That literature answers 'what is the distribution of quantities x, y, and z over the set of all NNs?', where x, y, and z might be particular preactivations on specific inputs. Knowing that they are jointly Gaussian with such-and-such covariance has been genuinely useful to us. But the main thing we want is an algorithm that takes in a specific NN with specific weights and tells us about the average over inputs.
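A minimal way to see the two questions side by side (toy code with an arbitrary one-hidden-layer architecture and a standard-normal input distribution, both my choices): the NNGP-style question fixes an input and samples many networks, while our question fixes one network's weights and averages over inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, width = 20, 300

def mlp(x, W1, w2):
    """One-hidden-layer tanh MLP (arbitrary toy architecture)."""
    return np.tanh(x @ W1.T) @ w2

def sample_weights():
    W1 = rng.normal(0, 1 / np.sqrt(d_in), (width, d_in))
    w2 = rng.normal(0, 1 / np.sqrt(width), width)
    return W1, w2

# NNGP-style question: fix an input x0, sample many networks, and look at the
# distribution of the output over theta (approximately Gaussian at large width).
x0 = rng.normal(size=d_in)
over_theta = np.array([mlp(x0, *sample_weights()) for _ in range(5_000)])
print(f"over theta: mean {over_theta.mean():+.3f}, std {over_theta.std():.3f}")

# Our question: fix one particular network (one draw of theta) and ask for the
# average of its output over inputs x -- done here by brute-force sampling,
# which is exactly what a cleverer-than-sampling algorithm should replace.
W1, w2 = sample_weights()
X = rng.normal(size=(100_000, d_in))
print(f"over x, for this one theta: mean {mlp(X, W1, w2).mean():+.4f}")
```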
I've found that this distinction is a powerful antimeme, and every time I give a presentation on the topic I have a slide on the difference between averaging over x versus averaging over theta. By the end of the talk the audience is clamoring to recommend that I read Principles of Deep Learning Theory (which is lovely if you want to improve on NNGP, but not relevant to calculating averages for a specific value of theta).