Do you expect that whether the infinite sum can be approximated is related to whether the architecture averts the vanishing gradient problem?
I am affiliated with ARC and played a major role in the MLP stuff.
The particular infinite sum discussed in this post is used for approximating MLPs with just one hidden layer, so things like vanishing gradients can’t matter.
We are now working on deeper MLPs, and there the vanishing-gradients story definitely does seem relevant. We don't fully understand every detail yet, but I'll mouth off anyway.
On one hand, there are hyperparameter choices where gradients explode. It turns out that in this regime you can match sampling: exponentially large gradients mean that the average is a sum over exponentially many tiny, more-or-less independent regions, so it's exponentially small, and you can beat sampling by just answering zero until your compute budget is exponential. Even then you might not have the budget to go through all of input space, but we have ideas about what you could do.
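To spell the counting argument out, here is a rough back-of-the-envelope under a toy assumption of my own (not ARC's precise model): treat the network output as taking roughly independent values of size O(1) on N ≈ e^{cn} tiny regions of input space.

```latex
% Toy model (my assumption): f takes roughly independent O(1) values on
% N ~ e^{cn} regions of input space.
\[
  \mu \;=\; \mathbb{E}_x[f(x)] \;\approx\; \frac{1}{N}\sum_{i=1}^{N} f_i
  \;=\; O\!\big(N^{-1/2}\big) \;=\; O\!\big(e^{-cn/2}\big).
\]
\[
  \text{``just answer zero'' has error } |\mu| \approx e^{-cn/2},
  \qquad
  k\text{-sample Monte Carlo has error} \approx k^{-1/2},
\]
\[
  \text{so sampling only pulls ahead once } k \gtrsim e^{cn},
  \text{ i.e.\ an exponential compute budget.}
\]
```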
On the other hand, there are regions of hyperparameter space where gradients vanish. Here, too, things seem to work out: since the inputs to neurons in later layers concentrate, you can do very well by approximating the later non-linearities by their linearizations around the concentration point.
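As a toy illustration of the linearization trick (my own sketch with made-up width, depth, and weight scale, not ARC's actual construction), here is a deep tanh MLP in the vanishing-gradient regime where the first layer is kept exact and every later tanh is replaced by its first-order expansion around the mean activation it receives:

```python
import numpy as np

# Toy illustration (not ARC's actual construction): in the vanishing-gradient
# regime, pre-activations in later layers concentrate, so replacing every
# later tanh by its first-order Taylor expansion around the mean activation
# it receives barely changes the network's output.

rng = np.random.default_rng(0)
width, depth = 256, 12
sigma_w = 0.6  # sub-critical weight scale: activations (and gradients) shrink with depth
Ws = [rng.normal(0.0, sigma_w / np.sqrt(width), (width, width)) for _ in range(depth)]
w_out = rng.normal(0.0, 1.0 / np.sqrt(width), width)

def forward(X):
    """Exact forward pass; X has shape (n_samples, width)."""
    for W in Ws:
        X = np.tanh(X @ W.T)
    return X @ w_out

def forward_linearized(X, means):
    """Exact first layer, then every later layer replaced by its linearization
    around the mean activation of the previous layer (means[l] is the mean
    activation after layer l)."""
    X = np.tanh(X @ Ws[0].T)
    for l in range(1, depth):
        mu = means[l - 1]
        h_mu = Ws[l] @ mu                     # pre-activation at the concentration point
        slope = 1.0 - np.tanh(h_mu) ** 2      # tanh'(h_mu)
        X = np.tanh(h_mu) + slope * ((X - mu) @ Ws[l].T)
    return X @ w_out

# Estimate per-layer mean activations from a batch of standard-normal inputs.
X = rng.normal(size=(2000, width))
means, A = [], X
for W in Ws:
    A = np.tanh(A @ W.T)
    means.append(A.mean(axis=0))

exact = forward(X)
approx = forward_linearized(X, means)
print("E[f] by brute force         :", exact.mean())
print("E[f] via linearized tail    :", approx.mean())
print("typical |f(x)|              :", np.abs(exact).mean())
print("worst per-input discrepancy :", np.abs(exact - approx).max())
```

Because the inputs to later neurons concentrate in this regime, the linearized network tracks the exact one closely, and its averages are much easier to reason about.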
Then there are the so-called 'critical' hyperparameters right in the middle. It turns out that for MLPs the gradients vanish (albeit more slowly) even at criticality. Back-of-the-envelope, they seem to go to zero just fast enough.
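Here is a quick numerical illustration of the three regimes (again my own toy setup, not ARC's: a plain tanh MLP with no biases and arbitrary width and depth; for this architecture sigma_w = 1 is the critical weight scale). It backpropagates a random unit vector and prints how its norm changes with depth:

```python
import numpy as np

# Toy illustration (my own setup, not ARC's): a plain tanh MLP with no biases,
# weights W_ij ~ N(0, sigma_w^2 / width).  For this architecture sigma_w = 1 is
# the critical scale.  We backpropagate a random unit vector through the
# input-output Jacobian and watch how its norm changes with depth.

rng = np.random.default_rng(0)
width, depth = 512, 60

def gradient_norms(sigma_w):
    x = rng.normal(size=width)
    Ws, hs = [], []
    for _ in range(depth):                        # forward pass, caching pre-activations
        W = rng.normal(0.0, sigma_w / np.sqrt(width), (width, width))
        h = W @ x
        Ws.append(W)
        hs.append(h)
        x = np.tanh(h)
    delta = rng.normal(size=width)
    delta /= np.linalg.norm(delta)
    norms = []
    for W, h in zip(reversed(Ws), reversed(hs)):  # backward pass, top layer first
        delta = W.T @ (delta * (1.0 - np.tanh(h) ** 2))
        norms.append(np.linalg.norm(delta))
    return norms  # norms[k] = gradient norm after backpropagating through k+1 layers

# Expect (roughly): exponential shrinkage below criticality, much slower
# shrinkage at criticality, exponential growth above it.
for sigma_w in (0.8, 1.0, 2.0):
    norms = gradient_norms(sigma_w)
    summary = ", ".join(f"depth {d}: {norms[d - 1]:.2e}" for d in (1, 20, 40, 60))
    print(f"sigma_w = {sigma_w}: {summary}")
```

You should see roughly exponential shrinkage for sigma_w = 0.8, slow shrinkage at sigma_w = 1.0, and exponential growth at sigma_w = 2.0.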
There are other convergence-related questions (in particular, a lot of these series are so-called 'asymptotic series', which officially never converge), and at smaller epsilons / higher compute budgets those issues come into play as well, but I don't think they're connected to gradients.
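For a feel of what "officially never converges" means, here is the classic textbook example of an asymptotic series (Euler's series for the integral of e^{-t}/(1+xt), not one of ARC's actual series): the partial sums improve up to an optimal truncation order around 1/x and then blow up.

```python
import math
import numpy as np
from scipy.integrate import quad

# Classic textbook example of an asymptotic series (not one of ARC's series):
# F(x) = \int_0^\infty e^{-t} / (1 + x t) dt  has the divergent expansion
# F(x) ~ sum_n (-1)^n n! x^n.  Partial sums improve up to an optimal
# truncation order around n ~ 1/x, then get worse without bound.

x = 0.1
true_value, _ = quad(lambda t: math.exp(-t) / (1.0 + x * t), 0.0, np.inf)

partial = 0.0
for n in range(26):
    partial += (-1) ** n * math.factorial(n) * x ** n
    print(f"order {n:2d}: partial sum = {partial:+.10f}, "
          f"error = {abs(partial - true_value):.2e}")
# The error bottoms out near n ~ 10 (about 1/x) and then grows: adding more
# terms eventually makes the approximation worse, not better.
```

That optimal-truncation wall is the kind of issue that only starts to bite once you push for very small errors.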