Have you seen Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes?
I am affiliated with ARC and played a major role in the MLP stuff.
I’m loosely familiar with Greg Yang’s work, and very familiar with the ‘Neural Network Gaussian Process’ canon. It’s definitely relevant, especially as an intuition pump, but it tends to answer a different question. That literature answers ‘what is the distribution of quantities x, y, and z over the set of all NNs (i.e., over the weights theta)?’, where x, y, and z might be preactivations on specific inputs. Knowing that they are jointly Gaussian with such-and-such covariance has been a powerful intuition pump for us. But the main thing we want is an algorithm that takes in one specific NN with specific weights and tells us about the average over inputs.
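To make the x-versus-theta distinction concrete, here’s a minimal numpy toy (my own illustration, not ARC’s actual algorithm): the same 2-layer MLP, estimated once by averaging over random weights for a fixed input, and once by averaging over random inputs for one fixed set of weights.

```python
import numpy as np

rng = np.random.default_rng(0)
width, d_in = 512, 16

def preactivation(x, W1, W2):
    # Second-layer preactivation of a 2-layer ReLU MLP.
    return W2 @ np.maximum(W1 @ x, 0.0)

def sample_weights():
    # Standard 1/sqrt(fan-in) Gaussian init, the regime where NNGP results apply.
    W1 = rng.normal(0.0, 1.0 / np.sqrt(d_in), (width, d_in))
    W2 = rng.normal(0.0, 1.0 / np.sqrt(width), (1, width))
    return W1, W2

# Average over theta: fix one input, resample the whole network many times.
# NNGP-type results describe this distribution (approximately Gaussian at large width).
x_fixed = rng.normal(size=d_in)
over_theta = np.array([preactivation(x_fixed, *sample_weights())[0] for _ in range(2000)])

# Average over x: fix one specific network, resample the input many times.
# This is the average we actually want an algorithm for; NNGP says nothing direct about it.
W1_fixed, W2_fixed = sample_weights()
over_x = np.array([preactivation(rng.normal(size=d_in), W1_fixed, W2_fixed)[0] for _ in range(2000)])

print(f"over theta: mean={over_theta.mean():.3f} var={over_theta.var():.3f}")
print(f"over x:     mean={over_x.mean():.3f} var={over_x.var():.3f}")
```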
I’ve found that this distinction is a powerful antimeme, and every time I give a presentation on the topic I have a slide on the difference between averaging over x versus averaging over theta. By the end of the talk the audience is clamoring to recommend I read Principles of Deep Learning Theory (which is lovely if you want to improve on NNGP, but not relevant to calculating averages for a specific value of theta).
Yeah, while I noticed the distinction, I usually find it worthwhile to try to steal tools across problem statements that use the same words in a different order. I’ll use your data point to downweight that heuristic a little, thanks :p
Did knowing that the joint-Gaussian thing generalizes to RNNs influence your decision to look at RNNs next?
Honestly, for me it’s more of a strike against RNNs. Real deep neural networks that have been trained don’t have this property, so it’s a bridge we’re going to need to cross at some point regardless. From a de-risking point of view I’d kind of like to get to that point ASAP. There’s a lot of talk about looking at random Boolean circuits (which very obviously don’t have this property), narrow MLPs, or even jumping all the way to wide MLPs trained in some sort of mean-field/maximal-update regime that gets rid of it; a toy sketch of what I mean by that last one is below.
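Here’s a minimal sketch of the scaling difference under my reading of mean-field / maximal-update-style (muP-style) parameterizations, not ARC’s code: the readout layer carries an extra 1/width factor, so the output starts near zero and hidden features have to move by O(1) during training, which is what breaks the fixed-Gaussian picture.

```python
import numpy as np

rng = np.random.default_rng(1)
width, d_in = 4096, 16
x = rng.normal(size=d_in)
# Hidden features of a random wide ReLU layer with standard 1/sqrt(fan-in) init.
h = np.maximum(rng.normal(0.0, 1.0 / np.sqrt(d_in), (width, d_in)) @ x, 0.0)

# Standard / NTK-style readout: entries of scale 1/sqrt(width).
# Output is O(1) at init and the wide-network limit stays in the GP/kernel regime.
a_std = rng.normal(0.0, 1.0 / np.sqrt(width), width)

# Mean-field / maximal-update-style readout: an explicit 1/width scale.
# Output starts near zero, so hidden features must change by O(1) during training
# for the output to move, which destroys the fixed-Gaussian description.
a_mup = rng.normal(0.0, 1.0, width) / width

print("standard readout at init: ", a_std @ h)   # O(1)
print("muP-style readout at init:", a_mup @ h)   # O(1/sqrt(width))
```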