I intended my comment to apply to “theories of deep learning” in general; the NTK was only meant as an example. I agree that the NTK has problems such that it can at best be a ‘provisional’ grand theory. The big question is how to think about feature learning. At this point, though, there are a lot of contenders for “feature learning theories”—the Maximal Update Parameterization, Depth Corrections to the NTK, Perturbation Theory, Singular Learning Theory, Stochastic Collapse, SGD-Induced Sparsity…
So although I don’t think the NTK can be a final answer, I do like the idea of studying it in more depth—it provides a feature-learning-free baseline against which we can compare actual neural networks and other potential ‘grand theories’. Exactly which phenomena can we not explain with the NTK, and which theory best predicts them?
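For concreteness, the baseline I have in mind is the standard linearized picture (nothing here is specific to this thread, just fixing notation): the NTK is

$$\Theta(x, x') = \nabla_\theta f(x;\theta_0)^\top \nabla_\theta f(x';\theta_0),$$

and in the appropriate infinite-width limit, training the network is equivalent to training the linearized model

$$f_{\mathrm{lin}}(x;\theta) = f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^\top (\theta - \theta_0),$$

i.e. kernel regression with a kernel that is fixed at initialization. The features $\nabla_\theta f(x;\theta_0)$ never move during training, which is the precise sense in which this baseline involves no feature learning.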
Strong upvote to Zach’s comment; it basically encapsulates my view (except that I don’t know what the “tractable-looking directions” he mentions are—Zach, can you elaborate?)
Exactly which phenomena can we not explain with the NTK

I’d turn that around: is there any explanation of why LLMs can do real-world task X and not real-world task Y that appeals to NTKs? (Not a rhetorical question: there may well be, I just haven’t seen one.)
Yeah, I can expand on that—this is obviously going to be fairly opinionated, but there are a few things I’m excited about in this direction.
The first thing that comes to mind here is singular learning theory. I think all of my thoughts on DL theory are fairly strongly influenced by it at this point. It definitely doesn’t have all the answers at the moment, but it’s the theory I’ve found that does the most to make deep learning phenomena substantially “less surprising” (bonus points for these ideas preceding deep learning). For instance, one of the first things SLT tells you is that the effective parameter count (RLCT) of your model can vary depending on the training distribution, allowing it to basically do internal model selection—the absence of a bias-variance tradeoff and the success of overparameterized models aren’t surprising once you internalize this. The “connection to real world structure” aspect hasn’t been fully developed here, but it seems heavily suggested by the framework, in multiple ways—for instance, hierarchical statistical models are naturally singular statistical models, and the hierarchical structure is reflected in the singularities. (See also Tom Waring’s thesis.)
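To gesture at where the ‘effective parameter count’ enters (a rough statement of Watanabe’s free energy asymptotics, quoted from memory, so treat the error terms loosely): for a model with $d$ parameters, empirical negative log likelihood $L_n$, prior $\varphi$, and an optimal parameter $w_0$, the Bayesian free energy satisfies

$$F_n = -\log \int e^{-n L_n(w)}\,\varphi(w)\,dw \;\approx\; n L_n(w_0) + \lambda \log n,$$

where $\lambda$ is the real log canonical threshold (RLCT). For a regular model $\lambda = d/2$, but for singular models $\lambda \le d/2$ and it depends on both the parameterization and the true distribution, so the quantity playing the role of ‘parameter count’ in the asymptotics genuinely varies with what the model is being fit to.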
Outside of SLT, there are a few other areas I’m excited about—I’ll highlight just one. You mentioned Lin, Tegmark, and Rolnick—the broader literature on depth separations and the curse of dimensionality seems quite important. The approximation abilities of NNs are usually glossed over with universal approximation arguments, but this can’t be enough—for generic Lipschitz functions, universal approximation takes exponentially many parameters in the input dimension (this is a provable lower bound). So there has to be something special about the functions we care about in the real world. See this section of my post for more information. I’d highlight Poggio et al. here, which is the paper in the literature closest to my current view on this.
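To make the lower bound concrete (rates quoted from memory, so the constants and exact smoothness assumptions are worth double-checking): approximating a generic $L$-Lipschitz function $f : [0,1]^d \to \mathbb{R}$ to uniform accuracy $\varepsilon$ requires on the order of

$$N \sim (L/\varepsilon)^{d}$$

parameters for essentially any parameterized family, so universal approximation alone buys nothing in high dimension. The Poggio et al. result is, roughly, that if the target is compositional, say a binary tree of constituent functions each depending on only two variables, then a deep network matching that structure needs only on the order of $d \cdot \varepsilon^{-2}$ parameters (for Lipschitz constituents), while a shallow network still pays the $\varepsilon^{-d}$ price. That gap is the formal version of ‘there has to be something special about the functions we care about’.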
This isn’t a complete list, even of theoretical areas that I think could specifically help address the “real world structure” connection, but these are the two I’d feel bad not mentioning. This doesn’t include some of the more empirical findings in the science of DL that I think are relevant, like simplicity bias, mode connectivity, grokking, etc., or work outside DL that could be helpful to draw on, like Boolean circuit complexity, algorithmic information theory, natural abstractions, etc.
FWIW, most potential theories of deep learning are able to explain these (the success of overparameterized models and the absence of a bias-variance tradeoff); I don’t think this distinguishes SLT particularly much.
Agreed—that alone isn’t particularly much, just one of the easier things to express succinctly. (Though the fact that this predates deep learning does seem significant to me. And the fact that SLT can delineate precisely where statistical learning theory went wrong here seems important too.)
Another is that SLT can explain phenomena like phase transitions, as observed in e.g. toy models of superposition, at a quantitative level. There’s also been a substantial chunk of non-SLT ML literature that has independently rediscovered small pieces of SLT, like failures of information geometry, the importance of parameter degeneracies, etc. More speculatively, and what excites me most, empirical phenomena like grokking, mode connectivity, and circuits seem to fit naturally into SLT, though this hasn’t been demonstrated rigorously yet.
any explanation of why LLMs can do real-world task X and not real-world task Y that appeals to NTKs?

I don’t think there are any. Of course, much the same could be said of other deep learning theories and most (all?) interpretability work. The difference, as far as I can tell, is that there is a clear pathway to getting such explanations from the NTK: you’d want to do a spectral analysis of the sorts of functions learnable by transformer-NTKs. It’s just that nobody has bothered to do this! That’s why I think this line of research is neglected relative to interpretability or developing a new theoretical analysis of deep learning. Another obvious thing to try: NTKs often empirically perform comparably to finite networks, but are usually a few percentage points worse in accuracy. Can we say anything about the examples where the NTK fails? Do they particularly depend on ‘feature learning’? I think NTKs are a good complement to mechinterp in this regard: since they treat the weights at each neuron as independent of all others, they provide a good indicator of exactly which examples may require interacting ‘circuits’ to be correctly classified.
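Here is a minimal sketch of the second comparison suggested above, under stated assumptions: the data, architecture, and hyperparameters are toy placeholders, and the empirical NTK of a small finite network at initialization stands in for the infinite-width kernel. It fits a kernel regressor with the empirical NTK, trains the same network normally, and reports the test examples that only the kernel predictor gets wrong.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy binary classification data (purely illustrative placeholder).
d, n_train, n_test = 10, 200, 100
X_train, X_test = torch.randn(n_train, d), torch.randn(n_test, d)
w_true = torch.randn(d)
y_train, y_test = torch.sign(X_train @ w_true), torch.sign(X_test @ w_true)

net = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, 1))
params = list(net.parameters())

def per_example_grads(X):
    """Rows are gradients of the scalar network output w.r.t. all parameters."""
    rows = []
    for x in X:
        out = net(x.unsqueeze(0)).squeeze()
        grads = torch.autograd.grad(out, params)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    return torch.stack(rows)

# Empirical NTK at initialization: Theta[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>.
J_train, J_test = per_example_grads(X_train), per_example_grads(X_test)
K_train = J_train @ J_train.T
K_cross = J_test @ J_train.T

# Kernel ridge regression with the empirical NTK: the feature-learning-free predictor.
alpha = torch.linalg.solve(K_train + 1e-3 * torch.eye(n_train), y_train.unsqueeze(1))
ntk_pred = torch.sign(K_cross @ alpha).squeeze(1)

# Train the same finite network normally on the same data.
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(500):
    opt.zero_grad()
    loss = ((net(X_train).squeeze(-1) - y_train) ** 2).mean()
    loss.backward()
    opt.step()
net_pred = torch.sign(net(X_test).squeeze(-1).detach())

# The interesting set: examples the trained network gets right but the NTK predictor misses.
ntk_only_fails = (net_pred == y_test) & (ntk_pred != y_test)
print("NTK accuracy:", (ntk_pred == y_test).float().mean().item())
print("Net accuracy:", (net_pred == y_test).float().mean().item())
print("Indices where only the NTK fails:", ntk_only_fails.nonzero().flatten().tolist())
```

On real data one would do this with the actual model and dataset of interest (or a proper infinite-width kernel, e.g. via the neural-tangents library), but the structure of the comparison, and the question of whether the NTK-only failures look like ‘circuit’ examples, is the same.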