Hm. I’m going to first nitpick your specific example of pre-standard model physics and the sun, then respond to the overall point.
My impression is that pre-WW2 physics had a lot more wins re: the sun than either deep learning theory circa 2016 or even current learning mechanics (as used in the Simon et al paper) has for current mech interp. (I’m not familiar with your unpublished mean field theory work, and I also haven’t been keeping up to date on Singular Learning Theory work.)
From the spectral lines we knew that the sun was made of hydrogen and helium by the late 1920s at the very latest: helium was first discovered in the sun in the late 1800s via its spectral lines, and Cecilia Payne-Gaposchkin’s PhD thesis Stellar Atmospheres, which established the sun’s atmospheric composition, was published in 1925.
Aston had built the mass spectrograph by 1920, which allowed him to measure the masses of helium and hydrogen nuclei. It turns out that four hydrogen nuclei (that is, 4 protons) weigh a statistically significant amount more than a helium nucleus (Google tells me ~0.7%, though idk how precise their measurements were at the time). In the 1920s, Eddington argued that this mass defect was the source of energy for the sun (though he didn’t understand how the sun fused hydrogen into helium).
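As a quick sanity check on that ~0.7% figure, here’s the back-of-the-envelope arithmetic, using modern mass values rather than Aston’s:

```python
# Back-of-the-envelope: mass defect of 4 hydrogen nuclei vs. one He-4 nucleus,
# using modern values in unified atomic mass units (u).
M_PROTON = 1.007276       # u
M_HE4_NUCLEUS = 4.001506  # u
U_TO_MEV = 931.494        # MeV released per u of mass converted (E = mc^2)

defect_u = 4 * M_PROTON - M_HE4_NUCLEUS
fraction = defect_u / (4 * M_PROTON)
energy_mev = defect_u * U_TO_MEV

print(f"mass defect: {defect_u:.4f} u ({fraction:.2%} of the input mass)")
print(f"energy per fusion: {energy_mev:.1f} MeV")
# -> ~0.0276 u, i.e. ~0.7%, worth ~26 MeV per helium nucleus formed
```

Scaled up by the number of hydrogen nuclei in the sun, that 0.7% is the energy budget Eddington was pointing at.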
The Hertzsprung–Russell diagram was plotted by the early 1900s. A naive interpretation says that stars would move left and up along this diagram over time, though Eddington’s 1920s work (and the existence of the mass–luminosity relationship between stars) suggested that the main sequence was a result of stars having different masses, not a track they’d move along.
We had observations of white dwarfs by the 1920s, we knew they were held up by electron degeneracy pressure by late 1926 (Pauli’s exclusion principle was published in 1925), and Chandrasekhar did the relativistic corrections to get the mass limit on white dwarfs by 1930.
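And for a sense of how directly that mass limit falls out of those ingredients, here’s a rough back-of-the-envelope version of Chandrasekhar’s result (the sqrt(3*pi)/2 and omega ~ 2.018 prefactors are textbook polytrope values I’m taking on faith here):

```python
import math

# Rough sketch of the Chandrasekhar mass,
#   M_Ch ~ prefactor * (hbar*c/G)^(3/2) / (mu_e * m_H)^2,
# where the O(1) prefactor sqrt(3*pi)/2 * omega (omega ~ 2.018, from the
# n=3 Lane-Emden solution) is the standard textbook value.
HBAR = 1.054572e-34   # J*s
C = 2.99792458e8      # m/s
G = 6.67430e-11       # m^3 kg^-1 s^-2
M_H = 1.6726e-27      # kg, hydrogen/proton mass
M_SUN = 1.989e30      # kg
MU_E = 2.0            # nucleons per electron for He/C/O composition

prefactor = math.sqrt(3 * math.pi) / 2 * 2.018
m_ch = prefactor * (HBAR * C / G) ** 1.5 / (MU_E * M_H) ** 2

print(f"M_Ch ~ {m_ch / M_SUN:.2f} solar masses")  # -> ~1.4
```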
Chandrasekhar’s limit also suggested the existence of neutron stars, which were proposed in the 1930s (the neutron itself was discovered in 1932, though neutron stars weren’t observed until decades later). Baade and Zwicky’s 1934 paper proposed that the transition from main sequence → neutron star was the source of supernovae.
Much earlier work in biology—e.g. the speed of evolution and the age of the fossil record—already suggested that the sun and the earth had been around in something like their current configuration for hundreds of millions if not billions of years, which put pretty strong limits on the speed of stellar evolution.
The main missing piece for predicting the fate of the sun was that it wasn’t obvious how white dwarfs came to be, or even when stars would leave the main sequence. (Probably the first explanation was from Chandrasekhar in 1942.) We were also missing a lot of facts about how exactly the sun fused hydrogen into helium and how supernovae would happen.
So I think that, while people did not know whether the sun would go nova or somehow transition into a white dwarf, we knew the composition of the sun (1), knew the mechanism through which it generated energy (2), had statistical relationships between stellar mass, luminosity, and temperature (3), had proposed two stellar remnants that the sun could eventually become (4, 5), and had a strong sense of the time scale on which these events would happen (6).
Arguably, deep learning theory circa 2016 had only a (6) analogue (namely, how well deep learning seemed to work in practice). Deep learning theory in 2026 has in addition its version of (3) (scaling laws, which were first noted around 2020) and rudimentary analogues of (4) and (5) (NTKs/infinite width/depth limits in general, perhaps?). But we don’t have any analogues for (1) or (2) (which are the mechanisms through which SGD on deep neural networks produces smart systems) except in a very rudimentary way (something about circuits, gradient scales, and loss landscapes, perhaps?), and there’s a fair amount of evidence that our (4)/(5) analogues aren’t great.
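(An aside to make the (3) analogy concrete: a scaling law is the same kind of object as the mass–luminosity relation, an empirical power law fit across a population. A minimal sketch, with made-up constants standing in for real measurements:)

```python
import numpy as np

# Toy illustration: fit a power-law scaling law L(N) = a * N**(-b) to
# (parameter count, loss) pairs, the way one fits mass-luminosity data.
# The data is synthetic; a = 5.0 and b = 0.08 are invented for illustration.
n_params = np.array([1e6, 1e7, 1e8, 1e9, 1e10])
noise = np.random.default_rng(0).normal(0, 0.01, 5)
loss = 5.0 * n_params ** -0.08 * np.exp(noise)

# Fit log(loss) = log(a) - b * log(N) by least squares.
slope, log_a = np.polyfit(np.log(n_params), np.log(loss), 1)
print(f"fitted exponent b ~ {-slope:.3f}, constant a ~ {np.exp(log_a):.2f}")
```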
Specifically, I think the relevant work for understanding the fate of the sun (at least on the scale of “does it go nova”) wasn’t so much the standard model as relativistic quantum mechanics + general relativity, a lot of which had been worked out by the 1930s. Spectral lines gave us the composition of the sun, E=mc^2 suggested the mechanism for energy generation, and relativistic quantum mechanics pointed to two possible stellar remnants.
I think there are two ways to make your case go through, though:
First, you can argue that, circa 1935, it’d seem that additional work on “pure” quantum mechanics would not help our understanding of the sun, when in fact it was demonstrably very important. So from a “figure out if the sun will explode” standpoint, it might look like we shouldn’t be spending any additional research effort on QM in itself, which would cause us to miss a bunch of facts about the fate of the sun. Examples:
iirc you need a theory of weak interactions to compute the rate of p + p → d + e⁺ + ν in the sun.
for decades, we had no explanation of why the sun seemed to produce far fewer neutrinos than expected, and this one actually does require the standard model to solve (though to be fair, we also didn’t notice the problem until the late 1960s).
(Claude tells me that) the actual math for parts of the red giant → white dwarf transition for our sun involved a lot of complicated QM.
Second, I think the closest analogue to the Zhang et al paper re: the Sun and physics probably happened with Kelvin’s calculations of the age of the sun and the earth. He didn’t know about radioactivity or E=mc^2, so his calculations gave the age of the sun and earth as something like 20 million years. This was far below the hundreds-of-millions-to-billions-of-years bound set by biology results at the time. Maybe one would then have concluded in the late 1800s that you should not study classical mechanics in order to understand the sun, or at least, that further investment in research like that was unlikely to produce relevant results.
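For the record, the calculation Kelvin was effectively doing is the gravitational-contraction (Kelvin–Helmholtz) timescale, and you can reproduce his order of magnitude in a few lines (modern solar values; the 10% core fraction and 0.7% efficiency in the fusion estimate are standard rough assumptions):

```python
# Kelvin-Helmholtz timescale: how long the sun could shine on gravitational
# contraction alone, t ~ G*M^2 / (R*L), vs. the fusion-powered timescale.
G = 6.674e-11        # m^3 kg^-1 s^-2
M_SUN = 1.989e30     # kg
R_SUN = 6.957e8      # m
L_SUN = 3.828e26     # W
C = 2.998e8          # m/s
YEAR = 3.156e7       # s

t_kh = G * M_SUN**2 / (R_SUN * L_SUN) / YEAR
print(f"gravitational contraction: ~{t_kh / 1e6:.0f} million years")  # ~30 Myr

# Fusion: ~0.7% of rest mass released, and only ~10% of the sun's hydrogen
# (the core) ever fuses -- both rough standard assumptions.
t_nuc = 0.007 * 0.1 * M_SUN * C**2 / L_SUN / YEAR
print(f"fusion-powered: ~{t_nuc / 1e9:.0f} billion years")  # ~10 Gyr
```

The gap between ~30 million years and the biologists’ hundreds of millions is exactly the anomaly described above.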
Responses to these two cases:
On the first: it’s plausible that this argument might hold up—my technical QM knowledge basically ends with work from the 1940s. But I think it’s not as strong an argument as “your argument implies people shouldn’t study theoretical physics”, because so much of stellar evolution could be (and had been) worked out with just basic relativistic QM and general relativity. So a QFT person could point to the success of relativistic QM as evidence that additional investment in QM research would further our understanding of the stars.
On the second: I think the analogous dismissal of theoretical research here is basically correct—I’m pretty sure that no amount of additional theoretical work without knowledge of radioactivity, QM, or relativity would’ve let Kelvin calculate the age of the sun correctly from first principles, let alone figure out whether it’d eventually explode.
To clarify, my case isn’t so much “don’t invest in theory research” (I actually think we should invest in much more theory research!) as “I don’t expect classical learning theory approaches to work, and the standard academic response to this (learning mechanics) seems to have just dropped the level of ambition so much as to be uninteresting”. Specifically, I think any theoretical approach that seeks to explain large chunks of deep learning should be able to articulate a clear novel insight à la relativity or QM that differentiates it from pre-2019 learning theory, or else be treated with a level of skepticism that can only be overcome with interesting empirical results.
The Singular Learning Theory people have several insights of this form—e.g. that the old learning theory both used a wrong notion of generalization and failed to account for degeneracies in the loss landscape. ARC Theory and the comp mech people both have their own versions of these insights as well. I suspect you can articulate something like this for your work with Kareel (though I haven’t seen it)—iirc it was something like “modern versions of mean field theory in physics can handle higher order interactions than the earlier approaches that people used for studying generalization”?
Otherwise your picture makes sense. I think the “learning theory” that I interact with is quite different from what’s typically encountered in interp world (and this needs fixing). In particular, what you call the SLT insights are in fact much older and standard (and in general aren’t related to singularities).
Very good nitpicks. I definitely don’t know my physics history well (but even with my limited knowledge, I was gesturing at a cartoon level of understanding that mixed different early-20th-century pictures of different phenomena).
Re mean field—it’s not higher order, but lower order. Mean field is to NTK what classical mechanics is to quantum mechanics (in particular, NTK + higher order corrections still has most of the bad generalization properties of NTK). The new insight is that while the expansion in NTK is always around a trivial classical theory, nontrivial classical theories also exist and are better behaved from a complexity viewpoint.
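(For readers following along: the “trivial classical theory” here is, as I understand it, the textbook NTK/lazy-training statement, not the unpublished mean-field work itself. In the NTK limit the trained network never leaves the linearization around its random initialization, so learning is kernel regression with a kernel frozen at initialization and the features never move; mean-field-style parameterizations are ones where the features themselves evolve.)

```latex
% Standard NTK linearization: training stays in the regime
f(x;\theta) \approx f(x;\theta_0)
  + \langle \nabla_\theta f(x;\theta_0),\, \theta - \theta_0 \rangle ,
% i.e. kernel regression with the fixed kernel
K_{\mathrm{NTK}}(x,x') =
  \langle \nabla_\theta f(x;\theta_0),\, \nabla_\theta f(x';\theta_0) \rangle .
```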