A continuous manifold of possible technologies is not required for continuous progress. All that is needed is for there to be many possible sources of improvements that can accumulate, and for these improvements to be small once low-hanging fruit is exhausted.
Case in point: the NanoGPT speedrun, where the training time of a small LLM was reduced by 15x using 21 distinct innovations that touched basically every part of the model, including the optimizer, embeddings, attention, other architectural details, quantization, hyperparameters, code optimizations, and the PyTorch version.
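As a rough sanity check on how small the individual gains can be while still compounding to 15x, here is a back-of-the-envelope sketch (illustrative arithmetic only; it assumes equal multiplicative gains, which the actual speedrun improvements certainly were not):

```python
# Back-of-the-envelope: if 21 stacked improvements compound to a 15x overall
# speedup, how large does each one need to be on average? (Illustrative only;
# the real speedrun gains were not uniform in size.)
n_improvements = 21
overall_speedup = 15.0

per_step = overall_speedup ** (1 / n_improvements)
print(f"average per-improvement speedup: {per_step:.3f}x")  # ~1.14x, i.e. ~14% each

# Sanity check: 21 such small gains recover the full 15x.
print(f"compounded: {per_step ** n_improvements:.1f}x")     # ~15.0x
```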
Most technologies are like this, and frontier AI has even more sources of improvement than the NanoGPT speedrun because you can also change the training data and hardware. It’s not impossible that there’s a moment in AI like the invention of lasers or the telegraph, but this doesn’t happen with most technologies. The fact that we have scaling laws somewhat points towards continuity, even as other things, like small differences being amplified in downstream metrics, point towards discontinuity. Also see my comment here on a similar topic.
If you think generalization is limited in the current regime, try to create AGI-complete benchmarks that the AIs won’t saturate until we reach some crucial innovation. People keep trying this, and the benchmarks keep getting saturated every year.
Because these benchmarks are all in the LLM paradigm: single input, single output from a single distribution. Or they are multi-step problems on rails. Easy verification makes for benchmarks that can quickly be cracked by LLMs. Hard verification makes for benchmarks that aren’t used.
One could let models play new board/computer games against average humans: Video/image input, action output.
One could let models offer and complete tasks autonomously on freelancer platforms.
One could enrol models in remote universities and see whether they autonomously reach graduation.
It’s not difficult to come up with hard benchmarks for current models (and these proposals are nowhere near AGI-complete). I think people don’t do this because they know that current models would be hopeless at benchmarks that actually target their shortcomings (agency, knowledge integration + integration of sensory information, continuous learning, reliability, …).
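To make the contrast concrete, a single-turn benchmark is essentially “score = grade(model(prompt))”, whereas the open-ended setups above look more like the loop below. This is only a hedged sketch; the env and model interfaces are hypothetical, not any existing harness:

```python
# Hedged sketch of an open-ended agentic evaluation, in contrast to the
# single-input -> single-output benchmark pattern. `env` and `model` are
# hypothetical interfaces, not an existing library.
def run_agentic_eval(env, model, max_steps=10_000):
    obs = env.reset()                    # e.g. a game screen or a freelancer task page
    memory = []                          # the agent must integrate information over time
    for _ in range(max_steps):
        action = model.act(obs, memory)  # video/image/text in, action out
        memory.append((obs, action))
        obs, done = env.step(action)
        if done:
            break
    return env.final_score()             # e.g. games won, tasks completed and paid
```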
Easy verification makes for benchmarks that can quickly be cracked by LLMs. Hard verification makes for benchmarks that aren’t used.
Agree, this is one big limitation of the paper I’m working on at METR. The first two ideas you listed are things I would very much like to measure; the third is something I would also like to measure, but it is much harder than any current benchmark, given that university takes humans years rather than hours. If we measure it right, we could tell whether generalization is steadily improving or plateauing.
I think you should address Thane’s concrete example:
For example: “fully-connected neural networks → transformers” definitely wasn’t continuous.
That seems to me a pretty damn solid knock-down counterargument. There were no continuous language model scaling laws before the transformer architecture, and not for lack of people trying to make language nets.
There were no continuous language model scaling laws before the transformer architecture
https://arxiv.org/abs/1712.00409 was technically published half a year after transformers, but it shows power-law language model scaling laws for LSTMs (several years before the Kaplan et al. paper, and without citing the transformer paper). It’s possible that transformer scaling laws are much better; I haven’t checked (and perhaps more importantly, transformer training lets you parallelize across tokens). I’m just mentioning this because it seems relevant to the overall discussion of continuity in research.
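For concreteness, “power-law scaling law” here just means loss that is well fit by L(D) = a·D^(−b), i.e. a straight line in log-log space. A minimal sketch of such a fit, with invented numbers rather than the paper’s actual LSTM measurements:

```python
import numpy as np

# Fit a power law L(D) = a * D^(-b) to (data size, loss) pairs.
# The numbers below are invented for illustration; see the linked paper
# (arxiv.org/abs/1712.00409) for the actual LSTM measurements.
data_sizes = np.array([1e6, 1e7, 1e8, 1e9])  # training tokens (hypothetical)
losses     = np.array([5.2, 4.1, 3.3, 2.7])  # validation loss (hypothetical)

# A power law is a straight line in log-log space, so fit log(loss) vs log(size).
slope, intercept = np.polyfit(np.log(data_sizes), np.log(losses), 1)
a, b = np.exp(intercept), -slope
print(f"fitted L(D) ≈ {a:.2f} * D^(-{b:.3f})")
```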
I also agree with Thomas Kwa’s sibling comment that transformers weren’t a single huge step. Fully-connected neural networks seem like a very strange comparison to make; I think the interesting question is whether transformers were a sudden single step relative to LSTMs. But I’d disagree even with that: attention was introduced three years before transformers and was a big deal for machine translation, and self-attention was introduced somewhere between the first attention papers and transformers. And the transformer paper itself isn’t atomic; it consists of multiple ideas. Replacing RNNs/LSTMs with self-attention is clearly the big one, but my impression is that multi-head attention, scaled dot-product attention, and the specific architecture were pretty important to actually get their impressive results.
To be clear, I agree that there are sometimes new technologies that are very different from the previous state of the art, but I think it’s a very relevant question just how common this is, in particular within AI. IMO the most recent great example is neural machine translation (NMT) replacing complex hand-designed systems starting in 2014: NMT worked very differently than the previous best machine translation systems, and it surpassed them very quickly (by 2014 standards for “quick”). I expect something like this to happen again eventually, but it seems important to note that this was 10 years ago, and that progress since then has mostly been driven by many different innovations (+ scaling).
ETA: maybe a crux is just how impressive progress over the past 10 years has been, and what it would look like to have “equivalent” progress before the next big shift. But I feel like in that case, you wouldn’t count transformers as a big important step either? My main claim here is that to the extent that there’s been meaningful progress over the past 10 years, it was mostly driven by a large set of small-ish improvements and gradual shifts of the paradigm.
Though the fully-connected → transformers transition wasn’t infinitely many small steps, it definitely wasn’t a single step either. We had to invent various sub-innovations like skip connections separately, progressing from RNNs to LSTMs to GPT/BERT-style transformers to today’s transformer++. The most you could claim as a single step is LSTM → transformer.
Also, if you graph perplexity over time, there’s basically no discontinuity from the introduction of transformers, just a possible change in slope that might be an artifact of switching from the purple to the green measurement method. The story looks more like transformers being better able to utilize the exponentially increasing amounts of compute that people started using just before their introduction, which caused people to invest more in compute and other improvements over the next 8 years.
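To make the “slope change vs. discontinuity” distinction concrete, one simple check is to fit separate trend lines before and after a candidate break year and compare the jump in level against the change in slope. A minimal sketch with invented data (not the Epoch numbers):

```python
import numpy as np

# Distinguish a level discontinuity from a slope change at a candidate break
# year t0: fit separate lines before and after t0, then compare (a) the jump
# in level at t0 and (b) the ratio of slopes. Data below is invented.
years   = np.array([2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020])
log_ppl = np.array([4.60, 4.45, 4.30, 4.15, 4.00, 3.80, 3.60, 3.40, 3.20])
t0 = 2017  # e.g. the introduction of a new architecture

before, after = years < t0, years >= t0
m1, b1 = np.polyfit(years[before], log_ppl[before], 1)
m2, b2 = np.polyfit(years[after],  log_ppl[after],  1)

jump = (m2 * t0 + b2) - (m1 * t0 + b1)  # level discontinuity at t0
print(f"slope before: {m1:.3f}/yr, after: {m2:.3f}/yr (ratio {m2/m1:.2f}x)")
print(f"level jump at {t0}: {jump:+.3f} log-perplexity")
```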
We could get another single big architectural innovation that gives better returns to more compute, but I’d give a 50-50 chance that it would be only a slope change, not a discontinuity. Even conditional on discontinuity it might be pretty small. Personally my timelines are also short enough that there is limited time for this to happen before we get AGI.
This argument still seems to postdict that cars were invented by tinkering with carriages and horse-breeding, that spacecraft were invented by tinkering with planes, that refrigerators were invented by tinkering with cold cellars, et cetera.
If you take the snapshot of the best technology that does X at some time T, and trace its lineage, sure, you’ll often see the procession of iterative improvements on some concepts and techniques. But that line won’t necessarily pass through the best-at-X technologies at times from 0 to T − 1.
The best personal transportation method was horses, then cars. Cars were invented by iterating on preceding technologies and putting them together, but horses weren’t involved. Similarly for the best technology for lifting a human being into the sky, the best technology for keeping food cold, etc.
I expect that’s the default way significant technological advances happen. They don’t come from tinkering with the current-best-at-X tech. They come from putting together a bunch of insights from different or non-mainstream tech trees, and leveraging them for X in a novel way.
And this is what I expect for AGI. It won’t come from tinkering with LLMs, it’ll come from a continuous-in-retrospect, surprising-in-advance contribution from some currently-disfavored line(s) of research.
(Edit: I think what I would retract, though, is the point about there not being a continuous manifold of possible technological artefacts. I think something like “the space of ideas the human mind is capable of conceiving” is essentially it.)
I think we have two separate claims here:
1. Do technologies that have lots of resources put into their development generally improve discontinuously or by huge slope changes?
2. Do technologies often get displaced by technologies with a different lineage?
I agree with your position on (2) here. But it seems like the claim in the post that sometime in the 2030s someone will make a single important architectural innovation that leads to takeover within a year mostly depends on (1), as it would require progress within that year to be comparable to all the progress from now until that year. Also you said the architectural innovation might be a slight tweak to the LLM architecture, which would mean it shares the same lineage.
The history of machine learning seems pretty continuous with respect to advance prediction. In the Epoch graph, the line fit on loss of the best LSTM up to 2016 sees a slope change of less than 2x, whereas a hypothetical innovation that causes takeover within a year, with not much progress in the intervening 8 years, would imply a ~8x slope change. So it seems more likely to me (conditional on 2033 timelines and a big innovation) that we get some architectural innovation with a moderately different lineage in 2027, it overtakes transformers’ performance in 2029, and it afterward causes the rate of AI improvement to increase by something like 1.5x-2x.
2 out of 3 of the technologies you listed probably have continuous improvement despite the lineage change:
1910-era cars were only a little better than horses, and the slope of the overall speed at which someone could travel long distances in the US probably increased by <2x after cars, due to things like road quality improvements before cars and improvements in ships and rail (though maybe railroads were a discontinuity, not sure).
Before modern refrigerators we had low-quality refrigeration that would contaminate the ice with ammonia, and before that people shipped ice from Maine, so I would expect the cost/quality of refrigeration to have seen much less than an 8x slope change at the advent of mechanical refrigeration.
Only rockets were actually a discontinuity.
Tell me if you disagree.
Indeed, and I’m glad we’ve converged on (2). But...
Do technologies that have lots of resources put into their development generally improve discontinuously or by huge slope changes?
… On second thoughts, how did we get there? The initial disagreement was how plausible it was for incremental changes to the LLM architecture to transform it into a qualitatively different type of architecture. It’s not about continuity-in-performance, it’s about continuity-in-design-space.
Whether finding an AGI-complete architecture would lead to a discontinuous advancement in capabilities, to FOOM/RSI/sharp left turn, is a completely different topic from how smoothly we should expect AI architectures’ designs to change. And on that topic, (a) I’m not very interested in reference-class comparisons as opposed to direct gears-level modeling of this specific problem, (b) this is a bottomless rabbit hole/long-standing disagreement which I’m not interested in going into at this time.
2 out of 3 of the technologies you listed probably have continuous improvement despite the lineage change
That’s an interesting general pattern, if it checks out. Any guesses why that might be the case?
My instinctive guess is that new-paradigm approaches tend to start out promising in theory but initially very bad; people then tinker with prototypes, and the technology becomes commercially viable the moment it’s at least marginally better than the previous-paradigm SOTA. Which is why there’s an apparent performance-continuity despite a lineage/paradigm-discontinuity.