A 23.5x improvement alone seems like it would qualify as a major explosion if it happened in a short enough period of time.
Seems about true. I claim that the nanoGPT speedrun suggests this is only likely if future AI labor is exponentially faster at doing research than current humans, with many caveats of course, and I don’t really have an opinion on that.
We already know that there is of course a fundamental limit to how fast you can make an algorithm, so the question is always “how close to optimal are current algorithms”. It should be our very strong prior that any small subset of frontier model training will hit diminishing returns much quicker than the complete whole.
This is not as small a subset of training as you might think. The 53 optimizations in the nanoGPT speedrun touched basically every part of the model, including the optimizer, embeddings, attention, other architectural details, quantization, hyperparameters, code optimizations, and PyTorch version. The two main things that limit a comparison to frontier AI are scale and data improvement. It’s known that there are many tricks that work at large scale but not at small scale. If you believe the initial 15x speedup is analogous and that the larger scale gives you a further multiplier on top of it, then maybe we get something like a 100x speedup atop our current algorithms? But I don’t really believe that the original nanoGPT, which was a 300-line repo written to be readable rather than efficient [1], is analogous to our current state. If there were a bunch of low-hanging fruit that could give strongly superlinear returns, we would see 3x/year efficiency gains with small increases in labor or compute over time, but we actually require a 5x/year compute increase and a ~3x/year labor increase.
A software intelligence explosion is completely possible with linear speedups in cumulative effort. Indeed, it is possible with sublinear increases in cumulative effort.
Agree I was being a bit sloppy here. Whether the derivative is infinite is not what matters in Davidson’s model or in my mind; what matters is whether the pace of progress accelerates or decelerates. Progress could still be very fast as it decelerates, but I’m not really thinking in enough detail to model these borderline cases, so maybe we should think of the threshold for very fast software-driven progress as r > 0.75 or something rather than r > 1.
I didn’t really define software intelligence explosion, but had something in mind like “self-reinforcing gains from automated research causing capabilities gains in 6 months to be faster than the labor/compute scaleup-driven gains in the 3 years from 2023-2025”, and the question I was targeting with the second part was “After the initial speed-up from ASARA, does the pace of progress accelerate or decelerate as AI progress feeds back on itself?”
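To make the accelerate-versus-decelerate distinction concrete, here is a toy sketch. It is not Davidson’s actual model; it assumes that software level A scales as cumulative research effort I raised to the power r, and that effort accrues at a rate proportional to A (automated researchers on fixed compute), so dI/dt = c * I**r. The function name, the constant `c`, and the starting effort are all hypothetical choices for illustration.

```python
import math


def software_doubling_times(r, n_doublings=5, c=1.0, initial_effort=1.0):
    """Time taken by each successive doubling of software level A.

    Toy assumptions (illustrative, not taken from the thread):
      * A = I**r, where I is cumulative research effort
      * effort accrues at a rate proportional to A: dI/dt = c * I**r
    """
    if r <= 0:
        raise ValueError("r must be positive")

    # Doubling A requires multiplying cumulative effort by 2**(1/r).
    step = 2.0 ** (1.0 / r)

    times = []
    effort = initial_effort
    for _ in range(n_doublings):
        next_effort = effort * step
        if math.isclose(r, 1.0):
            # dI/dt = c * I integrates to a logarithm.
            dt = math.log(next_effort / effort) / c
        else:
            # Closed-form integral of dI / (c * I**r) between the two effort levels.
            dt = (next_effort ** (1.0 - r) - effort ** (1.0 - r)) / (c * (1.0 - r))
        times.append(dt)
        effort = next_effort
    return times


if __name__ == "__main__":
    for r in (0.75, 1.0, 1.5):
        times = software_doubling_times(r)
        print(f"r={r}: " + ", ".join(f"{t:.3f}" for t in times))
```

Under these assumptions, each doubling takes 2**((1 - r) / r) times as long as the previous one. Doubling times shrink when r > 1, stay constant at r = 1, and grow when r < 1. At r = 0.75 each doubling takes about 26% longer than the last, which is deceleration but not a rapid stall, in line with the point that progress could still be very fast as it decelerates.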
[1]: “In addition, llm.c still has a lot of pending optimizations and people haven’t tried to tune the training in the style of cramming, so I’d say we’re likely to see significant improvements on this number.”
Cool, this clarifies things a good amount for me. Still have some confusion about how you are modeling things, but I feel less confused. Thank you!