I haven’t noticed anyone else come out and say it here, and I may express this more rigorously later, but, like, GPT-5 is a pretty big bear signal, right? Not just in terms of benchmarks suggesting a non-superexponential trend but also, to name a few angles/anecdata:
It did slightly worse than o3 at the first thing I tried it on (with thinking mode on)
It failed to one-shot a practical coding problem that was entirely in javascript, which is generally among its strengths (and the problem was certainly in principle solvable)
It’s hallucinating pretty obviously when I give it a 100 page or so document to read and comment on (it references lots from the document, but gets many details wrong and overfixates on others)
It’s described as a family of models with an automatic router that picks the best one for the situation, which seems like exactly what you’d do if nothing else was really giving you the sauce
The main reddit reaction seems to be that the demo graphs were atrocious, which is not exactly glowing praise
All the above paired with the fact that this is what they chose to call GPT-5, and with the fact that Claude’s latest release was a well-named and justified 0.1 increment
I’m largely with Zvi that even integrating this stuff as it already exists into the economy does some interesting stuff, and that we live in interesting times already. But other than what’s already priced in by integrations and efficiency optimizations, progress feels s-curvier to me today than it did a week ago.
Broadly agreed. My sense is that our default prediction from here should be to extrapolate out the METR horizon length trend (until compute becomes more scarce, at least) with somewhere between the more recent faster doubling time (122 days) and the slower doubling time we’ve seen longer term (213 days), rather than expecting progress substantially faster than this trend in the short term.
So, maybe I expect a ~160 day doubling trend over the next 3 years, which implies a 50% reliability horizon length of ~1 week in 2 years and ~1.5 months in 3 years. By the end of this trend, I expect small speedups due to substantial automation of engineering in AI R&D and slowdowns due to reduced availability of compute, while simultaneously, these AIs will be producing a bunch of value in the economy and AI revenue will continue to grow pretty quickly. But, 50% reliability at 1.5 month long easy-to-measure SWE tasks (or 80% reliability at week long tasks) doesn’t yield crazy automation, though such systems may well be superhuman in many ways that let them add a bunch of value.
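For concreteness, here’s a sketch of the arithmetic behind those horizon numbers. The starting 50%-reliability horizon (~2.25 work-hours at launch) and the work-hours conversions (40-hour weeks, ~4.3 weeks per month) are my assumptions, not stated above:

```python
# Sketch: extrapolating a horizon-length doubling trend.
# Assumed starting point: ~2.25 work-hour horizon at 50% reliability.
start_hours = 2.25
doubling_days = 160  # assumed doubling time from the comment above

for years in (2, 3):
    doublings = years * 365 / doubling_days
    horizon_hours = start_hours * 2 ** doublings
    weeks = horizon_hours / 40  # 40 work-hours per week
    print(f"{years} years: ~{horizon_hours:.0f} work-hours "
          f"(~{weeks:.1f} weeks, ~{weeks / 4.33:.1f} months)")
```

Under these assumptions this lands at roughly 1.3 work-weeks after 2 years and roughly 1.5 work-months after 3 years, consistent with the estimates above.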
(I think you’ll be seeing some superexponentiality in the trend due to a mix of achieving generality and AI R&D automation, but I currently don’t expect this to make much of a difference by 50% reliability at 1.5 month long horizon lengths with easy-to-measure tasks.)
But, this isn’t that s-curve-y in terms of interpretation? It’s just that progress will proceed at a steady rate rather than yielding super powerful AIs within 3 years.
Also, I think ways in which GPT-5 is especially bad relative to o3 might be more evidence about this being a bad product launch in particular rather than evidence that progress is that much slower overall.
I plan on writing more about this topic in the future.
same, I was surprised. I think my timelines … ooh man I don’t know what my mental timeline predictor did but I don’t think “get longer” is a full description. winter is a scary time because summer might come any day
Mostly agree but also the slower (pre-reasoning model) exponential on task length is certainly not falsified and even looks slightly more plausible.