Like, if they stop improving at <1 month horizon lengths (as you say immediately above the text I quoted) that is clearly a case of LLMs hitting a wall right?
I distinguish “the LLM paradigm hitting a wall” and “the LLM paradigm running out of fuel for further scaling”.
I agree that compute and resources running out could cause this, but it’s notable that we expect ~1-month horizons not that long from now: only ~3 years away at the current rate.
Yes, precisely. Last I checked, we expected scaling to run out by 2029ish, no?
Ah, reading the comments, I see you expect there to be some inertia… Okay, 2032 / 7 more years would put us at “>1 year” task horizons. That does make me a bit more concerned. (Though 80% reliability is several doublings behind, and I expect tasks that involve real-world messiness to be even further behind.)
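For concreteness, here is a back-of-envelope sketch of the doubling math behind these timelines. The specific inputs (a ~2-hour current task horizon at 50% reliability, a ~7-month doubling time, and the hours-per-month/year conversions) are assumptions in the spirit of the METR-style numbers being discussed, not figures from this thread:

```python
import math

# Assumed starting point and growth rate (illustrative, not from the thread):
current_horizon_hours = 2.0       # ~2-hour task horizon at 50% reliability
doubling_time_years = 7 / 12      # horizon doubles roughly every 7 months

def years_until(target_hours: float) -> float:
    """Years until the task horizon reaches target_hours at the assumed rate."""
    doublings = math.log2(target_hours / current_horizon_hours)
    return doublings * doubling_time_years

one_work_month = 167.0   # ~1 month of work-hours (assumption)
one_work_year = 2000.0   # ~1 year of work-hours (assumption)

print(f"~1-month horizon in about {years_until(one_work_month):.1f} years")
print(f"~1-year horizon in about {years_until(one_work_year):.1f} years")
```

Under these assumptions the 1-month mark lands roughly 3–4 years out and the 1-year mark roughly 6 years out, which is the shape of the extrapolation above; shifting the starting horizon or the reliability threshold moves the dates but not the doubling structure.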
Okay, but surely there has to be something they aren’t getting better at (or are getting better at too slowly)?
“Ability to come up with scientific innovations” seems to be one.
Like, I expect they are getting better at the underlying skill. If you had a benchmark which measures some toy version of “produce scientific innovations” (AidanBench?), and you plotted frontier models’ performance on it against time, you would see the number going up. But it currently seems to lag way behind other capabilities, and I likewise don’t expect it to reach dangerous heights before scaling runs out.
The way I would put it, the things LLMs are strictly not improving on are not “specific types of external tasks”. What I think they’re not getting better at – because it’s something they’ve never been capable of doing – is specific cognitive algorithms that allow completing certain cognitive tasks in a dramatically more compute-efficient manner. We’ve talked about this some before.
I think that, in the limit of scaling, the LLM paradigm is equivalent to AGI, but that it’s not a very efficient way to approach this limit. And it’s less efficient along some dimensions of intelligence than along others.
This paradigm attempts to scale certain modules a generally intelligent mind would have up to ridiculous levels of power, in order to make up for the lack of other necessary modules. This will keep improving performance across all tasks, as long as you keep feeding LLMs more data and compute. But there seem to be only a few “GPT-4 to GPT-5” jumps left, and I don’t think that would be enough.