The thing is, humans are also lousy outside their training distribution. This is less obvious because our training distributions vary so much. But consider the phenomenon where some problem or technological need goes unsolved for many years, and then three groups solve it almost simultaneously. This generally happens because solving almost any hard problem requires combining about 3-5 other ideas. Consider one that takes 5. It’s pretty much impossible until 3 of those ideas have been invented and publicized. Then it’s really, really hard: you have to spot that three things are relevant and how to combine them, then come up with two separate great ideas to fill the gaps. Once 4 of them are done, the threshold drops: now you only have to spot and combine 4 things and come up with 1. And once the 5th has come along, all you have to do is spot the pieces and figure out how to put them together. So as progress continues, the problem gets drastically easier, until suddenly three groups solve it at once, by assembling the same mix of ideas, one of which is recent.
LLMs can combine things that have never previously been combined, and can thus successfully extrapolate outside the training distribution. Currently, they’re superhuman at knowing about all the ideas that have been published as of their knowledge cutoff – a breadth-of-knowledge skill where they easily outperform humans – and clearly less good at figuring out how to assemble them, or especially at inventing a new missing idea to fill a gap.
My question is, are those two skills both ones they are always going to be subhuman at, or are they just things they’re currently bad at? Their capabilities are so spiky compared to humans, it’s hard to be sure, but there are plenty of things where people said “LLMs are extremely bad at X”, and they were right at the time, but a few years and model generations later LLMs caught up, and are no longer bad at X. So I’m not going to be astonished if both of these go the same way.
Now, LLMs are very, very good at standing on the shoulders of giants, so it’s easy to mistake them for smarter than they really are. Current models still have plenty of things they’re subhuman at, as well as quite a few things they’re superhuman at, but they average out at somewhere in the rough vicinity of a grad student or an intern working for a few hours. And those are not generally the people who come up with new inventions.
Not sure why the go-to examples for out-of-distribution problems tend to be the extreme of an entirely new theory or invention. To make progress on this problem, we’d want to identify minimally-OOD problems and benchmark those, wouldn’t we?
Melanie Mitchell and collaborators showed weaknesses in LLMs on OOD string-analogy tasks built from simple perturbations of the alphabet. This seems like the sort of example we should generally be thinking about and testing, because such tasks are likely much more tractable: toy domains or simple ad hoc tasks that deviate from strong biases in the training distribution.
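To make the perturbation idea concrete, here is a minimal sketch (the exact task format is my own invention, not the published one) of a letter-string analogy generator over a randomly permuted “counterfactual” alphabet, where the model must apply a rule like “replace the last letter with its successor” in an ordering it has never seen in training:

```python
import random

def make_analogy_task(seed=0):
    """Build a letter-string analogy over a shuffled alphabet.
    Returns (prompt, answer); all details are illustrative."""
    rng = random.Random(seed)
    perm = list("abcdefghijklmnopqrstuvwxyz")
    rng.shuffle(perm)  # counterfactual alphabet ordering
    # successor map under the permuted ordering
    succ = {perm[i]: perm[i + 1] for i in range(len(perm) - 1)}

    def successor_of_last(s):
        # the hidden rule: replace the last letter with its successor
        return s[:-1] + succ[s[-1]]

    # pick two 3-letter runs from the permuted alphabet
    src = "".join(perm[rng.randrange(0, 20):][:3])
    tgt = "".join(perm[rng.randrange(0, 20):][:3])
    prompt = (f"In this alphabet: {' '.join(perm)}\n"
              f"If {src} changes to {successor_of_last(src)}, "
              f"what does {tgt} change to?")
    return prompt, successor_of_last(tgt)
```

Tasks like this stay trivially easy for a human who reads the rules, while pushing the surface statistics well away from anything in pretraining data.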
Demonstrating this with simpler, less challenging tasks should give us some idea whether this is an area that LLMs are poor-but-improving at and will sooner or later catch up on, or genuinely bad at and always will be, for some architectural reason. Sounds like a good idea, but not something I know anything about (and a bit too capabilities-related for my taste).
My fundamental rule-of-thumb on this sort of issue is that it’s conjectured that SGD with suitable hyperparameters approximates Bayesian learning. If that’s correct, then Bayesian learning is optimal, modulo issues like the training dataset, choice of priors/inductive biases, etc. So a comparative difference with humans would then have to come down to things like the quality of the approximation of Bayesian learning, the priors/inductive biases, the choice of pretraining dataset, curriculum learning effects, or architectural limitations that make certain things nigh impossible for the LLM (for example lack of continual learning, or a text-only transformer doing work on video or audio data).
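To spell out the conjecture (hedged, since it holds only under idealized assumptions about gradient noise and step sizes): the claim is that SGD with suitable hyperparameters draws weights approximately from the Bayesian posterior

```latex
p(\theta \mid D) \;\propto\; p(D \mid \theta)\, p(\theta)
```

where the architecture’s inductive biases play the role of the prior $p(\theta)$ and the pretraining corpus plays the role of the data $D$. Every item on the list above is then a knob in that formula: the prior, the data, or the quality of the approximation.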
For my example of combining preexisting ideas in the right way to solve a well-known problem: most of the impressive human examples of that involve months of work, so current LLMs’ lack of continual learning, and task horizons in the hours range, are going to make doing that nigh impossible at the moment. Humans generally work their way out of distribution slowly, one small step at a time, by gradually expanding the distribution in an interesting-looking direction: that’s what the scientific method/Bayesian learning is. Doing that without continual learning is inherently limited. So I find Jeremy Howard’s observation that LLMs are bad at this unsurprising: I think it basically reduces to two of the widely-known deficiencies that LLMs currently have (and that the industry was already busily working on).
I actually implemented my own private benchmark last year to try to test this with different models. The domain was a toy OOD task where the system had access to three possible tools that performed simple transformations on a configuration of binary values in a particular spatial arrangement. Stage 1 was exploration. The system was given a certain number of steps to probe with the tools (which were chosen randomly from a subset prior to each trial). After the experimentation stage, the system was required to use the tools to perform a transformation on a random arrangement to make it match a target one.
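A harness for this kind of two-stage trial might look roughly like the sketch below (my reconstruction from the description above; the specific grid size, tool pool, and agent interface are all hypothetical, not the actual private benchmark):

```python
import random

GRID = 4  # 4x4 arrangement of binary values

def random_grid(rng):
    return [[rng.randint(0, 1) for _ in range(GRID)] for _ in range(GRID)]

# A pool of simple spatial transformations; three are sampled per trial.
TOOL_POOL = {
    "flip_all":   lambda g: [[1 - v for v in row] for row in g],
    "rotate_cw":  lambda g: [list(row) for row in zip(*g[::-1])],
    "mirror_h":   lambda g: [row[::-1] for row in g],
    "shift_down": lambda g: [g[-1]] + g[:-1],
}

def run_trial(agent, rng, explore_steps=10):
    """Stage 1: the agent probes the (opaque) tools on grids it chooses.
    Stage 2: it must map a random start grid onto a target grid."""
    names = rng.sample(sorted(TOOL_POOL), 3)
    tools = {n: TOOL_POOL[n] for n in names}

    # Exploration: agent picks (tool, grid) pairs and observes results.
    for _ in range(explore_steps):
        name, grid = agent.probe(names)
        agent.observe(name, grid, tools[name](grid))

    # Test: the target is reachable via a short random tool sequence.
    start = random_grid(rng)
    target = start
    for name in rng.choices(names, k=3):
        target = tools[name](target)

    state = start
    for name in agent.solve(start, target, names):
        state = tools[name](state)
    return state == target
```

A brute-force searcher solves this trivially once the tool semantics are known; the interesting question is whether an LLM agent can induce those semantics from its own probes during Stage 1.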
The exercise of building a benchmark was a great learning experience for me. My main takeaway was that differences in performance were nearly all driven by differences in scaffolding, and not so much by the base model. This made me fairly disillusioned about benchmarks in general, and made me suspect that gains on benchmarks like ARC-AGI are mostly driven by scaffolding improvements. Maybe someone here has much more insight into that.
But it also made me think that the problem is probably not some far-out radically intractable problem. You mention continual learning and long time horizons. Just generally for OOD tasks, the system needs to be able to log results, generate and revise hypotheses, and carry out Bayesian updates in an iterative manner. Whether that can be cracked reliably for increasingly difficult problems with relatively straightforward scaffolding, or the base models need to be radically improved along with scaffolding, I don’t really know. Maybe for the much more difficult problems (like a Theory of Everything or a cure for the common cold) those advances are very far out. I would think though, that for simple and medium-difficulty problems, the frontier labs are already well on their way.
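The log-hypothesize-update loop described above can be sketched very simply. Here is a minimal illustrative version (the hidden-counter task and all names are mine) where an agent maintains weights over candidate hypotheses and reweights them by Bayes’ rule after each logged observation:

```python
def bayes_update(posterior, hypotheses, observation, likelihood):
    """One Bayesian update: new_posterior[h] ∝ posterior[h] * P(obs | h)."""
    weighted = {h: posterior[h] * likelihood(hypotheses[h], observation)
                for h in posterior}
    z = sum(weighted.values()) or 1e-12
    return {h: w / z for h, w in weighted.items()}

# Toy task: which increment does a hidden counter use?
hypotheses = {"step+1": 1, "step+2": 2, "step+3": 3}
posterior = {h: 1 / 3 for h in hypotheses}
log = []

def likelihood(step, obs):
    x, y = obs
    return 0.98 if y == x + step else 0.02  # allow noisy readings

for obs in [(0, 2), (2, 4), (4, 6)]:  # the hidden rule is +2
    log.append(obs)                    # log results
    posterior = bayes_update(posterior, hypotheses, obs, likelihood)

best = max(posterior, key=posterior.get)
```

The hard part for an LLM agent isn’t this arithmetic, of course; it’s generating a good hypothesis space in the first place and revising it when all current candidates score badly.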
So with decent scaffolding (search, summarization, etc.) and 1M-token context memory, one can do quite a lot even without a robust solution to continual learning? That matches the current situation for quite a lot of agentic tasks.
ARC-AGI is notorious for being insoluble without scaffolding (e.g. domain-specific languages), and strongly scaffolding-dependent with it. Scores on it do depend somewhat on model capacity, but are also strongly dependent on the effort and skill put into building scaffolding for it. What would impress me most would be a score where the model built its own scaffolding with only a small amount of human assistance (ideally, zero).
I’m not sure I would use terms like Lipschitz continuity, KL divergence, spurious oscillations, or OOD divergence to make the point, but when I compare myself in a coworker/tech lead/management role working with human software engineers before 2024 against myself as a software engineer working with LLM-powered coding assistants in 2026, there is a very clear difference in the kinds of “outside the training distribution” that come up in human-human vs human-LLM interactions: the latter is really fucking annoying and tiring in every single interaction, while the former is “it depends” (a.k.a. “hiring a team that will be a good match together”).
The agentic scaffolds of 2025+ are making it possible to work around some of the fundamental jaggedness of LLM base models, which are still complete shit at “understanding”, so we are collectively moving ever more problems into “within distribution” instead of “divergent extrapolation”, sure. So I agree it’s totally unpredictable whether LLM-powered tools will be able to automate enough tasks to become the kind of dangerous agents for which it makes sense to reason about theoretically rational instrumental goals, even if LLMs alone might remain shit at goal-orientedness forever (or we need a different architecture). But we should probably discuss the capabilities of those agentic entities, not the individual benchmark-gaming components of such entities...