Not sure why the go-to examples for out-of-distribution problems tend to be the extreme of an entirely new theory or invention. To make progress on this problem, we’d want to identify minimally-OOD problems and benchmark those, wouldn’t we?
Melanie Mitchell and collaborators showed weaknesses in LLMs on OOD tasks using simple perturbations of the alphabet in string-analogy tasks. This seems like the sort of example we should generally be thinking about and testing, because such tasks are likely much more tractable: toy domains or simple ad hoc tasks that deviate from strong biases in the training distribution.
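To make this concrete, here is a minimal sketch of the kind of perturbed-alphabet analogy item I have in mind. The names and details are mine, not Mitchell et al.'s exact setup; the key idea is that successorship is defined by position in a shuffled ordering, so the task rule is familiar but the surface alphabet is not.

```python
import random
import string

def permuted_alphabet(rng):
    """A 'counterfactual' alphabet: a random permutation of the letters.
    Successorship is defined by position in this permuted ordering, so
    the successor of the first letter is the second, whatever they are."""
    letters = list(string.ascii_lowercase)
    rng.shuffle(letters)
    return "".join(letters)

def successor(ch, alphabet):
    """The letter after `ch` within the given (possibly permuted) alphabet."""
    return alphabet[(alphabet.index(ch) + 1) % len(alphabet)]

def analogy_task(alphabet, rng):
    """Generate one 'abc -> abd; ijk -> ?' style item in the given alphabet.
    Returns (source, source_answer, target, target_answer)."""
    i = rng.randrange(len(alphabet) - 3)
    src = alphabet[i:i + 3]
    j = rng.randrange(len(alphabet) - 3)
    tgt = alphabet[j:j + 3]
    rule = lambda s: s[:2] + successor(s[2], alphabet)
    return src, rule(src), tgt, rule(tgt)
```

With the identity alphabet this is the classic in-distribution task; each shuffled alphabet is a small, controlled step away from it.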
Demonstrating this with simpler, less-challenging tasks should give us some idea whether this is an area that LLMs are poor-but-improving at and will sooner or later catch up on, or genuinely bad at and always will be, for architectural reasons. Sounds like a good idea, but not something I know anything about (and a bit too capabilities-related for my taste).
My fundamental rule of thumb on this sort of issue is the conjecture that SGD with suitable hyperparameters approximates Bayesian learning. If that's correct, then Bayesian learning is optimal, modulo issues like the training dataset, choice of priors/inductive biases, etc. So a comparative difference with humans would then have to come down to things like the quality of the approximation to Bayesian learning, the priors/inductive biases, the choice of pretraining dataset, curriculum-learning effects, or architectural limitations that make certain things nigh impossible for the LLM (for example, lack of continual learning, or a text-only transformer working on video or audio data).
For my example of combining preexisting ideas in the right way to solve a well-known problem, most of the impressive human examples of that involve months of work, so current LLMs' lack of continual learning and task horizons in the hours range make doing that nigh impossible at the moment. Humans generally work their way out of distribution slowly, one small step at a time, by gradually expanding the distribution in an interesting-looking direction: that's what the scientific method/Bayesian learning is. Doing that without continual learning is inherently limited. So I find Jeremy Howard's observation that LLMs are bad at this unsurprising: I think it basically reduces to two of the widely-known deficiencies that LLMs currently have (and which the industry was already busily working on).
I actually implemented my own private benchmark last year to try to test this with different models. The domain was a toy OOD task in which the system had access to three tools that performed simple transformations on a configuration of binary values in a particular spatial arrangement. Stage 1 was exploration: the system was given a certain number of steps to probe with the tools (a subset of which was chosen randomly from a larger pool before each trial). After the exploration stage, the system was required to use the tools to transform a random arrangement into a target one.
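The setup can be sketched roughly like this. The grid size and the particular transformations below are illustrative stand-ins I made up, not my actual implementation; the point is that tools are drawn randomly per trial and exposed only under opaque names, so their behavior has to be discovered by probing.

```python
import random

# Pool of candidate transformations on a grid of binary values.
TOOL_POOL = {
    "flip_rows": lambda g: g[::-1],
    "flip_cols": lambda g: [row[::-1] for row in g],
    "rotate":    lambda g: [list(r) for r in zip(*g[::-1])],
    "invert":    lambda g: [[1 - v for v in row] for row in g],
}

class Trial:
    def __init__(self, n_tools=3, size=3, seed=None):
        self.rng = random.Random(seed)
        # A random subset of tools is drawn before each trial, exposed
        # only as "tool_0", "tool_1", ... so behaviour must be probed.
        names = self.rng.sample(sorted(TOOL_POOL), n_tools)
        self.tools = {f"tool_{i}": TOOL_POOL[n] for i, n in enumerate(names)}
        self.grid = [[self.rng.randint(0, 1) for _ in range(size)]
                     for _ in range(size)]

    def use(self, tool_name):
        """Both stages go through this: apply a named tool to the
        current grid and observe the resulting configuration."""
        self.grid = self.tools[tool_name](self.grid)
        return self.grid
```

Scoring (did the final grid match the target within the step budget?) is then a straightforward equality check per trial.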
The exercise of building a benchmark was a great learning experience for me. My main takeaway was that differences in performance were driven almost entirely by differences in scaffolding, and much less by the base model. This made me fairly disillusioned about benchmarks in general, and made me suspect that gains on benchmarks like ARC-AGI are mostly driven by scaffolding improvements. Maybe someone here has much more insight into that.
But it also made me think that the problem is probably not some far-out, radically intractable one. You mention continual learning and long time horizons. More generally, for OOD tasks the system needs to be able to log results, generate and revise hypotheses, and carry out Bayesian updates iteratively. Whether that can be cracked reliably for increasingly difficult problems with relatively straightforward scaffolding, or whether the base models need to be radically improved along with the scaffolding, I don't really know. Maybe for the much more difficult problems (like a Theory of Everything or a cure for the common cold) those advances are very far out. I would think, though, that for simple and medium-difficulty problems the frontier labs are already well on their way.
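The log/hypothesize/update loop I have in mind can be sketched very simply, assuming (my simplifications) a finite hypothesis set and near-deterministic outcomes; all names here are hypothetical:

```python
def bayesian_update(prior, hypotheses, action, observation, noise=0.05):
    """One update step: reweight each hypothesis by how well it
    predicted the observed outcome of the last action."""
    likelihood = {
        name: 1.0 if h(action) == observation else noise
        for name, h in hypotheses.items()
    }
    unnorm = {name: prior[name] * likelihood[name] for name in prior}
    total = sum(unnorm.values())
    return {name: w / total for name, w in unnorm.items()}

# Toy example: is an unknown tool doubling its input or adding one?
hypotheses = {"doubles": lambda x: 2 * x, "adds_one": lambda x: x + 1}
posterior = {"doubles": 0.5, "adds_one": 0.5}
log = [(3, 6), (5, 10)]          # (action, observed result) pairs
for action, observation in log:
    posterior = bayesian_update(posterior, hypotheses, action, observation)
```

After two consistent observations the posterior concentrates on "doubles"; the scaffolding's job is to keep the log, propose the hypothesis set, and run this loop across many steps.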
So with decent scaffolding (search, summarization, etc.) and 1M-token context memory, one can do quite a lot even without a robust solution to continual learning? That matches the current situation for quite a lot of agentic tasks.
ARC-AGI is notorious for being insoluble without scaffolding (e.g. domain-specific languages), and strongly scaffolding-dependent with it. Scores on it do depend somewhat on model capacity, but are also strongly dependent on the effort and skill put into building the scaffolding. What would impress me most would be a score where the model built its own scaffolding with only a small amount of human assistance (ideally, zero).