I don’t think “interpolate/extrapolate” is that useful a framing for prediction purposes. It has utility, but this piece tries to say too much with it.
It’s an ML classic, sure. But given the dimensionality involved? For any “real” unseen task, some aspects of it will fall in the “interpolation” regime, and others will inevitably fall outside the hull of the training data and into the “extrapolation” regime. “Outside of distribution” gets murky fast as dimensionality increases.
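To make the dimensionality point concrete, here is a toy numpy sketch of my own (not from any specific source): even for points drawn from the *same* distribution as a “training set”, the chance of landing inside the training data’s per-coordinate bounding box (a necessary condition for being inside its convex hull) collapses as dimensionality grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test = 50, 10_000
fractions = {}

for d in (1, 5, 20, 100):
    train = rng.random((n_train, d))   # "training data"
    tests = rng.random((n_test, d))    # "unseen tasks", same distribution
    lo, hi = train.min(axis=0), train.max(axis=0)
    # Fraction of same-distribution test points inside the training
    # bounding box; being inside the box is necessary (not sufficient)
    # for being inside the convex hull, so this is an overestimate of
    # the fraction in the "interpolation" regime.
    inside = np.all((tests >= lo) & (tests <= hi), axis=1).mean()
    fractions[d] = inside
    print(f"d={d:3d}: fraction inside training bounding box = {inside:.3f}")
```

With these settings the inside-the-box fraction starts near 1 at d=1 and collapses toward zero by d=100, even though every test point comes from the training distribution. A true convex-hull test (e.g. an LP feasibility check) would only make the collapse sharper.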
Thus, it’s nigh impossible to truly disentangle poor LLM performance into “failure to interpolate” and “failure to extrapolate”. It’s easy to make the case, but hard to prove it. “LLMs are fundamentally worse at extrapolation than humans are” remains an untested assumption.
It could be outright false or outright true. Or it could be true under current scales and training methods yet false at 2028 SOTA: a quantitative gap, the way 85-IQ humans are notably worse than average at extrapolation. The case for “outright true” is overstated.
One common practical example of a lasting LLM deficiency is spatial reasoning. Why do LLMs perform so poorly at spatial reasoning and “commonsense physics” tasks like those in SimpleBench?
Wrong architecture for the job—something like insufficient depth? Inability to take advantage of test-time compute? Failure to extrapolate from text-only training data? Failure to interpolate from the sparse examples of spatial reasoning in the training data? Lack of spatial reasoning priors that humans get from evolved brain wiring? Insufficient scale to converge to a robust world physics model despite the other deficiencies?
We have interrogated the question, and we have some hints, but no exact answer. Multiple types of interventions improve spatial reasoning performance in practice, but none have yet attained human-level spatial reasoning in LLMs.
It doesn’t seem to be as neat and simple a story as “LLMs are inherently poor extrapolators”, given what’s known so far. And as long as SOTA performance keeps improving generation to generation, I’m not going to put a lot of weight on “the bottleneck is fundamental”.
If you actually look at the number of bits of training data the human brain receives from birth to adulthood, a huge proportion of them are visual data. So I’m not surprised that we’re comparatively good at 3D (and our nervous system very likely also has some good inductive priors for it). I suspect the answer for LLMs is mostly just multimodal models trained on a vast amount of video training data — expensive, though the cost is reducible somewhat by coming up with smarter ways to tokenize video.