Jeremy Howard was recently[1] interviewed on the Machine Learning Street Talk podcast: YouTube link, interactive transcript, PDF transcript.
Jeremy co-invented LLMs in 2018, and taught the excellent fast.ai online course which I found very helpful back when I was learning ML, and he uses LLMs all the time, e.g. 90% of his new code is typed by an LLM (see below).
So I think his “bearish”[2] take on LLMs is an interesting datapoint, and I’m putting it out there for discussion.
Some relevant excerpts from the podcast, focusing on the bearish-on-LLM part, are copied below! (These are not 100% exact quotes, instead I cleaned them up for readability.)
So you know Piotr Woźniak, who’s a guy I really respect, who kinda rediscovered spaced repetition learning, built the SuperMemo system, and is the modern day guru of memory: The entire reason he’s based his life around remembering things is because he believes that creativity comes from having a lot of stuff remembered, which is to say, putting together stuff you’ve remembered in interesting ways is a great way to be creative.
LLMs are actually quite good at that.
But there’s a kind of creativity they’re not at all good at, which is, you know, moving outside the distribution….
You have to be so nuanced about this stuff because if you say “they’re not creative”, it can give the wrong idea, because they can do very creative seeming things.
But if it’s like, well, can they really extrapolate outside the training distribution? The answer is no, they can’t. But the training distribution is so big, and the number of ways to interpolate within it is so vast, that we don’t really know yet what the limitations of that are.
But I see it every day, because my work is R&D. I’m constantly on the edge of and outside the training data. I’m doing things that haven’t been done before. And there’s this weird thing, I don’t know if you’ve ever seen it before, but I see it multiple times every day, where the LLM goes from being incredibly clever to, like, worse than stupid, like not understanding the most basic fundamental premises about how the world works. And it’s like, oh, whoops, I fell outside the training data distribution. It’s gone dumb. And then, like, there’s no point having that discussion any further because you’ve lost it at that point.
…
I mean, I think they can’t go outside their distribution because it’s just something that that type of mathematical model can’t do. I mean, it can do it, but it won’t do it well.
You know, when you look at the kind of 2D case of fitting a curve to data, once you go outside the area that the data covers, the curves disappear off into space in wild directions, you know. And that’s all we’re doing, but we’re doing it in multiple dimensions. I think Margaret Boden might be pretty shocked at how far “compositional creativity” can go when you can compose the entirety of the human knowledge corpus. And I think this is where people often get confused, because it’s like—
So for example, I was talking to Chris Lattner yesterday about how Anthropic had got Claude to write a C compiler. And they were like, “oh, this is a clean-room C compiler. You can tell it’s clean-room because it was created in Rust.” So, Chris created the, I guess it’s probably the most widely used C / C++ compiler nowadays, Clang, on top of LLVM, which is the most widely used kind of foundation for compilers. They’re like: “Chris didn’t use Rust. And we didn’t give it access to any compiler source code. So it’s a clean-room implementation.”
But that misunderstands how LLMs work. Right? Which is: all of Chris’s work was in the training data. Many many times. LLVM is used widely and lots and lots of things are built on it, including lots of C and C++ compilers. Converting it to Rust is an interpolation between parts of the training data. It’s a style transfer problem. So it’s definitely compositional creativity at most, if you can call it creative at all. And you actually see it when you look at the repo that it created. It’s copied parts of the LLVM code, which today Chris says like, “oh, I made a mistake. I shouldn’t have done it that way. Nobody else does it that way.” Oh, wow. Look. The Claude C compiler is the only other one that did it that way. That doesn’t happen accidentally. That happens because you’re not actually being creative. You’re actually just finding the kind of nonlinear average point in your training data between, like, Rust things and building compiler things.
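As a quick illustration of the 2D curve-fitting point from a couple of paragraphs up: fit a high-degree polynomial to data sampled only on [0, 1], then evaluate it outside that interval. A minimal numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: a smooth curve sampled only on [0, 1].
x_train = np.linspace(0.0, 1.0, 20)
y_train = np.sin(2 * np.pi * x_train) + 0.05 * rng.standard_normal(20)

# A degree-9 polynomial fits the covered interval closely...
fit = np.poly1d(np.polyfit(x_train, y_train, deg=9))
inside = fit(0.5)   # within the data's range: a sensible value

# ...but shoots off into space as soon as we leave it.
outside = fit(2.0)  # outside [0, 1]: the magnitude explodes
print(inside, outside)
```

The same thing happens in whichever direction you leave the data; the high-dimensional version is harder to visualize but no kinder.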
…
I’m much less familiar with math than I am with computer science, but mathematicians I’ve talked to tell me that that’s also what’s happening with, like, Erdős problems and stuff. Some of them are newly solved. But they are not sparks of insight. You know, they’re solving ones that you can solve by mashing up together very closely related things that humans have already figured out.
…
The thing is, none of these guys have been software engineers recently. I’m not sure Dario’s ever been a software engineer at all. Software engineering is an unusual discipline, and a lot of people mistake it for being the same as typing code into an IDE. Coding is another one of these style transfer problems. You take a specification of the problem to solve, and you can use your compositional creativity to find the parts of the training data which, interpolated between them, solve that problem, and interpolate that with the syntax of the target language, and you get code.
There’s a very famous essay by Fred Brooks written many decades ago, No Silver Bullet, and it almost sounded like he was talking about today. He was pointing to something very similar, which is, in those days, it was all like, “oh, what about all these new fourth generation languages and stuff like that, you know, we’re not gonna need any software engineers anymore, because software is now so easy to write, anybody can write it”. And he said, well, he guessed that you could get at maximum a 30% improvement. He specifically said a 30% improvement in the next decade, but I don’t think he needed to limit it that much. Because the vast majority of work in software engineering isn’t typing in the code.
So in some sense, parts of what Dario said were right: for quite a few people now, most of their code is being typed by a language model. That’s true for me. Say, like, maybe 90%. But it hasn’t made me that much more productive, because that was never the slow bit. It’s also helped me with kind of the research a lot and figuring out, you know, which files are gonna be touched.
But any time I’ve made any attempt at getting an LLM to, like, design a solution to something that hasn’t been designed lots of times before, it’s horrible. Because what it actually gives me, every time, is the design of something that looks on its surface a bit similar. And often that’s gonna be an absolute disaster, because it only looks similar on the surface, and I’m literally trying to create something new to get away from the similar thing. It’s very misleading.
…
The difference between pretending to be intelligent and actually being intelligent is entirely unimportant, as long as you’re in the region in which the pretense is actually effective. So it’s actually fine, for a great many tasks, that LLMs only pretend to be intelligent, because for all intents and purposes, it just doesn’t matter, until you get to the point where it can’t pretend anymore. And then you realize, like, oh my god, this thing’s so stupid.
[1] The podcast was released March 3 2026. Not sure exactly when it was recorded, but it was definitely within the previous month, since they talk about a blog post from Feb. 5.

[2] I mean, he’s “bearish” compared to the early-2026 lesswrong zeitgeist—which really isn’t saying much!
I’ve had a similar experience in trying to have research discussions with LLMs. Every time I poke at my own conceptual confusion on a topic they just seem to kind of break down: saying inconsistent stuff in loops, retreating back to what has already been said on the topic. They’re even worse than this, since they also often get really basic stuff wrong. E.g., just the other day Claude told me that the k-complexity of a random string was the same as that of a crystal. This was in the context of a conversation that was probably confusing for it, where I was trying to more deeply grok, and so really push on, the confusions around complexity measures; still, it’s pretty revealing (imo) that this happens. Overall, LLMs seem pretty incoherent to me, and incapable of having “real,” “novel,” or “scientific” thoughts; I don’t feel like I can trust them with anything important.
But I’m always wondering if it is me who is crazy here, as my social environment seems to believe that LLMs are formidable forces of intellect, getting better by the year. My own sense-making of this situation is similar to Jeremy’s: it does seem like something is getting better, just something more along the lines of ~”filling in between the lines of what is already known” and less “raw intelligence,” whatever that is. But it’s of course hard to talk about any of these things, or to even really know what the difference is, and so on. Hearing more and more hype about LLMs getting better at coding, and not being much of a coder myself, I have been worrying that my own experience isn’t very representative. Maybe you can just get excellence out if you train super hard on a given domain, I don’t know. But also, maybe people are pointing at the same sort of thing when they say LLMs are “good at coding” as they are when they say they are getting smarter. So it’s an interesting data point for me, to see Jeremy describe it as such here.
This is not that far off from my own experience, including the part about wondering whether I’m crazy / whether it’s just me / whether there’s something I’m missing.
(Except for the “not being much of a coder” thing—I do a lot of coding, and have for many years, and the hype is very confusing to me. I’ve been using coding assistants on and off for around a year now, starting with Sonnet 3.7; at the time, they offered me nothing more than an incremental speed-up in certain atypically well-scoped tasks that I myself was relatively bad at, whereas now, with the latest models and harnesses, they… still offer me an incremental speed-up in certain atypically well-scoped tasks that I myself am relatively bad at. I’ve actually stopped using them entirely, recently, because I got fed up with having to maintain the cruft they wrote.)
That said, I do find LLMs very useful in a variety of ways, and have even found it helpful to discuss research with them at times.
I’m not sure I agree with the notion that they’re only good “inside the training distribution,” because it’s not clear what counts as “being inside the training distribution” if we’re conceding that LLMs can synthesize attributes which each appeared somewhere in training but never co-occurred in a single training example. Once the concept of “inside the training distribution” has been widened to include “everything anyone has ever written and anything that can be formed by ‘recombining’ those texts in arbitrarily abstract ways,” what doesn’t count as inside the training distribution? Of course one can always point at a failure post-hoc and say “oh, I must have strayed outside the distribution,” but this explains nothing unless the failures can be predicted in advance, and it’s not clear to me that they can.
The most useful heuristic I have for when today’s LLMs seem brilliant vs. dumb is, instead, that their failures are very often the result of them not having sufficient context about what’s going on and what exactly you need from them[1].
The lack of context is so often the bottleneck that, these days, I basically think of Claude as being strictly smarter than me in every way provided that Claude knows all the relevant information I know… which, of course, Claude never really does. But the more I supply that info to Claude—the closer I get to that asymptotic limit of Claude really, truly knowing exactly what my current research problem is and exactly what dead ends I’ve tried and why I think they failed etc. etc.—the closer Claude gets to matching, or exceeding, my own level of competence.
And so I’m always thinking to myself, these days, about how to “get context into Claude.” If I’m doing something on a computer instead of asking Claude to do it, it’s because I have decided that getting the requisite context into Claude would be more onerous than just doing the whole thing myself. (Which is usually the case in practice.)
And this observation makes me feel like I’m going crazy when I see all the hype about LLM agents, about long-running autonomy and “one-shotting” and so on.
Because: “getting context into Claude” is not a task that Claude is very good at doing!
The reason for this is intuitive: it’s a cold start problem, a bootstrap paradox, whatever you want to call it. Claude is weak and unreliable until it has enough context—which means that if you let Claude handle the task of giving itself context, it will do so weakly and unreliably. And since its performance at everything else is so strongly gated on this one foundational step, executing that step weakly and unreliably will have catastrophic consequences.
Instead, I tend to keep things on a pretty tight leash—with a workflow that looks more like a carefully pre-designed and inflexible sequence of stages than an “agent” doing whatever it feels like—and I do a lot of verbose and detailed writing work to spell out everything that matters, in detail.
This is true both when I’m chatting directly with Claude (or another LLM), and when I’m writing code that uses LLMs to process data or make decisions. In both cases, I write very long and detailed messages (or message templates), in which I put a lot of effort into things like “clarifying potential confusions before they arise” and “including arguably extraneous information because it helps ‘set the scene’ in which the work is taking place” and “giving the LLM tips on how to think about the problem and how to check its work to verify it isn’t making a mistake I’ve seen it make before.”
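For concreteness, here is a minimal sketch of the kind of message template I mean. The section names and example fields are hypothetical, just to show the shape:

```python
# A hypothetical prompt template: verbose, front-loaded with context,
# and explicit about previously observed failure modes.
TASK_TEMPLATE = """\
## Background (read before doing anything)
{project_background}

## What I am actually asking for
{task_description}

## Potential confusions, clarified up front
{anticipated_confusions}

## Mistakes I've seen models make here before (please self-check)
{known_failure_modes}

## Extra scene-setting that may look extraneous but matters
{extra_context}
"""

def build_message(**fields: str) -> str:
    """Fill in the template; raises KeyError if a context section is missing."""
    return TASK_TEMPLATE.format(**fields)

msg = build_message(
    project_background="We are refactoring the data-loading pipeline ...",
    task_description="Rewrite the loader so it streams instead of buffering.",
    anticipated_confusions="'Loader' means our internal class, not torch's.",
    known_failure_modes="Do not silently change the on-disk format.",
    extra_context="This runs nightly on a machine with 4 GB of RAM.",
)
```

The point is not these particular sections; it's that the template forces me to actually write the context down every time, rather than hoping the model infers it.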
When this works, it really works; I have seen Claude perform some pretty remarkable feats while inside this kind of “information-rich on-rails experience,” ones that impressed me much more than any of the high-autonomy agentic one-shotting stuff that the hype is focused on[2]. But it is a very different approach from the sort of thing that is getting hyped, and it requires a lot of upfront manual effort that often isn’t worth it.
Arguably this is sort of like the “training distribution” story, except adjusted to take in-context learning into account.
You can teach the LLM new tricks without re-training it—but you need to give it enough information to precisely specify the tricks in question and distinguish them from the many other slightly different tricks which you, or someone else, might hypothetically have wanted it to learn instead. After all, it has been trained so that it could work with many different users who all want different things, and it has no way of “determining which one of the users you are” except by the use of distinguishing factors that you provide to it.
More speculatively, I think even some of the cases when the LLM says something really dumb or vacuous—as opposed to “correctly solving a task, but the wrong one”—might really be further instances of “correctly solving the wrong task,” only at a different level of abstraction.
For all it knows at the outset, you might be the sort of person who would be most satisfied by some hand-wavey low-effort bluster that’s phrased in an eye-catching manner but which seems obviously stupid to a reader with sufficient expertise if they’re paying enough attention. So there’s an element of “proving yourself to the LLM,” of demonstrating in-context that you are that expert-who’s-paying-attention as opposed to, like, the median LM Arena rater or something.
To clarify, I mean that the stuff I’ve seen is more impressive specifically because I can somewhat-reliably elicit it once I’ve done all the laborious setup work, whereas when I try to skip that work and just let Claude Code make all the decisions, it usually makes those decisions wrong and ends up lost in some irrelevant cul-de-sac, beating its head against an irrelevant wall.
If the hype were actually representative—if the agent actually could do the equivalent of my “laborious setup work” and bootstrap itself into a state where it knows enough to meaningfully contribute—that would of course be a very different story.
Fwiw I similarly still experience them to be bad at coming up with useful novel math research ideas, even as they’ve gotten much more competent at coding. Though they aren’t great at coding yet either.
However, I don’t think this ‘filling in the blanks’ is something fundamentally different in kind from ‘raw intelligence’. I don’t think there’s a hard boundary here. Anything that isn’t a literal lookup table is applying algorithms to extrapolate what it knows to new situations. Even something as minor as changing the tense of a memorised sentence is novel invention of a sort, just a tiny little bit. I think current LLMs can’t extrapolate as far as some humans yet, but the average distance they can extrapolate over seems to me to have increased over time. They’re still bad at coming up with novel math research ideas now, but three years ago they were much worse.
Separately from this, LLMs just know a lot of things most humans don’t, which can make them a value add to some intellectual tasks even if they can’t extrapolate the things they know very far.
I agree that the AIs are pretty bad at handling conceptually confusing stuff. I basically think of them as being incredibly knowledgeable, not that smart, and having huge amounts of intuition on how to program (mostly due to their knowledge and their having read huge amounts of code).
My guess is that for any reasonable operationalization of “raw intelligence”, they’re getting smarter?
These feel like very different unrelated statements to me (not sure if you meant to imply they are connected). I think you can do real chunks of novel/scientific thought while being too incoherent to see it all the way through.
I’m not sure how you’re defining “real”/”novel”/”scientific” thoughts. I’m pretty sure they can and do, the thing they don’t do is persistently and strategically follow through on them and string them together in a useful way.
The thing is, humans are also lousy outside their training distribution. This is less obvious because our training distributions vary so much. But the phenomenon where some problem or technological need has been unsolved for many years, and then three groups solve it almost simultaneously, is generally because solving almost any hard problem requires combining about 3-5 other ideas. Consider one that takes 5. It’s pretty much impossible until 3 of those ideas have been invented and publicized. Then it’s really, really hard: you have to spot that three things are relevant and how to combine them, then come up with two separate great ideas to fill the gaps. But once 4 of them are done, the threshold drops: now you only have to spot and combine 4 things and come up with 1, and once the 5th one has come along, all you have to do is spot the pieces and figure out how to put them together. So as progress continues, the problem gets drastically easier. And then suddenly three groups solve the same problem, by assembling the same mix of ideas, one of which is recent.
LLMs can combine things that have never previously been combined in new ways, and can thus successfully extrapolate outside the training distribution. Currently, they’re superhuman at knowing about all the ideas that have been come up with as of their knowledge cutoff – that’s a breadth-of-knowledge skill where they easily outperform humans – and clearly less good at figuring out how to assemble them, or especially at inventing a new missing idea to fill in gaps.
My question is, are those two skills both ones they are always going to be subhuman at, or are they just things they’re currently bad at? Their capabilities are so spiky compared to humans, it’s hard to be sure, but there are plenty of things where people said “LLMs are extremely bad at X”, and they were right at the time, but a few years and model generations later LLMs caught up, and are no longer bad at X. So I’m not going to be astonished if both of these go the same way.
Now, LLMs are very, very good at standing on the shoulders of giants. So it’s easy to mistake them for smarter than they really are. But current models still have plenty of things they’re subhuman at, as well as quite a few things they’re superhuman at. They average out at somewhere in the rough vicinity of a grad student, or an intern working for a few hours. Those are not generally the people who come up with new inventions.
Not sure why the go-to examples for out-of-distribution problems tend to be the extreme of an entirely new theory or invention. To make progress on this problem, we’d want to identify minimally-OOD problems and benchmark those, wouldn’t we?
Melanie Mitchell and collaborators showed weaknesses in LLMs on OOD tasks with simple perturbations of the alphabet in string-analogy tasks. This seems like the sort of example we should generally be thinking about and testing, because such cases are likely much more tractable: toy domains or simple ad hoc tasks that deviate from strong biases in the training distribution.
Demonstrating this with simpler, less-challenging tasks should give us some idea whether this is an area that LLMs are poor-but-improving at and will sooner or later catch up on, or genuinely bad at and always will be for some architectural reason. Sounds like a good idea, but not something I know anything about (and a bit too capabilities-related for my taste).
My fundamental rule-of-thumb on this sort of issue is that it’s conjectured that SGD with suitable hyperparameters approximates Bayesian learning. If that’s correct, then Bayesian learning is optimal, modulo issues like training dataset, choice of priors/inductive biases, etc. So a comparative difference with humans would then have to come down to things like the quality of approximation of Bayesian learning, the priors/inductive biases, the choice of pretraining dataset, curriculum learning effects, or architectural limitations that make certain things nigh impossible for the LLM (for example lack of continual learning, or a text only transformer doing work on video or audio data).
For my example of combining preexisting ideas in the right way to solve a well-known problem, most of the impressive human examples of that involve months of work, so current LLMs’ lack of continual learning, and task horizons in the hours range, are going to make doing that nigh impossible at the moment. Humans generally work their way out of distribution slowly, one small step at a time, by gradually expanding the distribution in an interesting-looking direction: that’s what the scientific method/Bayesian learning is. Doing that without continual learning is inherently limited. So I find Jeremy Howard’s observation that LLMs are bad at this unsurprising: I think it basically reduces to two of the widely-known deficiencies that LLMs currently have (and that the industry was already busily working on).
I actually implemented my own private benchmark last year to try to test this with different models. The domain was a toy OOD task where the system had access to three possible tools that performed simple transformations on a configuration of binary values in a particular spatial arrangement. Stage 1 was exploration. The system was given a certain number of steps to probe with the tools (which were chosen randomly from a subset prior to each trial). After the experimentation stage, the system was required to use the tools to perform a transformation on a random arrangement to make it match a target one.
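To make the setup concrete, here is a rough sketch of what a single trial might look like; the tools, the bit-string representation, and the search procedure are all invented for illustration, not the actual benchmark:

```python
import itertools
import random

# Hypothetical tools: simple transformations on a row of binary values.
def flip_all(cfg):
    return tuple(1 - b for b in cfg)

def rotate(cfg):
    return cfg[-1:] + cfg[:-1]

def swap_halves(cfg):
    mid = len(cfg) // 2
    return cfg[mid:] + cfg[:mid]

TOOLS = {"flip_all": flip_all, "rotate": rotate, "swap_halves": swap_halves}

def solve(start, target, max_depth=4):
    """Brute-force search for a tool sequence mapping start to target.

    In the benchmark itself, the model would have to discover what the
    tools do during the exploration stage and then plan a sequence like
    this on its own; this oracle solver just verifies a solution exists.
    """
    for depth in range(max_depth + 1):
        for names in itertools.product(TOOLS, repeat=depth):
            cfg = start
            for name in names:
                cfg = TOOLS[name](cfg)
            if cfg == target:
                return list(names)
    return None

random.seed(0)
start = tuple(random.randint(0, 1) for _ in range(6))
target = rotate(flip_all(start))   # reachable in two tool applications
plan = solve(start, target)
print(plan)
```

The scoring question is then how efficiently the system's exploration stage lets it find such a plan itself, rather than whether one exists.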
The exercise of building a benchmark was a great learning experience for me. My main takeaway was that differences in performance were nearly all driven by differences in scaffolding, and not so much the base model. This made me fairly disillusioned about benchmarks in general. Made me suspect that gains in benchmarks like ARC-AGI are mostly driven by scaffolding improvements. Maybe someone here has much more insight into that.
But it also made me think that the problem is probably not some far-out radically intractable problem. You mention continual learning and long time horizons. Just generally for OOD tasks, the system needs to be able to log results, generate and revise hypotheses, and carry out Bayesian updates in an iterative manner. Whether that can be cracked reliably for increasingly difficult problems with relatively straightforward scaffolding, or the base models need to be radically improved along with scaffolding, I don’t really know. Maybe for the much more difficult problems (like a Theory of Everything or a cure for the common cold) those advances are very far out. I would think though, that for simple and medium-difficulty problems, the frontier labs are already well on their way.
So with decent scaffolding (search, summarization, etc.) and 1M-token context memory, one can do quite a lot even without a robust solution to continual learning? That matches the current situation for quite a lot of agentic tasks.
ARC-AGI is notorious for being insoluble without scaffolding (e.g. domain-specific languages), and strongly scaffolding-dependent with it. Scores on it do depend somewhat on model capacity, but are also strongly dependent on the effort and skill put into building scaffolding for it. What would impress me most would be a score where the model built its own scaffolding with only some small amount of human assistance (ideally, zero).
I’m not sure I would use terms like Lipschitz continuity, KL divergence, spurious oscillations, OOD divergence, or something else that would highlight the point. But when I imagine myself in a coworker / tech lead / management role working with human software engineers before 2024, vs. myself as a software engineer working with LLM-powered coding assistants in 2026, there is a very clear difference in the kinds of “outside” with regard to the training distribution in human-human vs. human-LLM interactions: the latter is really really fucking annoying tiring shit in every single interaction, while the former is “it depends” (a.k.a. “hiring a team that will be a good match together”).
The agentic scaffolds of 2025+ are making it possible to work around some of the fundamental jaggedness of LLM base models, which are still complete shit at “understanding”, so we are collectively moving ever more problems into “within distribution” instead of “divergent extrapolation”. So I agree it’s totally unpredictable whether LLM-powered tools will be able to automate tasks enough to become the kind of dangerous agents for which it makes sense to reason about theoretic-rational instrumental goals, even if LLMs alone might remain shit at goal-orientedness forever (or we need a different architecture). But we should probably discuss the capabilities of those agentic entities, not individual benchmark-gaming components of such entities...
I think Jeremy is pointing to something real, but I think “interpolation versus extrapolation” in the strict mathematical sense is not quite the correct pointer to what he’s observing.
For one thing, in high-dimensional space, everything is extrapolation.
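A quick way to see this: take even the most generous notion of “inside the training data,” namely falling within the per-coordinate min/max range of the training points (a box much larger than the convex hull, which is where interpolation in the strict sense lives). Fresh samples fall outside that box more and more often as dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def frac_inside_bbox(dim, n_train=100, n_test=2000):
    """Fraction of fresh uniform points inside the training set's bounding box."""
    train = rng.random((n_train, dim))
    lo, hi = train.min(axis=0), train.max(axis=0)
    test = rng.random((n_test, dim))
    inside = np.all((test >= lo) & (test <= hi), axis=1)
    return inside.mean()

for dim in (2, 100, 500):
    print(dim, frac_inside_bbox(dim))
# The fraction decays roughly like ((n_train - 1) / (n_train + 1)) ** dim,
# so in high dimension nearly every new point is an "extrapolation" point.
```

(And the convex hull, the actual interpolation region, is vastly smaller still than this bounding box.)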
For another thing, consider that billions of humans over thousands of years invented language, math, science, technology, and everything in the $100T economy, 100% autonomously and from scratch, with no new training data dropped from the heavens. Is that “good extrapolation”? Not really. I don’t think there’s anything in the 30000 BC “training data” that would be “extrapolated” into the ability to travel to the moon and build advanced microchips and so on. I think the metaphor should be more like “creation” or “building”, not “extrapolation”.
The vastness of the training distribution is certainly one feature of the AI situation. But another is an army of human developers of AI, eager to discover what isn’t in the training distribution, and what the AIs can’t currently do, so they can figure out how to give the AIs those new capabilities.
Is there any argument that LLMs, turned into recurrent networks via chain of thought, will still have inherent limitations when compared with humans?
There are various arguments laid out by various people (Ilya, Nathan Lambert, Richard Sutton, Steven Byrnes, Yann LeCun, Jeremy Howard (this post), Dwarkesh, as well as many others), which more or less have the vibe of “lack of conceptual innovation/OOD generalization/continual learning”. I also personally believe a version of this, but think it’s not as relevant to timelines discussions as one may suspect.
I’d be keen to spell out the case I believe in, as well as my model of what the discussion around such claims tends to look like. But your question is sufficiently vague that I don’t have a good understanding of your model and what would be cruxy for you.
P.S. By referring to LLMs as RNNs via chain-of-thought, I’m assuming you’re alluding to the results that the transformer forward pass is in TC^0 and that the follow-up CoT is P-complete (assuming a polynomial number of intermediary tokens). But most of the above arguments aren’t blocked by that (i.e. traditional RNNs are also P-complete, but no one argues they’re AGI-complete).
I don’t think “interpolate/extrapolate” is that useful of a framing, for prediction purposes. It has utility, but this piece tries to say too much with it.
It’s an ML classic, sure. But given the dimensionality involved? For any “real” unseen task, some aspects of it will be in “interpolation” regime, and others will inevitably fall outside the hull of training data and into the “extrapolation” regime. “Outside of distribution” gets murky fast as dimensionality increases.
Thus, it’s nigh impossible to truly disentangle poor LLM performance into “failure to interpolate” and “failure to extrapolate”. It’s easy to make the case, but hard to prove it. “LLMs are fundamentally worse at extrapolation than humans are” remains an untested assumption.
It can be outright false or outright true. Or true under the current scales and training methods and false at 2028 SOTA—a quantitative gap, the way 85-IQ humans are notably worse than average at extrapolation. The case for “outright true” is overstated.
One common practical example of a lasting LLM deficiency is spatial reasoning. Why do LLMs perform so poorly at spatial reasoning and “commonsense physics” tasks like that in SimpleBench?
Wrong architecture for the job—something like insufficient depth? Inability to take advantage of test time compute? Failure to extrapolate from text-only training data? Failure to interpolate from the sparse examples of spatial reasoning in the training data? Lack of spatial reasoning priors that humans get from evolved brain wiring? Insufficient scale to converge to a robust world physics model despite the other deficiencies?
We did interrogate the question, and we have some hints, but we don’t have an exact answer. Multiple types of interventions improve spatial reasoning performance in practice, but none have attained human-level spatial reasoning in LLMs as of yet.
It doesn’t seem to be as neat and simple of a story as “LLMs are inherently poor extrapolators” with what’s known so far. And as long as SOTA performance keeps improving generation to generation, I’m not going to put a lot of weight on “the bottleneck is fundamental”.
If you actually look at the number of bits of training data the human brain receives from birth to adulthood, a huge proportion of them are visual data. So I’m not surprised that we’re comparatively good at 3D (and our nervous system very likely also has some good inductive priors for it). I suspect the answer for LLMs is mostly just multimodal models trained on a vast amount of video training data — expensive, though the cost is reducible somewhat by coming up with smarter ways to tokenize video.
That’s a lot of words to just say LLMs are currently super-competent within their training distribution and not good outside of it. I haven’t watched the whole podcast. Do we have good reason to think this particular deficit cannot be remedied? That making inroads on issues like continual learning won’t enable these sorts of systems to perform much better at ad hoc or out-of-distribution tasks?
Calling it “this particular deficit” is an understatement. To give a bad comparison (but maybe good enough for illustrative purposes): it’s like calling airplanes’ inability to go into space “a particular deficit”, when the entire design of the vehicle is optimized for something other than going into space, and properly re-optimizing it for space would amount to making it into something very non-airplane-like.
The main reason the comparison is bad is that, in the limit of human imitation (and RLVR and “generalized current stuff”), you get a complete emulation of human cognition (from the input-output perspective, at least), and it then becomes possible to use it to create a “cleaner” design of cognition that supports relevant aspects of human cognition that are beyond the reach of LLMs. But the limit may be quite far or even not practically achievable, or at least less practical than taking a route whose first step is getting back to the drawing board.
(This is not the same as saying that LLMs cannot be helpful in finding this “cleaner” architecture before reaching this limit.)
Finally, this will sound a bit like a reductio ad absurdum, but it’s relevant for talking about this clearly. What constitutes an “outside” of a training distribution depends on the larger distribution within which that training distribution is (considered as) being placed. Like in math, there is no “objective” complement of a set; a set’s complement exists only with reference to a superset of that set. So “outside of a training distribution” can be anything between [just a slightly larger neighborhood of the distribution], in which the LLM starts surprisingly flailing (relative to what we would expect from a human with those in-distribution capabilities (?)), and the entirety of our world’s (relevant) cognitive domains, the latter being AGI/ASI/A[something]I-complete.
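The set-complement point can be made concrete with a toy example of my own (the specific sets are illustrative, not from the discussion above): the same set has a modest “outside” relative to one superset and an enormous one relative to another.

```latex
% Let T be the even integers, standing in for "the training distribution".
% Its complement depends entirely on the ambient superset we choose:
\[
  \mathbb{Z} \setminus 2\mathbb{Z} = \{\dots, -3, -1, 1, 3, \dots\}
  \quad\text{(a modest ``outside'': just the odd integers)},
\]
\[
  \mathbb{R} \setminus 2\mathbb{Z} \supsetneq \mathbb{Z} \setminus 2\mathbb{Z}
  \quad\text{(a vastly larger ``outside'': almost the whole real line)}.
\]
```

The analogy: “outside the training distribution” is similarly underdetermined until you fix the superset against which the complement is taken.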
The fact that the concept of “outside of the training distribution” can be so inflated makes me think that it’s often used as a grab bag that hides a lot of complexity, and, in particular, all the complexity of human cognition minus what LLMs can do human-level-well or better.
A human attends board game night. They learn a new game they’ve never played before. Technically, this is out-of-distribution learning. This type of learning does not necessarily seem like augmenting a car for space travel (maybe it is). They are not having to learn all about games and dice and boards and pieces from scratch. They are mostly having to map existing learned models onto a slightly novel combination in a slightly new domain. I’m not saying that’s a trivial thing to do, because it’s a hard open problem that many smart people have been trying to crack for decades.
But it does not seem as daunting as you are portraying it. Yes, out-of-distribution is a very large space. But there’s an awful lot of that space that we’re simply not interested in learning anyway, so that narrows it down quite a lot.
As another commenter here noted, we probably actually do hope it’s a problem that won’t be cracked anytime soon, though the current effort and resources being spent towards the problem are historically unprecedented. I very well could be underestimating the problem. I guess we’ll just see.
In case you misunderstood, the analogy I was making was
air travel : space travel :: in-distribution learning : out-of-distribution learning. I was not claiming that getting an LLM to learn an OOD thing is like turning an airplane into a space rocket. But, as I said, it’s not a great analogy.

This seems more like a within-distribution problem: the player is encountering a game composed of pieces very like the pieces of games they’ve previously encountered, and the rules follow a similar logic. I expect that if you invent some game with simple rules that is a not-very-well-thought-through mash of checkers, chess, shogi, go, and the game of Ur, Claude 4.6 will get it.
A better example might be going from normal board games to Baba Is You or something. The ontology (or meta-ontology?) of Baba Is You is very different from that of the vast majority of board games. It’s not like you’re inventing everything from scratch. Old stuff transfers. Someone who has played some games will generally have an easier time learning Baba Is You than someone who has never played any. But some of it transfers in a non-straightforward way, and if you don’t do it right, it breaks.
I wouldn’t call it “daunting”. It’s just … a meaningfully different kind of beast?
But I also don’t see how us not caring about most of the space is supposed to make it easier.
If you want to figure out which one out of 1000 hypotheses is the correct one (in some classification problem, say), you don’t care about the other 999, but it doesn’t help.
If you mean that we only need to extrapolate to some nearby-ish regions of the training distribution, and most of the nearby-ish regions of the training distribution we don’t care about, then it seems to me like you’re looking for some specialized hacks, and I don’t think specialized hacks will work in general / take you “far”. (Feel free to correct me if I’m misinterpreting you.)
Well, that’s one of the big questions, isn’t it? Seems fairly clear there’s no hard boundary between in-distribution and out-of-distribution. Is the cure for cancer and the way to discover it going to be completely OOD? Or is it going to lean heavily on existing knowledge of cell biology, genetics, and all previous cancer research? The common phrasing is ‘standing on the shoulders of giants’. This is pretty well accepted as the way new inventions and discoveries happen. Not as radically alien knowledge that emerges from a vacuum, but an incremental step up using a mountain of existing knowledge bases (analogous to a game composed of pieces very alike ones they’ve previously encountered). Very large discoveries or paradigm shifts are likely more OOD, but the vast bulk of new science is fairly incremental and I would think the sort of problems you’d consider within-distribution. No?
Yeah, this is a vague description of the most salient failure mode of LLM capabilities, but its vagueness (or maybe: our understanding of this phenomenon being low-resolution) doesn’t make it non-real, or less significant, or easier to overcome.
A mosaic of both, but I also expect that OOD-ish reasoning is common in normal humans, and that if you somehow stuck Claude 4.6 in a human body and tasked it with leading a normal human life, it would start doing something weirdly stupid by human standards within the first 1–2 hours, and that over time those stupid things would cascade if uncorrected (be it by whoever is overseeing that LLM-in-a-human-body or by other social forces taking care of a weirdly behaving cyborg).
Never did I claim that “OOD-ish reasoning”/”true creativity” is about summoning new knowledge from the vacuum. In my previous comment, I wrote “Old stuff transfers. [...] But some of it transfers in a non-straightforward way, and if you don’t do it right, it breaks.”.
Sure. AlphaFold and LLMs solving open math problems are examples of this.
I sense that you’re intending this comment to imply/suggest something, but I don’t know what.
It seems totally able to be remedied somehow, but it’s been an open problem for a looong time. It definitely seems like it’ll be one of the last things to fall, from the current vantage point. But maybe we just accumulate enough unreliable workarounds that it’s no longer a severe limitation. I have ideas; hopefully they’re bad ones, because I’d rather this not get improved until we’ve figured out how to gain confidence in safety/alignment qualitatively faster than we can right now, enough that open-ended RL at test time can be assumed to be asymptotically safe.