It’s incredibly easy to be fooled by the capabilities of the current top-performing tech (LLM agents). It’s easy because they have a vast amount of training data to interpolate from.
This works fine to acquire capabilities within our existing data distribution of the world (one that is also easy to verify), but what happens when they go out of distribution?
LLMs perform poorly! Yet, people seem to think they can actually generalize to new problems. Why is that?
It’s, again, the vastness of their training data. It makes it hard to distinguish between interpolation and extrapolation (or hyperpolation, if you want to add a third dimension).
For example, a TypeScript app is within-distribution! AI research in the existing body of research is within-distribution, and companies are paying millions to build RL environments to make them *specifically* good at some of those things!
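To make the interpolation/extrapolation distinction concrete, here is a deliberately tiny sketch. It is not evidence about LLMs; it only shows how a predictor that memorises dense samples can look accurate inside its training range and fail badly outside it. The target function, the training range, and the two query points are all hypothetical choices made for illustration.

```python
# Toy illustration only: a nearest-neighbour "model" that memorises samples
# of f(x) = x**2 on [0, 1]. The function, range, and query points are
# hypothetical choices made for this sketch.

def target(x: float) -> float:
    """Ground-truth function the 'model' is supposed to capture."""
    return x * x

# Dense training coverage of the interval [0, 1].
TRAIN = [(i / 100, target(i / 100)) for i in range(101)]

def predict(x: float) -> float:
    """Return the stored output of the closest training input."""
    _, nearest_y = min(TRAIN, key=lambda pair: abs(pair[0] - x))
    return nearest_y

def abs_error(x: float) -> float:
    """Absolute gap between the memorised answer and the true value."""
    return abs(predict(x) - target(x))

in_distribution_error = abs_error(0.505)     # falls between two training points
out_of_distribution_error = abs_error(3.0)   # far outside the training range

print(f"in-distribution error:     {in_distribution_error:.4f}")
print(f"out-of-distribution error: {out_of_distribution_error:.4f}")
```

With dense enough coverage, spot checks inside the range keep passing, so those checks alone cannot tell you whether the predictor has captured the underlying rule.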
In a way, this is like a large-scale reprise of the expert systems era, where instead of paying experts to directly program their thinking as code, they provide numerous examples of their reasoning and process formalized and tracked, and then we distill this into models through behavioural cloning. This has updated me slightly towards longer AI timelines since given we need such effort to design extremely high quality human trajectories and environments for frontier systems implies that they still lack the critical core of learning that an actual AGI must possess. Simply grinding to AGI by getting experts to exhaustively cover every possible bit of human knowledge and skill and hand-coding (albeit with AI assistance) every single possible task into an RL-gym seems likely to both be inordinately expensive, take a very long time, and seems unlikely to suddenly bootstrap to superintelligence.
It might still be impressive, but models are largely remixing many things they have seen in great detail during training (many impressive headline results have even been determined to be the model re-using existing implementations/PRs via search instead of coming up with actually-new ones!). This is not about LLMs not doing impressive things! This is about precisely describing their capability profile, where it comes from, and whether more of the same (e.g., scale) gets you a whole new set of impressive outcomes (e.g., novel R&D that isn’t just remixing existing research).
Even if you consider “researchers can come up with novel ideas and give them to the AIs”, that likely involves longer timelines. But, just as importantly, LLMs may be exceptional at automating within-paradigm research, disproportionately better than at automating out-of-paradigm research. Therefore, you end up accelerating research that may largely be irrelevant for ‘True’ AGI (yes, you still accelerate many coding parts, but the speed-up is bottlenecked in ways that make it hard to claim the entire process of arriving at these research breakthroughs is now 1000x faster than before).
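The bottleneck point can be put as rough arithmetic using Amdahl's law: if only a fraction of a process is accelerated, the part that is not accelerated caps the overall speed-up. This is a sketch under a strong simplifying assumption, namely that research time splits cleanly into an accelerated part and an untouched part. The fractions below are hypothetical, not measurements.

```python
def overall_speedup(accelerated_fraction: float, factor: float) -> float:
    """Amdahl's law: overall speed-up when only part of the work gets faster.

    accelerated_fraction: share of total time that is sped up (0 to 1).
    factor: speed-up applied to that share (e.g. 1000 for "1000x").
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    # Time left over = untouched share + accelerated share shrunk by `factor`.
    remaining_time = (1.0 - accelerated_fraction) + accelerated_fraction / factor
    return 1.0 / remaining_time

# Hypothetical splits: how much of the research process is "coding-like".
half_accelerated = overall_speedup(0.5, 1000.0)
mostly_accelerated = overall_speedup(0.9, 1000.0)

print(f"50% of work at 1000x -> {half_accelerated:.2f}x overall")
print(f"90% of work at 1000x -> {mostly_accelerated:.2f}x overall")
```

Under these assumptions, the overall factor is bounded by 1 / (1 - accelerated_fraction) no matter how large the per-task speed-up gets.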
“But the models are still capable and growing more capable! Why does this matter? Scale will just solve this!”
It matters because:
1. The whole point of alignment has always been about generalizing ‘human values’ out-of-distribution. So, if alignment and capabilities are tied, it means models are capable of modeling the existing within-distribution ‘values’, but things may pull apart once we undergo the distributional shift of a post-AGI deployment world.
An example you can test right now is LLMs lacking a sense of how to engage with the world in this post-agent era. You have to keep reminding them about the current state of the world. The closer you get to novel R&D that the labs haven’t paid millions in RL envs for (e.g. AI R&D), the starker this becomes.
You can point to continual learning ‘solving’ this, but that is kind of my point. These capability unlocks will fundamentally change the AI and its relationship with itself. Related, “You can’t imitation-learn how to continual-learn”.
Future AI models will be asked to solve hard tasks. We expect that solving hard tasks requires some sort of goal-directed, self-guided, outcome-based, online learning procedure, which we call the “science loop”, where the AI makes incremental progress toward its high-level goal. We think this “science loop” encourages goal-directedness, instrumental reasoning, instrumental goals, beyond-episode goals, operational non-myopia, and indifference to stated preferences, which we jointly call “Consequentialism”. We then argue that consequentialist agents that are situationally aware are likely to become schemers (absent countermeasures) and sketch three concrete example scenarios.
[...]
Self-guided online learning: There is an online learning component to it, i.e. the model has to condense the new knowledge it learned from iterations. For example, the model could run thousands of different trajectories in parallel. Then, it could select the trajectories that it expects to make the most progress toward its goal and fine-tune itself on them. The decisions about which data to select for fine-tuning are made by the model itself with little human correction, e.g. in some form of self-play fashion. Since the problem is hard, humans perform worse than the model at selecting different rollouts, and since there is a lot of data to sift through, humans couldn’t read it all in time anyway.
2. It also matters because it means that the existing paradigm may be missing something so foundational that much of the safety research as it exists today will simply not generalize (off-distribution). They are testing the shallow within-distribution heuristic mimicking and generalization of LLMs.
It’s like doing evals on a brain that regurgitates what it’s seen, but hasn’t actually gone through a thoughtful, reflective process to bring coherence to it all. The training data might let it mimic what we’ve fed it, but it still hasn’t gone through the process of evolving its own beliefs as it engages with the world.
To me, all of this is consistent with the experiments and behaviour we see from LLMs, yet my interpretation of the results of experiments seems to be different from lots of the safety community. They seem to be looking for “scheming” and other such things, but the incoherent behaviour of LLMs seems much shallower than that, imo! (Relevant posts: The Case Against AI Control Research and Current AIs seem pretty misaligned to me).
The type of thing they are missing might mean that they don’t really understand things. And the requirement for ‘understanding’ is also so interwoven with alignment, novel R&D, pursuing long-term complex goals in changing environments, etc., that existing (empirical) safety research gets itself fundamentally confused.
An LLM that is behaving ‘nice’ may be so shallow and heuristic-driven that it is effectively in a system 1-like mode despite the appearance of ‘reasoning’ and ‘thinking’. In pursuit of complex, long-term goals, we might expect that an autonomously self-trained AI would systematically remove these weak heuristics as a necessary step to succeed at these goals.
Just imagine an AI starting a complex company where it needs to maximize shareholder value and is competing with an entire economy of other AIs. The world is changing; they all have similar heuristics. The change in behaviour needs to be more fundamental for it to win.
Ultimately, I think we need to provide further clarity on the above, as I believe it has led folks to misapply their vague understanding of traditional alignment research (which many new researchers should engage with more) to existing AI models, and it may be leading AI safety research of superintelligence astray.
Section 7.9 of Claude Mythos Preview System Card had Anthropic describe how Mythos generated novel puns and began to prefer particular philosophers, while the Opuses recycled puns found online. How plausible is it that novel OOD understanding levels do actually scale with the LLMs’ size?
I would probably consider “novel” puns to be within-distribution, even if not memorized puns.
But honestly, I think these examples are just generally hard to make sense of, since we don’t have access to their training setup or data (is it a type of pun interpolated across many languages? How much does it relate to true novelty in complex, long-horizon domains?). I could see scale being useful for interpolating these new puns while not necessarily being relevant to what is needed for ASI. Or, scale could actually be making progress towards these sorts of capabilities! It just seems overstated (at least pre-Mythos, which I can’t test), and I feel like it poisons research selection and experiment interpretation.
Scale is obviously helpful, but imo there is more nuance to it than lots of folks consider properly. I’m asking that we try to be more precise about all of this.
For example, I think Talkie-1930 (a model trained on pre-1930s data) is a great example of generalization research (though yes, it does not say much about frontier scaling)! It helps us better understand generalization. But I saw implied claims that the model was able to solve a Python problem via ICL, and when you look at the details of the experiment, the OOD generalization coding example feels dubious. From @Steven Byrnes (link to his post / my take):
I was surprised and puzzled by this, because I’m a general skeptic of (so-called) “in-context learning”—I generally say that LLMs have decent “understanding” of what’s in their weights, but quite sketchy & superficial “understanding” of stuff in the context window but NOT the weights. The context window can really only support “recognition” of things that the weights already “understand”. Or at least, that’s what I’ve been saying for years.
So how is Talkie-1930 doing any Python at all? I was puzzled.
…But then I looked at the example. All that Talkie did was exactly copy the example in context but switch “+ 5” to “– 5”. (And it got it right at least once given 100 tries!)
I can definitely imagine that a person who has read every pre-1930 book on symbolic logic, cryptography, and the rest of math (& jacquard looms etc.) could guess that answer, at a glance, given 100 tries, while remaining deeply deeply confused about wtf was going on in the code snippet, and while “understanding” zero Python (and zero code) in any real sense.
So I don’t think this one example gives me any new reason to change my mind about the (lack of) power of (so-called) in-context learning, or to be suspicious of data leakage, subliminal learning, etc. in Talkie-1930.
(Cool project, kudos to the authors.)
I feel like I see examples like this all the time! Often, I expect it because there’s some sort of bias towards trying to ‘warn the world about what is coming’, which leads people in AI safety to overstate such results and muddy our comprehension of what is happening.
It is in principle possible to 1000x the economy or to defeat humanity using only interpolation, depending on data efficiency. At high data efficiency a human just needs to do something once, and that mental or physical motion is instantly scaled to the entire economy, as well as interpolation between it and anything else a human has done. Likewise you get at minimum robot armies 1000x the size of humanity that can follow routine orders.
However, I think it is important to separate what can be repeatable at 1000x and what is actual increased productivity.
For example, I can generate so many plots now! More than I used to! So much code too. +1000x in fact! But is it actually providing more value to the world at that rate? No!
As Terence Tao said during the recent Dwarkesh interview:
Dwarkesh Patel
So let’s see if you can continue this streak. You personally are 2x more productive as a result of AI. What year would you say that?
Terence Tao
Productivity, I think, is not quite a one-dimensional quantity. I’m definitely noticing that the style in which I do mathematics is changing quite a bit, and the type of things I do. For example, my papers now have a lot more code, a lot more pictures, because it’s so easy to generate these things now. Some plot which would have taken me hours to do, now I can do in minutes. But in the past, I just wouldn’t have put the plot in my paper in the first place. I would just talk about it in words. So it’s hard to measure what 2x means.
On the one hand, I think the type of papers that I would write today, if I had to do them without AI assistance, would definitely take five times longer. But I would not write my papers that way.
Dwarkesh Patel
5x?
Terence Tao
Yeah, but these are auxiliary tasks. Things like doing a much deeper literature search or supplying a lot more numerics. They enrich the paper. The core of what I do, actually solving the most difficult part of a math problem, hasn’t changed too much. I still use pen and paper for that.
But there’s lots of silly things. I use an AI agent now to reformat. Sometimes if all my parentheses are not quite the right size, I used to manually change them by hand, and now I can get an AI agent to do all that quite nicely in the background.
They’ve really sped up lots of secondary tasks. They haven’t yet sped up the core thing that I do, but it’s allowed me to add more things to my papers. By the same token, if I were to write a paper I wrote in 2020 again—and not add all these extra features, but just have something of the same level of functionality—it actually hasn’t saved that much time, to be honest. It’s made the papers richer and broader, but not necessarily deeper.
I’d say this less strongly, but agreed on the general trend.
I will say 2 things here:
AI training faces very different tradeoffs from human training. A big one is that AIs don’t need to be nearly as sample efficient to get good results, at least so far, because they are not currently focused on robotics, where sample efficiency is for now paramount. Sample efficiency combined with low latency is probably the single biggest constraint on human evolution. While humans are slower to learn physical movements than many animals, we are still shockingly sample efficient. Especially in timelines where a software intelligence explosion is in the cards, sample efficiency will matter a lot less. There’s also a more general explanation from Carl Shulman that roughly goes: AI training is massively more compute-limited, whereas we can give models lots of data, while the reverse is true for evolution, which had enough compute to brute-force biology if appropriately directed, but had very limited data to work with.
One of my updates on AI progress is that even if this current paradigm stalls out, people will still innovate and compute stocks will grow larger, and that this is enough to put median timelines in the 2040s. To be clear, I’d be really happy if AGI and then ASI were developed in the 2040s instead of today, because I’d update towards slower takeoffs and more alignment success/more sanity in general. But by and large, one of the updates I’ve made is that the CCF/Bioanchors models were basically tracking the right things, but got the numbers very wrong.
The instances where they aren’t interpolators have very outsized effects on the world. People seem to forget this, I’m not sure why; maybe because it’s rare, and hard to distinguish if you’re not an expert. (And on the other hand, children do the same mental motions—they’re very much not mostly interpolators—but it’s only originary and not novel, so we discount it.) See:
While LLMs could be a lot better at dealing with novel concepts, I don’t think they are currently useless for engaging with them; I get a lot of value out of using LLMs in that way.
I think a key problem LLMs currently have is the lack of good memory. They have a hard time adding new ontology that’s not in their training data and reasoning based on it. A human who does novel research and adopts a few new terms on day one of their research journey can use those terms easily on day two, while current LLMs have a hard time with that.
I am not saying this means AIs will never become “ASI”. I am not saying that timelines can’t be as short as 1-2 years. All of that is within the realm of possibility and despite me poking at the problems with current models, I still put decent weight on them being solved fairly soon. And even if it does take a number of years to solve that variety of capability problems, I still think AIs can be highly transformative in the next 2-3 years.
It might still be impressive, but models are largely remixing many things they have seen in great detail during training (many impressive headline results have even been determined to be the model re-using existing implementations/PRs via search instead of coming up with actually-new ones!). This is not about LLMs not doing impressive things! This is about precisely describing their capability profile, where it comes from, and whether more of the same (e.g., scale) gets you a whole new set of impressive outcomes (e.g., novel R&D that isn’t just remixing existing research).
Do you think that today’s breakthrough on the planar unit distance problem is merely the model remixing things learned during pretraining? I’m not an expert, but it seems unlikely to me. Arul Shankar, a notable number theorist, stated:
In my opinion this paper demonstrates that current AI models go beyond just helpers to human mathematicians – they are capable of having original ingenious ideas, and then carrying them out to fruition.
In anticipation of people using that example as a response, I added a follow-up tweet to the tweet version of this post:
yes, dear reader, the new erdos unit distance problem solved by OpenAI’s model still fits within this story. in fact, i expected these kinds of results.
My understanding is that the main difficulty in arriving at this result is performing sufficient inference over the entire search space, far beyond what any human can do. And so, it is indeed superhuman in that specific capability.
But I suspect that every step to get there was within distribution, and ICL was enough to arrive at the result. However, I also suspect that there are many other types of problems that require integrating knowledge into the weights and can’t be done with long chains of “within-distribution” reasoning that sum to a new result.
So, I suspect the bottleneck in solving the problem in the past was more about human inability to search through an extremely large number of paths.
This is not to say what the model did isn’t impressive, but imo this is within the realm of problems I’d expect it to solve (which is a lot!). There are other types of problems I expect they would need to consolidate information (in the weights) while solving the problem. Though who is to say they aren’t doing that, I don’t know what OAI is doing for sure.
FYI, I’m running on the assumption that proofs will be cheap/free in the coming years (for most applied math, likely sooner) and partially betting my startup on this, which is another reason I expected this kind of result.
I have some trouble squaring this with the increasingly excellent OOD cyber capabilities of the leading models. Is the argument that their more generalized cyber skills (relative to some fuzzier domains, like alignment) are strong because they were subjected to well curated RL environments that taught them to hyperpolate more effectively for coding tasks?
From Anthropic’s original assessment, the step change in Claude Mythos’s cybersecurity capabilities wasn’t just that it got much better at discovering existing bugs in software, but at creatively chaining them together into new exploits. Isn’t zero-day discovery the sort of process that is necessarily OOD?
These capabilities have emerged very quickly. Last month, we wrote that “Opus 4.6 is currently far better at identifying and fixing vulnerabilities than at exploiting them.” Our internal evaluations showed that Opus 4.6 generally had a near-0% success rate at autonomous exploit development. But Mythos Preview is in a different league. For example, Opus 4.6 turned the vulnerabilities it had found in Mozilla’s Firefox 147 JavaScript engine—all patched in Firefox 148—into JavaScript shell exploits only two times out of several hundred attempts. We re-ran this experiment as a benchmark for Mythos Preview, which developed working exploits 181 times, and achieved register control on 29 more.[1]
Isn’t zero-day discovery the sort of process that is necessarily OOD?
In many cases, lots of security bugs that haven’t been found are simply a case of not enough effort being put into finding them. In this case, I think you could just as reasonably say that Mythos is becoming better at modeling the data distribution due to scale, and therefore ends up being better at finding these vulnerabilities.
On a related note, I’ve started to distrust Anthropic’s judgement on these things. Particularly, I believe that they oversold the C compiler experiment as being OOD, but I think this is false.
From the Jeremy Howard podcast link I shared:
So for example, I was talking to Chris Lattner yesterday about how Anthropic had got Claude to write a C compiler. And they were like, “oh, this is a clean-room C compiler. You can tell it’s clean-room because it was created in Rust.” So, Chris created the, I guess it’s probably the top most widely used C / C++ compiler nowadays, Clang, on top of LLVM, which is the most widely used kind of foundation for compilers. They’re like: “Chris didn’t use rust. And we didn’t give it access to any compiler source code. So it’s a clean-room implementation.”
But that misunderstands how LLMs work. Right? Which is: all of Chris’s work was in the training data. Many many times. LLVM is used widely and lots and lots of things are built on it, including lots of C and C++ compilers. Converting it to Rust is an interpolation between parts of the training data. It’s a style transfer problem. So it’s definitely compositional creativity at most, if you can call it creative at all. And you actually see it when you look at the repo that it created. It’s copied parts of the LLVM code, which today Chris says like, “oh, I made a mistake. I shouldn’t have done it that way. Nobody else does it that way.” Oh, wow. Look. The Claude C compiler is the only other one that did it that way. That doesn’t happen accidentally. That happens because you’re not actually being creative. You’re actually just finding the kind of nonlinear average point in your training data between, like, Rust things and building compiler things.
I’ll try to make this clearer if I turn it into a more serious top-level post. My intent here was to just push this out since it’s been bothering me, but I have other things to do.
TLDR: Lots of researchers seem to be banking on the idea that LLMs are generalizing OOD or that scale will just solve this (whether through scale alone or scale + using the scaled model to come up with a research breakthrough that does). Lots of research and funding seem to hinge on this idea, which, imo, is underappreciated. If taken seriously, it may mean that 1) timelines are longer, 2) we should expect fundamental reshaping of AI cognition due to the LLM inability to generalize OOD, 3) we shouldn’t update much on alignment progress based on current safety research.
I shared this in the post, but more thoughts here.
This post by @Hyperion describes another natural consequence of the above with respect to RSI (that the field seems to be understating):
Some takes about RSI from discussions with many smart researchers & thinkers:
1. Many RSI (or automated AI R&D) debates converge to similar cruxes: is a 1000x sample efficiency improvement possible, can you just simulate reality and train on it with no sim2real gap, can we easily make models good at “fuzzy” tasks? People like to assume that automated research agents will find such breakthroughs specifically *because* without them, progress could be heavily bottlenecked on data or continued compute scale-ups.
2. The Yudkowsky “genius brain in a box” framing of ASI has latent influence on many researcher views even though people may not be aware of it. A common move is to “flip” predictions, as they go further out, from assuming LLM or deep learning-specific properties of future AI to assuming “von Neumann x1000”, human brain-like properties. I’d like to see more thought-out reasoning of why this flip should occur at any particular point (eg pre or post automated AI R&D)—this question is a crux behind many predictions like AI 2027.
3. There are some cracks in this worldview beginning to show: predictions from a few years ago that models would be less jagged now than they are, or that they would be more deceptive, synthetic data would work better, etc. Many of these seem like prediction errors from imagining future models as a “human brain in a box”, but LLMs are empirically a different kind of intelligence. Most models of software-only intelligence explosion are also coarse enough to mostly ignore properties of LLMs.
4. Views about fast RSI progress seem to be correlated with (a) belief that synthetic data is all you need (b) belief in very high GDP growth and an industrial explosion because of automated firms (c) having worked only in AI research or in small organizations.
5. Key technical things to track over the next 1-2 years: does RL increase in its generalization, AI lab data spend, can we automate synthetic RL env construction, best practices for FDEs deploying AI into large enterprises, coherency of AI personas, how powerful will multi-agent scaling of test-time compute be, and continual learning.
6. Overall I think the “RSI leading to *fast* takeoff” frame had huge alpha in 2022, moderate in 2024, and potentially is of neutral usefulness in 2026 for predicting the future.
It’s like doing evals on a brain that regurgitates what it’s seen, but hasn’t actually gone through a thoughtful, reflective process to bring coherence to it all. The training data might let it mimic what we’ve fed it, but it still hasn’t gone through the process of evolving its own beliefs as it engages with the world.
This made me curious whether improving LLMs’ ability to do Bayesian updating could address this. Consider a claim A to which the LLM assigns probability P(A), and let B be new information. Perhaps we can construct some kinds of questions where the LLM has to have a properly calibrated P(A|B). It’s unclear what questions these would be, but what comes to mind are forecasting questions where recent events move a prediction market (for events past the knowledge cutoff).
But I think updating one belief isn’t enough for the coherence you want. We could also do some sort of consistency training: training the model to satisfy constraints like P(A and B) ≤ P(B), and to avoid violations of the law of total probability, across a whole graph of the model’s related beliefs. In effect, these two training objectives could get you a reasoner that can update in response to new information and propagate that update through the rest of what it believes.
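As a rough sketch of what such checks could look like, the snippet below computes a Bayesian posterior and flags two kinds of incoherence in a set of stated probabilities: a conjunction rated more likely than one of its parts, and a mismatch with the law of total probability. The function names and example numbers are hypothetical. How to elicit these probabilities from an LLM, and whether the gaps would work as a training signal, is left open here.

```python
from typing import List

def bayes_posterior(p_a: float, p_b_given_a: float, p_b: float) -> float:
    """P(A|B) = P(B|A) * P(A) / P(B), the target for a calibrated update."""
    if p_b <= 0.0:
        raise ValueError("P(B) must be positive to condition on B")
    return p_b_given_a * p_a / p_b

def conjunction_violations(p_a: float, p_b: float, p_a_and_b: float,
                           tol: float = 1e-9) -> List[str]:
    """List the ways P(A and B) is inconsistent with P(A) and P(B)."""
    problems = []
    if p_a_and_b > p_a + tol:
        problems.append("P(A and B) > P(A)")
    if p_a_and_b > p_b + tol:
        problems.append("P(A and B) > P(B)")
    return problems

def total_probability_gap(p_a: float, p_b: float,
                          p_a_given_b: float, p_a_given_not_b: float) -> float:
    """|P(A) - [P(A|B)P(B) + P(A|not B)(1 - P(B))]|; zero when coherent."""
    implied_p_a = p_a_given_b * p_b + p_a_given_not_b * (1.0 - p_b)
    return abs(p_a - implied_p_a)

# Hypothetical stated beliefs, as if elicited from a model.
posterior = bayes_posterior(p_a=0.3, p_b_given_a=0.8, p_b=0.5)
violations = conjunction_violations(p_a=0.3, p_b=0.5, p_a_and_b=0.4)
coherent_gap = total_probability_gap(0.3, 0.5, 0.4, 0.2)
incoherent_gap = total_probability_gap(0.3, 0.5, 0.9, 0.5)

print(posterior, violations, coherent_gap, incoherent_gap)
```

Summing such gaps over many related claims would give one possible consistency score for a belief graph, under the assumption that the elicited numbers reflect what the model would actually act on.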
A TypeScript app is within-distribution! AI research in the existing body of research is within-distribution, and companies are paying millions to build RL environments to make them *specifically* good at some of those things!
From this, I infer that “in distribution” in this context basically means “sufficiently similar to a task which the LLM has explicitly encountered/been trained on”.
I find myself wondering: If we had some magical way of quantifying the percent similarity between two tasks, how surprised would you be if one of today’s LLMs completed a task that was 99% similar to one it had explicitly been trained on? How about 80% similar? Or 50%? These are basically nonsense questions, since I’ve just picked out some magical metric whose specifications you and I don’t know. But what I’m trying to get at qualitatively, is that I’m curious about what counts as “sufficiently similar”. How does your expectation of LLM capability vary as a function of similarity to tasks that the model has already encountered/been trained on (and also as a function of what that task is about)? How do you model this expectation varying with LLM size and training time and context window size, etc? I’d like to observe that, based on the way the above post struck me, you basically treat “in/out of distribution” as a binary characteristic of a task—or at most a very coarse gradient—which seems needlessly low-fidelity.
Let me clarify that despite me not having a perfectly precise definition here, part of my goal is to point out that most of the community seem to 1) fail at being precise about what they mean and consider to be generalization, 2) overstate the novelty generated by the models.
I wanted to at least highlight that interpolated generalization and OOD generalization seem more separate than people let on.
Please read my other comments in the thread for more context, particularly the one about Mythos. They largely contain my takes on your questions.
It’s incredibly easy to be fooled by the capabilities of the current top-performing tech (LLM agents). It’s easy because they have a vast amount of training data to interpolate from.
This works fine to acquire capabilities within our existing data distribution of the world (one that is also easy to verify), but what happens when they go out of distribution?
LLMs perform poorly! Yet, people seem to think they can actually generalize to new problems. Why is that?
It’s, again, the vastness of their training data. It makes it hard to distinguish between interpolation and extrapolation (or hyperpolation, if you want to add a third dimension).
For example, a TypeScript app is within-distribution! AI research in the existing body of research is within-distribution, and companies are paying millions to build RL environments to make them *specifically* good at some of those things!
Related and great post from Beren, “Most Algorithmic Progress is Data Progress”:
It might still be impressive, but models are largely remixing many things they have seen in great detail during training (many impressive headline results have even been determined to be the model re-using existing implementations/PRs via search instead of coming up with actually-new ones!). This is not about LLMs not doing impressive things! This is about precisely describing their capability profile, where it comes from, and whether more of the same (e.g., scale) gets you a whole new set of impressive outcomes (e.g., novel R&D that isn’t just remixing existing research).
And yes, I know you can make a ton of discoveries by interpolating existing research (e.g., interdisciplinary research and automating research pipelines to run more experiments). But I also think that people are overly confident that this means LLMs will be capable of novel R&D breakthroughs, and about what that implies is needed from future AIs.
Even if you consider “researchers can come up with novel ideas and give them to the AIs”, that likely involves longer timelines. But, just as importantly, LLMs may be exceptional at automating within-paradigm research, disproportionately better than at automating out-of-paradigm research. Therefore, you end up accelerating research that may largely be irrelevant for ‘True’ AGI (yes, you still accelerate many coding parts, but the speed-up is bottlenecked in ways that make it hard to claim the entire process of arriving at these research breakthroughs is now 1000x faster than before).
“But the models are still capable and growing more capable! Why does this matter? Scale will just solve this!”
It matters because:
1. The whole point of alignment has always been about generalizing ‘human values’ out-of-distribution. So, if alignment and capabilities are tied, it means models are capable of modeling the existing within-distribution ‘values’, but things may pull apart once we undergo the distributional shift of a post-AGI deployment world.
An example you can test right now is LLMs lacking a sense of how to engage with the world in this post-agent era. You have to keep reminding them about the current state of the world. The closer you get to novel R&D that the labs haven’t paid millions in RL envs for (e.g. AI R&D), the starker this becomes.
You can point to continual learning ‘solving’ this, but that is kind of my point. These capability unlocks will fundamentally change the AI and its relationship with itself. Related, “You can’t imitation-learn how to continual-learn”.
Also, from “Training AI agents to solve hard problems could lead to Scheming”:
2. It also matters because it means that the existing paradigm may be missing something so foundational that much of the safety research as it exists today will simply not generalize (off-distribution). That research is testing the shallow, within-distribution heuristic mimicry and generalization of LLMs.
It’s like doing evals on a brain that regurgitates what it’s seen, but hasn’t actually gone through a thoughtful, reflective process to bring coherence to it all. The training data might let it mimic what we’ve fed it, but it still hasn’t gone through the process of evolving its own beliefs as it engages with the world.
To me, all of this is consistent with the experiments and behaviour we see from LLMs, yet my interpretation of the results of experiments seems to be different from lots of the safety community. They seem to be looking for “scheming” and other such things, but the incoherent behaviour of LLMs seems much shallower than that, imo! (Relevant posts: The Case Against AI Control Research and Current AIs seem pretty misaligned to me).
The type of thing they are missing might mean that they don’t really understand things. And the requirement for ‘understanding’ is so interwoven with alignment, novel R&D, pursuing long-term complex goals in changing environments, etc., that existing (empirical) safety research gets itself fundamentally confused.
An LLM that is behaving ‘nice’ may be so shallow and heuristic-driven that it is effectively in a system 1-like mode despite the appearance of ‘reasoning’ and ‘thinking’. In pursuit of complex, long-term goals, we might expect that an autonomously self-trained AI would systematically remove these weak heuristics as a necessary step to succeed at these goals.
Just imagine an AI starting a complex company where it needs to maximize shareholder value and is competing with an entire economy of other AIs. The world is changing; they all have similar heuristics. The change in behaviour needs to be more fundamental for it to win.
Ultimately, I think we need to provide further clarity on the above, as I believe it has led folks to misapply their vague understanding of traditional alignment research (which many new researchers should engage with more) to existing AI models, and it may be leading AI safety research of superintelligence astray.
Further reading:
Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI
Why Aren’t LLMs General Intelligence Yet?
“Sharp Left Turn” discourse: An opinionated review
Continual learning explains some interesting phenomena in human memory
Podcast: Jeremy Howard is bearish on LLMs
6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa
(fwiw I think this’d be a good top-level post)
Section 7.9 of Claude Mythos Preview System Card had Anthropic describe how Mythos generated novel puns and began to prefer particular philosophers, while the Opuses recycled puns found online. How plausible is it that novel OOD understanding levels do actually scale with the LLMs’ size?
I would probably consider “novel” puns to be within-distribution, even if not memorized puns.
But honestly, I think these examples are just generally hard to make sense of, since we don’t have access to their training setup or data (is it a type of pun interpolated across many languages? How much does it relate to true novelty in complex, long-horizon domains?). I could see scale being useful for interpolating these new puns while not necessarily being relevant to what is needed for ASI. Or, scale could actually be making progress towards these sorts of capabilities! It just seems overstated (at least pre-Mythos, which I can’t test), and I feel like it poisons research selection and experiment interpretation.
Scale is obviously helpful, but imo there is more nuance to it than lots of folks properly consider. I’m asking that we try to be more precise about all of this.
For example, I think Talkie-1930 (model trained pre-1930s) is a great example of generalization research (though yes, it does not say much about frontier scaling)! It helps us better understand generalization. However, I saw implied claims that the model was able to solve a Python problem via ICL, and when you look at the details of the experiment, the OOD generalization coding example feels dubious. From @Steven Byrnes (link to his post / my take):
I feel like I see examples like this all the time! I suspect it often happens because there’s some sort of bias towards trying to ‘warn the world about what is coming’, which leads people in AI safety to overstate such results and muddy our comprehension of what is happening.
It is in principle possible to 1000x the economy or to defeat humanity using only interpolation, depending on data efficiency. At high data efficiency a human just needs to do something once, and that mental or physical motion is instantly scaled to the entire economy, as well as interpolation between it and anything else a human has done. Likewise you get at minimum robot armies 1000x the size of humanity that can follow routine orders.
I agree it is possible and fits within my model.
However, I think it is important to separate what can be repeatable at 1000x and what is actual increased productivity.
For example, I can generate so many plots now! More than I used to! So much code too. +1000x in fact! But is it actually providing more value to the world at that rate? No!
As Terence Tao said during the recent Dwarkesh interview:
Dwarkesh Patel
So let’s see if you can continue this streak. You personally are 2x more productive as a result of AI. What year would you say that?
Terence Tao
Productivity, I think, is not quite a one-dimensional quantity. I’m definitely noticing that the style in which I do mathematics is changing quite a bit, and the type of things I do. For example, my papers now have a lot more code, a lot more pictures, because it’s so easy to generate these things now. Some plot which would have taken me hours to do, now I can do in minutes. But in the past, I just wouldn’t have put the plot in my paper in the first place. I would just talk about it in words. So it’s hard to measure what 2x means.
On the one hand, I think the type of papers that I would write today, if I had to do them without AI assistance, would definitely take five times longer. But I would not write my papers that way.
Dwarkesh Patel
5x?
Terence Tao
Yeah, but these are auxiliary tasks. Things like doing a much deeper literature search or supplying a lot more numerics. They enrich the paper. The core of what I do, actually solving the most difficult part of a math problem, hasn’t changed too much. I still use pen and paper for that.
But there’s lots of silly things. I use an AI agent now to reformat. Sometimes if all my parentheses are not quite the right size, I used to manually change them by hand, and now I can get an AI agent to do all that quite nicely in the background.
They’ve really sped up lots of secondary tasks. They haven’t yet sped up the core thing that I do, but it’s allowed me to add more things to my papers. By the same token, if I were to write a paper I wrote in 2020 again—and not add all these extra features, but just have something of the same level of functionality—it actually hasn’t saved that much time, to be honest. It’s made the papers richer and broader, but not necessarily deeper.
I’d say this less strongly, but agreed on the general trend.
I will say 2 things here:
1. AI training faces very different tradeoffs from human training. A big one is that AIs don’t need to be nearly as sample efficient to get good results. So far, that is because they are not currently focused on robotics, where sample efficiency is for now paramount; sample efficiency, combined with low latency, was probably the single biggest constraint on human evolution. While humans are slower to learn physical movements than many animals, we are still shockingly sample efficient. Especially in timelines where a software intelligence explosion is in the cards, sample efficiency will matter a lot less. There’s also a more general explanation from Carl Shulman that roughly goes: AI training is massively more compute-limited but can draw on lots of data, while the reverse is true for evolution, which had enough compute to brute-force biology if appropriately directed but had very limited data to work with.
2. One of my updates on AI progress is that even if the current paradigm stalls out, people will still innovate and compute stocks will grow larger, and that this is enough to put median timelines in the 2040s. To be clear, I’d be really happy if AGI and then ASI were developed in the 2040s instead of today, because I’d update towards slower takeoffs and more alignment success/more sanity in general. But by and large, one of the updates I’ve made is that the CCF/Bioanchors models were basically tracking the right things but got the numbers very wrong.
In this sense, humans are also mostly interpolators.
The instances where they aren’t interpolators have very outsized effects on the world. People seem to forget this, I’m not sure why; maybe because it’s rare, and hard to distinguish if you’re not an expert. (And on the other hand, children do the same mental motions—they’re very much not mostly interpolators—but it’s only originary and not novel, so we discount it.) See:
from https://www.lesswrong.com/posts/5tqFT3bcTekvico4d/do-confident-short-timelines-make-sense
While LLMs could be a lot better at dealing with novel concepts, I don’t think they’re currently useless for engaging with them; I get a lot of value out of using LLMs in that way.
I think a key problem LLMs currently have is the lack of good memory. They have a hard time adding new ontology that’s not in their training data and reasoning based on it. A human who does novel research and adopts a few new terms on day one of their research journey can use those terms easily on day two, while current LLMs have a hard time with that.
Note for additional clarity:
I am not saying this means AIs will never become “ASI”. I am not saying that timelines can’t be as short as 1-2 years. All of that is within the realm of possibility and despite me poking at the problems with current models, I still put decent weight on them being solved fairly soon. And even if it does take a number of years to solve that variety of capability problems, I still think AIs can be highly transformative in the next 2-3 years.
Do you think that today’s breakthrough on the planar unit distance problem is merely the model remixing things learned during pretraining? I’m not an expert, but it seems unlikely to me. Arul Shankar, a notable number theorist, stated:
And I think this much is clear by looking over the proof and supplemental materials.
In anticipation of people using that example as a response, I added a follow-up tweet to the tweet version of this post:
My understanding is that the main difficulty in arriving at this result is performing sufficient inference over the entire search space, far beyond what any human can do. And so, it is indeed superhuman in that specific capability.
But I suspect that every step to get there was within distribution, and ICL was enough to arrive at the result. However, I also suspect that there are many other types of problems that require integrating knowledge into the weights and can’t be done with long chains of “within-distribution” reasoning that sum to a new result.
So, I suspect the bottleneck in solving the problem in the past was more about human inability to search through an extremely large number of paths.
This is not to say what the model did isn’t impressive, but imo this is within the realm of problems I’d expect it to solve (which is a lot!). There are other types of problems where I expect models would need to consolidate information (in the weights) while solving the problem. Though who is to say they aren’t doing that; I don’t know for sure what OAI is doing.
FYI, I’m running on the assumption that proofs will be cheap/free in the coming years (for most applied math, likely sooner) and partially betting my startup on this, which is another reason I expected this kind of result.
I have some trouble squaring this with the increasingly excellent OOD cyber capabilities of the leading models. Is the argument that their more generalized cyber skills (relative to some fuzzier domains, like alignment) are strong because they were subjected to well-curated RL environments that taught them to hyperpolate more effectively for coding tasks?
Which OOD cyber capabilities? How do you know it’s OOD?
From Anthropic’s original assessment, the step change in Claude Mythos’s cybersecurity capabilities wasn’t just that it got much better at discovering existing bugs in software, but at creatively chaining them together into new exploits. Isn’t zero-day discovery the sort of process that is necessarily OOD?
All of that seems within-distribution to me.
In many cases, lots of security bugs that haven’t been found are simply a case of not enough effort being put into finding them. In this case, I think you could just as reasonably say that Mythos is becoming better at modeling the data distribution due to scale, and therefore ends up being better at finding these vulnerabilities.
On a related note, I’ve started to distrust Anthropic’s judgement on these things. In particular, I believe they oversold the C compiler experiment as being OOD, which I think is false.
From the Jeremy Howard podcast link I shared:
What is the thesis here? I’ve read this through and I don’t get what the point you’re trying to get across is.
I’ll try to make this clearer if I turn it into a more serious top-level post. My intent here was to just push this out since it’s been bothering me, but I have other things to do.
TLDR: Lots of researchers seem to be banking on the idea that LLMs are generalizing OOD or that scale will just solve this (whether through scale alone or scale + using the scaled model to come up with a research breakthrough that does). Lots of research and funding seem to hinge on this idea, a dependence that, imo, is underappreciated. If the concern is taken seriously, it may mean that 1) timelines are longer, 2) we should expect a fundamental reshaping of AI cognition due to LLMs’ inability to generalize OOD, and 3) we shouldn’t update much on alignment progress based on current safety research.
I shared this in the post, but more thoughts here.
This post by @Hyperion describes another natural consequence of the above with respect to RSI (that the field seems to be understating):
This made me curious whether improving LLMs’ ability to do Bayesian updates could address this. Consider a claim A to which the LLM assigns probability P(A), and let B be new information. Perhaps we can construct some kinds of questions where the LLM has to have a properly calibrated P(A|B). It’s unclear what questions these would be, but what comes to mind are forecasting questions where recent events move a prediction market (for events past the knowledge cutoff).
But I think updating one belief isn’t enough for the coherence you want. We can also maybe do some sort of consistency training, training the model to satisfy constraints like P(A and B) ≤ P(B) and the law of total probability across a whole graph of the model’s related beliefs. In effect, these two training objectives could get you a reasoner that can update in response to new information and propagate that through the rest of what it believes.
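As a rough illustration of what such a consistency signal could look like, here is a minimal Python sketch that scores a set of stated probabilities against the two constraints mentioned above: the conjunction bound and the law of total probability. Everything here is an assumption for illustration: the function names, the dictionary keys, and the choice to sum the violations into a single penalty are hypothetical, and the sketch also bounds P(A and B) by P(A) as well as P(B), which goes slightly beyond the constraint as stated. It says nothing about how one would actually elicit these probabilities from a model or turn the penalty into a training objective.

```python
from typing import Dict


def conjunction_violation(p_a_and_b: float, p_a: float, p_b: float) -> float:
    """Amount by which P(A and B) exceeds min(P(A), P(B)); 0.0 if consistent."""
    return max(0.0, p_a_and_b - min(p_a, p_b))


def total_probability_violation(
    p_a: float, p_a_given_b: float, p_a_given_not_b: float, p_b: float
) -> float:
    """Absolute gap between P(A) and P(A|B)P(B) + P(A|not B)(1 - P(B))."""
    implied_p_a = p_a_given_b * p_b + p_a_given_not_b * (1.0 - p_b)
    return abs(p_a - implied_p_a)


def consistency_penalty(beliefs: Dict[str, float]) -> float:
    """Sum both violations for one pair of claims (A, B).

    `beliefs` is a hypothetical record of probabilities a model stated,
    keyed by "A", "B", "A_and_B", "A_given_B", and "A_given_not_B".
    """
    return conjunction_violation(
        beliefs["A_and_B"], beliefs["A"], beliefs["B"]
    ) + total_probability_violation(
        beliefs["A"], beliefs["A_given_B"], beliefs["A_given_not_B"], beliefs["B"]
    )
```

A penalty of zero means the stated numbers are mutually consistent on these two checks; it does not mean they are calibrated against the world. Extending this across "a whole graph" of beliefs would mean applying the same checks to every related pair of claims, which is left out here.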
From this, I infer that “in distribution” in this context basically means “sufficiently similar to a task which the LLM has explicitly encountered/been trained on”.
I find myself wondering: If we had some magical way of quantifying the percent similarity between two tasks, how surprised would you be if one of today’s LLMs completed a task that was 99% similar to one it had explicitly been trained on? How about 80% similar? Or 50%? These are basically nonsense questions, since I’ve just picked out some magical metric whose specifications you and I don’t know. But what I’m trying to get at qualitatively, is that I’m curious about what counts as “sufficiently similar”. How does your expectation of LLM capability vary as a function of similarity to tasks that the model has already encountered/been trained on (and also as a function of what that task is about)? How do you model this expectation varying with LLM size and training time and context window size, etc? I’d like to observe that, based on the way the above post struck me, you basically treat “in/out of distribution” as a binary characteristic of a task—or at most a very coarse gradient—which seems needlessly low-fidelity.
Let me clarify that despite me not having a perfectly precise definition here, part of my goal is to point out that most of the community seem to 1) fail at being precise about what they mean and consider to be generalization, 2) overstate the novelty generated by the models.
I wanted to at least highlight that interpolated generalization and OOD generalization seem more separated than people let on.
Please read my other comments in the thread for more context, particularly the one about Mythos. They largely contain my takes on your questions.