4 Scenarios & 2 Modifiers: My Current AI-Progress Models

Wrote up my current thoughts on AI progress. I have four scenarios, bundled by how general LLM progress goes, and two other factors that might have significant effects and which are mostly independent of said general LLM progress: LLMs’ effects on math research, and time-to-a-better-AGI-paradigm.

“Cheatsheet”:

  • Are frontier LLMs already generally intelligent?

    • Yes. The paradigm is on a continuous trajectory to general superintelligence. (Scenario 1)

    • No. Is the LLM paradigm itself AGI-complete?

      • Yes; there will be a discontinuity at which LLMs become AGI. (Scenario 2)

      • No. Can LLMs “close the loop” on in-paradigm AI R&D anyway?

        • Yes. (Scenario 3)

        • No. (Scenario 4)

  • Can LLMs “solve math”, in the sense of becoming superhuman conjecture-verifiers?

    • If yes, what would be the consequences?

  • How long until someone figures out a complete/​more effective AGI paradigm?


1. LLMs and Capability Research

There’s Something Off About LLMs

There are many debates about whether LLMs are AGI-complete or not, and whether they are already basically AGIs or not. Semantics aside, I do still feel that there’s something off about them: a certain distinct capability that humans have and that they seem to be missing.

I’d previously referred to it as “sleepwalking”. For me, there’s a distinct sense that they are not “all there”. They are not looking at the actual situation, solving the actual problem in front of them; they’re interacting with some overlap of superficially similar problems whose templates they learned in training. And sure, humans are also susceptible to this:

Everyone is basically living in a dream mashup of their current external situation and whatever old emotional meanings are getting activated by the current situation.

But where humans mix up the current situation with dream-pieces, for LLMs there seems to be nothing but the dream-pieces.

One distinct way this shows up in their capabilities is “surprising LLM reasoning failures”, as outlined by @Kaj_Sotala here. Some examples off the top of my head:

  • Failing to count r’s in “strawberry”.

  • Solving variants of “I can’t operate on him, he’s my son!” incorrectly.

  • A weird inability to reason logically in some cases.

  • Sometimes failing basic math (9.8 − 9.11[1]).

  • Repeatedly claiming to have understood your explanation about a bug, then doing something absolutely unrelated in order to “fix” it.

  • Making other nonsensical choices to advance their goals.

  • An inability to understand when they’re just not accomplishing their goal.

  • Hallucinating goal-relevant information.

See some potent examples here, here, here.

Any given example of any given failure from this list may disappear in the next model generation. But this class of failures seems to persist. It also doesn’t follow e. g. METR’s task-horizon scaling laws: surely parsing a simple trick riddle has to be a “1-3 minutes” task, yet every newly released model falls for some simple trick like this, or otherwise turns out to be incapable of solving some basic problems even in domains in which it’s supposed to be competent.

Now, there exist various copes/analogies to defend this. That those failures are similar to human visual illusions or cognitive fallacies, that you’d be able to uncover similarly weird human failures if we could scrutinize humans the way we scrutinize LLMs, that LLMs are AGI-complete but with maladaptive “instincts” to complete patterns chiseled into them by the SGD...

I could argue about those explanations, but: yeah, sure, whatever. Suppose it’s one of those.

That’s still a crippling problem, no? And so far it’s stubbornly not going away. CoT reasoning doesn’t help reliably; multi-agent setups forcing agents to check each other’s work don’t help reliably. It’s not just a momentary mistake, not a parallel to a human getting distracted/unfocused for a second and messing up: if they’re given time to think, or are prompted to reflect on their answer, they often just reassert their mistake.

Which raises the question: will in-paradigm capability research[2] make it go away eventually, with enough scale and schlep?

Perhaps. E. g., perhaps it happens at Daniel Kokotajlo’s expected “infinite-length task horizons”. Perhaps, at some point, things will just “click”.

My expectation is that this moment, if it happens, would be a discontinuity. If models “wake up” at some level of capabilities, and abruptly stop exhibiting this entire class of failures, I expect that’d correspond to some qualitative change in how they function internally.

Or maybe not.

Or maybe this paradigm can never wake up.


Can LLMs Automate LLM R&D?

If LLMs aren’t sleepwalking, or if they are, but will be woken by in-paradigm improvements, then capability research can obviously be automated. It’ll proceed roughly the way the AGI labs expect.

But what if they never wake up?

My expectation is that they won’t be able to close the loop on AI R&D automation. The AGI labs will build a vast multi-agent civilization, arrange it into the rough shape of the research-and-development industry, pour oceans of compute into it, but it’ll be a cargo cult that stumbles well before takeoff.

But perhaps not.

Like, LLMs are not useless or absolutely incapable. They are able to produce useful work, sometimes; they are able to ape agent-like behavior, sometimes; and their abilities to do that are growing – perhaps at the rate of “every 4-7 months, the 50%-success task horizons double”.

And in-paradigm capability research does not necessarily require curing them of their current ailments. “In-paradigm research” by definition doesn’t require any “genuine innovation”[3], and most capability tasks have ground truths: whether the loss went down, whether the benchmarks went up, whether the math theorem was proven. So you don’t really need to trust LLM agents to be able to get themselves out of degenerate loops. If they get stuck in one, at some point that roll-out will be automatically judged unproductive and resampled; iterate until success. Something in the style of AlphaEvolve.
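
As a minimal sketch of what I mean by that loop – every name here is a hypothetical placeholder, not any lab’s actual setup:

```python
# Sketch of the "resample until success" loop described above (AlphaEvolve-style).
# Both helpers are hypothetical stand-ins: one sampled attempt by an LLM agent,
# and an automatic, non-LLM evaluation against a ground truth (loss went down,
# benchmark went up, proof checks).

def run_agent_rollout(task: str, seed: int) -> str:
    raise NotImplementedError  # one sampled attempt by the LLM agent

def ground_truth_score(task: str, attempt: str) -> float:
    raise NotImplementedError  # hard metric; does not rely on the agent's own judgment

def resample_until_improved(task: str, baseline: float, max_rollouts: int) -> str | None:
    for seed in range(max_rollouts):
        attempt = run_agent_rollout(task, seed)
        # A degenerate or stuck roll-out simply fails this check and gets resampled;
        # no need to trust the agent to notice that it's stuck.
        if ground_truth_score(task, attempt) > baseline:
            return attempt
    return None
```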

So, perhaps LLMs can automate the process of improving LLMs in the same ways LLMs have been improving since 2022, massively speeding up its pace all the way to its endpoint.

What is that endpoint?

Well. Not AGI. Recall that we’re in the “LLMs are sleepwalking and won’t wake up” scenario here. (If they’re awake, or will wake up at some point in this on-paradigm self-improvement process, go to paragraph 1 of this section.)

Whatever system this process produces will be incredibly weird. It will have a completely unintuitive, baffling mix of capabilities and incapabilities. Perhaps it’d be able to, in minutes/​hours, do any in-distribution task that’d take a human a millennium with 50% probability… and still fall for “what’s 9.9 − 9.11?” or some equivalent, and still be perplexingly incapable of coming up with a new idea to save its life (?!?!).

This would make little intuitive sense to me, but GPT-4’s mix of capabilities and incapabilities also made no intuitive sense to me, and this is a straightforward extrapolation of how things have been going.

Whatever the endpoint of LLM in-paradigm scaling is, it’ll not be an AGI. It’ll be something weird, yet still monstrously capable in some ways.

Which raises the question: will it be able to kill us?

I could see it. I don’t know that defeating humanity requires doing anything innovative. Perhaps in-distribution planning and skill use plus superhuman speeds plus billions of instances will be enough. In which case:

We should pause to note that a Clippy2 still doesn’t really think or plan. It’s not really conscious. It is just an unfathomably vast pile of numbers produced by mindless optimization starting from a small seed program that could be written on a few pages. [...] When it ‘plans’, it would be more accurate to say it fake-plans; when it ‘learns’, it fake-learns; when it ‘thinks’, it is just interpolating between memorized data points in a high-dimensional space, and any interpretation of such fake-thoughts as real thoughts is highly misleading; when it takes ‘actions’, they are fake-actions optimizing a fake-learned fake-world, and are not real actions, any more than the people in a simulated rainstorm really get wet, rather than fake-wet. (The deaths, however, are real.)

On balance, I would bet against it; and even further upstream, against LLMs being able to automate even in-paradigm capability research. But maybe.


Summing up, the scenarios:

  • Are LLMs “sleepwalking”?

    • No. The paradigm is on a continuous trajectory to general superintelligence. (1)

    • Yes. But will they wake up?

      • Yes, and that will be a discontinuity. (2)

      • No. But can they automate in-paradigm capability research anyway?

        • Yes. (3)

        • No. (4)

For ease of reference, we may dub those (1) “already awake”, (2) “impending awakening”, (3) “sleepwalking Singularity”, and (4) “dead to the world”.

Also for reference, my probability distribution over those is something like 5%, 10%, 5%, 80%, respectively.


Can LLMs Automate LLM Alignment?

I. e.: can the current AI paradigm ultimately automate in-paradigm alignment/​control research?

By which I mean things like scheming evals, “non-ambitious” mechanistic interpretability, CoT monitoring and feature-activation monitoring, and various ways of eliciting then training away undesirable behaviors.

The “AI aligning AI” scenarios roughly go as follows:

  • We use those methods to align AIs at some only-slightly-superhuman capability level where they can’t easily defeat all of our panopticon advantages.

  • Those “superhuman-1” AIs incrementally improve on this class of methods, and that plus their general superhuman-ness lets them align their successors, “superhuman-2” AIs.

  • Superhuman-2 AIs align superhuman-3 AIs, et cetera, until genuine limit-of-computation superintelligence is achieved and aligned.

How does that approach fare under various scenarios?

I don’t feel like getting into the weeds here, but some quick thoughts:

  • “Already Awake”: I think it can work in theory, but won’t work in practice. As in:

    • If LLMs are indeed just baby AGIs, with all the machinery of baby AGIs, and with no features of their high-level functioning changing from here to general superintelligence, then sure. It should be possible to use our crude control and behavior-shaping tools to force just-slightly-superhuman LLMs to do what we want, and they would be able to use their slightly less crude tools to force LLMs slightly smarter than them to do what they want, et cetera. If alignment is a straight line on a graph, it may be possible to use brute force to make it go up and to the right.

      • I. e.: whatever its other problems, under “Already Awake”, the AI-aligns-AI plan is at least not solving the wrong problem using the wrong methods under a wrong model of the world. (The way it does under all other scenarios.)

    • But I don’t think it will work in practice. It’s a Godzilla Strategy, and has all the problems of one. It plays a kind of game of telephone, where a dumb monkey at one end is trying to control a god through a chain of noisy relays. Something is going to go at least slightly wrong at some link in that chain – one of the intermediary AIs learns slightly wrong values, or makes a mistake because it’s still stupid enough to make mistakes, or there’s some other miscommunication or glitch[4] – and “controlling a god slightly incorrectly” ends the world. (See the toy calculation after this list for how even small per-link failure chances compound.)

    • (Man, I feel like I’m still sounding too complimentary towards this plan here. Recall that I’m also giving the only scenario where this doesn’t auto-fail 5%.)

  • “Impending Awakening”: This plan’s screwed in that one. The “awakening” is a discontinuity, some/​all of the alignment tools break, the “awake” next-generation AIs are qualitatively more capable than our previous-generation watchdogs, and it all falls apart.

  • “Sleepwalking Singularity”: This plan’s screwed in that one too. Capability research can maybe be automated via sleepwalking LLMs, due to easily verifiable ground truth. “Has an easily verifiable ground truth” is famously not the case for human values, so capability progress proceeds apace while alignment progress veers off a cliff.

  • “Dead-to-the-World”: LLMs can’t automate their own alignment, but it doesn’t matter, because they can’t take off anyway. Yay!
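
As a toy illustration of the “game of telephone” point above – the numbers are purely illustrative assumptions, not estimates – if each handoff in the chain independently preserves alignment with probability 1 − ε, the chance that nothing goes wrong across k handoffs is

$$
(1-\varepsilon)^{k}, \qquad \text{e.g. } \varepsilon = 0.05,\ k = 10 \;\Rightarrow\; 0.95^{10} \approx 0.60,
$$

and that is with a generously small per-link failure chance and full independence between links.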


2. LLMs and Math

Can LLMs Do Math?

Consider giving an LLM a well-posed fully formal theorem and asking it to produce its proof/disproof; or to run some other well-posed calculation. Suppose that it’s able to do that. Suppose that, on this type of task, given a sufficiently ridiculous yet practically achievable amount of compute, it’s able to do in hours the amount of work that’d take a human mathematician a decade.

That is: for all intents and purposes, formal math has been automated.
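
To make “well-posed fully formal” concrete, here’s a toy illustration in Lean 4 (my example, not anything from a benchmark): the statement is machine-readable, and any candidate proof either passes the kernel’s check or it doesn’t – exactly the kind of ground truth that makes this automatable.

```lean
-- A toy "well-posed fully formal" problem: the goal is stated precisely, and the
-- Lean kernel verifies any proposed proof with no human judgment in the loop.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```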

This does not necessarily require AGI, because it does not necessarily require coming up with new ideas.

Intuitively, it should. Intuitively, a math task with that large a “time horizon” would involve coming up with various new concepts, thoroughly understanding them, developing a conceptual “felt sense” for them, et cetera. There would be many new ideas, perhaps entire new math domains, left in the wake of this effort.

But that’s our intuitions for how humans do math. LLMs already seem to be doing it a different way.

Humans are bottlenecked on working memory and on long-term memory. We can only keep 7±2 concepts in our mind at once, and we only have as much crystallized intelligence/​domain knowledge as could be learned in the human lifetime at human learning speeds.

LLMs are different. Their loose equivalent of “working memory” – context length – is easy to scale. While their ability to manage complexity within that context lags behind, it’s something that also seems possible to incrementally improve with scale & schlep. And they obviously have a much broader set of crystallized heuristics, a much broader knowledge base on which they could easily draw.

This means that:

  • They can solve a much broader range of math problems without engaging in any novel/“creative” problem-solving. In places where a human would have needed to resort to that, they can draw on some obscure theorem/trick perfectly suited for the problem, or even straight-up lift the answer from their memory of some StackExchange page.

  • The set of math problems they can solve this way steadily grows, as the number of known math-pieces they’re able to recall and keep in their working memory grows.

See some discussion here, here and here.

For me, the question was never about whether, in principle, this trick could scale to advancing math research.[5] The question was how well it would scale: whether they’d become better at it than the leading human mathematicians, and if yes, how much better they can get before compute scaling slows down.

Jury’s still out on “how much better”, but there are some reports (e. g., this, this) that they’re already becoming increasingly useful for frontier research...

… precisely in the “literature review”/​”proof assistance” capacity. Not in the “come up with new math ideas” capacity.

And indeed: the problem of “invent information theory/Bayesian probability theory/category theory/quantum mechanics” seems qualitatively different from the problem of “is the formal expression A tautologous to the formal expression B?”. And unlike human mathematicians, LLMs can get much better at solving difficult problems of the latter type without being able to solve problems of the former type.

So:

  • I think LLMs can basically automate formal math without becoming AGIs.

  • This is possible under all scenarios, including the “dead-to-the-world” one.

  • Under the more bullish-on-LLMs scenarios, this is likely to happen much earlier than general superintelligence.


Consequences of Math Automation

What would be the consequences? I have no idea. To quote my earlier comment:

One difficulty with predicting the impact of “solving math” on the world is the Jevons effect (or a kind of generalization of it). If posing a problem formally becomes equivalent to solving it, it would have effects beyond just speeding up existing fully formal endeavors. It might potentially create qualitatively new industries/​approaches relying on cranking out such solutions by the dozens.

E. g., perhaps there are some industries which we already can fully formalize, but which still work in the applied-science regime, because building the thing and testing it empirically is cheaper than hiring a mathematician and waiting ten years. But once math is solved, you’d be able to effectively go through dozens of prototypes per day for, say, $1000, while previously, each one would’ve taken six months and $50,000.

Are there such industries? What are they? I don’t know, but I think there’s a decent possibility that merely solving formal math would immediately make things go crazy.

An obvious potential application is for AI R&D. Math in ML papers is often there just to make them look impressive, but perhaps math automation would actually make theory-based DL capability research more resource-efficient than the empirical approach.

Another potential way to exploit this is greatly speeding up the formalization of various fields by doing babble-and-prune search through theorem-space. As in: if you want to formalize some field (such as agent foundations), you can come up with some possible formalization and theorems based on it, then ask an LLM whether the theorems work. If they don’t, you babble a different formalization at it. Eventually, you may stumble on a formalization which unfolds into a rich structure of elegantly interlocking theorems, and that would probably be the “true” theory of whatever-you’re-formalizing. So: this could potentially speed up even non-paradigmic research, by semi-blindly finding a paradigm. (But what important fields are non-paradigmic, besides agent foundations? I don’t know.)
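
A minimal sketch of that babble-and-prune loop – every function here is a hypothetical placeholder; I’m not assuming any real “ask the LLM whether the theorem holds” API exists yet:

```python
# Sketch of the babble-and-prune search through theorem-space described above.
# Both helpers are hypothetical stand-ins: "babble" a candidate formalization,
# then "prune" by asking the (assumed) math-automating LLM to verify its conjectures.

def propose_formalization(field: str) -> dict:
    """Babble: candidate definitions plus the theorems we'd expect to hold."""
    raise NotImplementedError

def verify(formalization: dict, conjecture: str) -> bool:
    """Prune: check one conjectured theorem against the candidate formalization."""
    raise NotImplementedError

def babble_and_prune(field: str, budget: int) -> dict | None:
    best, best_score = None, -1
    for _ in range(budget):
        candidate = propose_formalization(field)
        # A formalization that "unfolds into a rich structure" is one where many
        # of its conjectured theorems actually check out.
        score = sum(verify(candidate, c) for c in candidate["conjectures"])
        if score > best_score:
            best, best_score = candidate, score
    return best
```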

So, things to track:

  • How well math automation scales. (What’s the largest N for which the following statement is true?: “LLMs would be able to compress N seconds of a mathematician’s work into hours using the amounts of compute realistically available in the near future.”)

  • What effects automating math would have on the world.


3. AGI by Non-LLM Means

I’ll be brief here; see Steven Byrnes’ Foom & Doom posts (one, two) for more.

I think it’s very plausible that there’s a yet-to-be-discovered non-LLM paradigm which would make it possible to produce a general superintelligence using dramatically less compute and data than LLMs.

I think it’s true under all scenarios from Section 1:

  • Under “Already Awake” and “Impending Awakening”, the new paradigm may or may not be discovered before LLMs reach general superintelligence. (If not before, it’d be discovered shortly afterwards.)

  • Under “Sleepwalking Singularity”, it may or may not be discovered before the animated corpuses kill us all.

  • Under “Dead-to-the-World”, it’s the only way we die to AI at all.

How soon will that happen? Unlike with extrapolating LLM progress, this may not be easily predictable at all. It would require a theoretical breakthrough, and those seem to follow a memoryless exponential distribution; “fusion is always 20 years away”.
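
For reference, “memoryless” cashes out as: having already waited tells you nothing about how much longer you will wait. For an exponentially distributed waiting time T with rate λ,

$$
P(T > s + t \mid T > s) = P(T > t), \qquad \mathbb{E}[T - s \mid T > s] = \mathbb{E}[T] = 1/\lambda,
$$

which is exactly the “always 20 years away” phenomenon: the expected remaining wait doesn’t shrink just because time has passed.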

To faux-formalize it: we’re operating under Knightian uncertainty regarding the shape of the concept-space through which we need to pass to achieve a theoretical breakthrough, and our estimates regarding how long that would take are always vibes-based “seems hard”, “seems moderately difficult”, etc.

So, there’s no legible way to derive a number. Vibes-wise:

  • Upper bound: 10 years seems implausibly long; I don’t think there are that many missing pieces.

  • Lower bound: I dunno, 0-5 years? Depends on how I’m feeling on any given day.

    • Maybe one stupid trick stapled on top of the current paradigm works, in which case it may have already happened yesterday.

    • Maybe major architecture and training-loop changes are needed that’d have to be scaled up/optimized, in which case it can’t be less than five years.

Those estimates interact with other variables, and may end up compressed if:

  • LLMs become capable of independent superhuman research.

  • LLMs become very useful research assistants.

  • LLMs automate math, and it turns out very useful for non-paradigmic research.


4. Summary

On my model, there are six core questions to track, grouped into three topics:

  • Is the LLM paradigm AGI-complete?

    • If it is, are frontier LLMs already generally intelligent?

    • If it isn’t, can it automate LLM research anyway?

  • How well can LLMs automate formal math research given realistic computational resources, in terms of “math-research task horizons”?

    • What consequences would automating math have on the world?

  • How long do we have until someone figures out a better AGI paradigm?

Bundling up by possible answers to the first three questions, we get four scenarios:

  • “Already Awake”: the current frontier LLMs are generally intelligent.

    • Capabilities: The progress should proceed in an on-trend way, the way Ryan Greenblatt outlines here. (15% probability of full AI R&D automation by the start of 2029, 45% probability by the start of 2033.)

    • AI-aligns-AI: I think the plan is not conceptually confused under this scenario, but too unreliable to work in practice.

  • “Impending Awakening”: the current frontier LLMs will soon become generally intelligent.

    • Capabilities: The progress should proceed in an on-trend way, up until a sudden discontinuity at which the GI algorithm is “grokked”. It will then proceed much faster than previously projected. This may happen well before 2030; maybe next year, in line with the more bullish estimates of capability researchers.

    • AI-aligns-AI: I think it breaks disastrously at the moment of “awakening”.

  • “Sleepwalking Singularity”: The LLM paradigm is not AGI-complete, but can automate LLM R&D.

    • Capabilities: The progress proceeds in an on-trend way up until LLMs close the LLM research loop, at which point it… fast-forwards to its weird counter-intuitive not-AGI-but-superintelligent endpoint. That endpoint may or may not be capable of omnicide.

    • AI-aligns-AI: I think it doesn’t work. Under this scenario, LLMs are only able to do LLM R&D because it has easily verifiable ground-truth rewards, which is not the case for even on-paradigm alignment research.

  • “Dead-to-the-World”: LLMs are not AGI-complete and can’t automate even LLM R&D.

    • Capabilities: LLMs aren’t able to autonomously-and-reliably complete any task. No Singularity can be built on their back, because humans remain the rate-limiting step. Research, including non-LLM AI R&D, may nevertheless be greatly sped up, due to LLMs automating formal math, literature search, and large chunks of software engineering.

    • AI-aligns-AI: Irrelevant with regards to LLMs; unknowable with regards to the new paradigm.

I should probably assign probabilities to those or something. Uhh...[6]

  • Most of my probability mass is still on “Dead-to-the-World” and the more underwhelming sub-scenarios of “Sleepwalking Singularity” (where LLMs’ endpoint is really disappointing and not omnicide-capable). I think we have yet to see LLMs show any ability to do anything innovative; all the ostensible examples so far turn out, on closer examination, to be not what they claim. Call it 80%.

  • The more exciting variants of “Sleepwalking Singularity” seem implausible; call it 5%. (Probably too high.)

  • “Impending Awakening” is uncomfortably plausible, but grows less plausible with each further scale-up. Call it 10%.

  • “Already Awake”: Pretty sure they’re not; the remaining 5%.

  1. ^

    Apparently it’s because they confuse those with dates?

  2. ^

    Scaling, higher-quality data, more RL environments, tinkering with the details of the transformer architecture and loss/​reward functions, et cetera.

  3. ^

    Which is another thing LLMs have so far been utterly inept at. (There’s been some controversy about that in regards to math; I’ll get to that in the corresponding section.) And no, I can’t define “innovation” precisely. But whatever it is, they seem to suck at it.

  4. ^

    Recall that LLMs don’t suddenly become incredibly reliable here; then go look up all the baffling, unpredictable ways they currently mess up.

  5. ^

    Here’s some proof that I’m not goalpost-moving here.

  6. ^

    In advance of some possible nitpicking: I’m not going to stand by those exact numbers, they’re to illustrate what I mean by vague words like “most” and “implausible”. In particular, I expect they’d be moderately inconsistent with the numbers I’d get if I assigned probabilities to different ways the “core questions” could turn out.
