I’m broadly interested in AI strategy and want to figure out the most effective interventions to get good AI outcomes.
Thomas Larsen
Thanks for writing this paper.
Why do we need to halt for so long? In short, AI alignment is probably a difficult technical problem, and it is hard to be confident about solutions. Pausing for a substantial period gives humanity time to be careful in this domain rather than rushing. Pausing for a shorter amount of time (e.g., 5 years) might reduce risk substantially compared to the current race, but it also might not be enough. In general, world leaders should weigh the likelihood and consequence of different risks and benefits against each other for different lengths of a pause. Section 2 discusses some of the reasons why the AI alignment problem may be difficult. Generally, experts vary in their estimates of the difficulty of this problem and the likelihood of catastrophe, with some expecting the problem to be very hard [Grace et al., 2025, ControlAI, 2025, Wikipedia, 2025]. Given this uncertainty about how difficult this problem is, we should prepare to pause for a long time, in case more effort is needed. Our agreement would allow for a long halt, even if world leaders later came to believe a shorter one was acceptable. We also contend that there are other problems which need to be addressed during a halt even if one presumes that alignment can be quickly solved, and these problems are also of an uncertain difficulty. These include risks of power concentration, human misuse of AIs, mass-unemployment, and many more. World leaders will likely want at least years to understand and address these problems. The international agreement proposed in this paper is primarily motivated by risks from AI misalignment, but there are numerous other risks that it would also help reduce.
I agree with a lot of this, but I do think this paper equivocates a bit between “we need to halt for decades” and “we might need to halt for decades”. I agree with the latter but not the former.
I also think that in the cases where alignment is solvable sooner, it might matter a lot that we accelerated alignment in the meantime.
I get that it’s scary to have to try to bifurcate alignment and capabilities progress because governments are bad at stuff, but I think it’s a mistake to ban AI research, because it will have very negative consequences on the rate of AI alignment research. I think that we should try hard to figure out what can be done safely (e.g. via things like control evals), and then do alignment work on models that we can empirically study that are as capable as possible while incurring minimal risks.
Serial time isn’t the only input that matters: having smarter AIs is helpful as research assistants and to do experiments directly on the smarter AIs, having lots of compute to do alignment experiments is nice, having lots of money and talent going into AI alignment is helpful. I think you guys should emphasize and think about the function you are trying to maximize more clearly (i.e. how much do you really care about marginal serial time vs marginal serial time with smart AIs to do experiments on).
Thanks, I thought this was a helpful comment. Putting my responses inline in case it’s helpful for people. I’ll flag that I’m a bit worried about confirmation bias / digging my heels in: would love to recognize it if I’m wrong.
How bad is Chinese Superintelligence? For some people, it’s a serious crux whether a China-run superintelligence would be dramatically worse in outcome than a democratic country.
I don’t think this is a central crux for me. I would say that it’s worse, but I’m willing to make concessions here in order to make alignment more likely to go well.
“The gameboard could change in all kinds of bad ways over 30 years.” Nations or companies could suddenly pull out in a disastrous way. If things go down in the near future there’s fewer actors to make deals with and it’s easier to plan things out.
This is the main thing for me. We’ve done a number of wargames of this sort of regime and the regime often breaks down (though there are things that can be done to make it harder to leave the regime, which I’m strongly in favor of).
Can we leverage useful work out of significantly-more-powerful-but-nonsuperhuman AIs? Especially since “the gameboard might change a lot”, it’s useful to get lots of safety research done quickly, and it’s easier to do that with more powerful AIs. So, it’s useful to continue to scale up until we’ve got the most powerful AIs we can confidently control. (Whereas Controlled Takeoff skeptics tend to think AI that is capable of taking on the hard parts of AI safety research will already be too dangerous and untrustworthy.)
Yep, I think we plausibly can leverage controlled AIs to do existentially useful work. But I’m not confident, and I am not saying that control is probably sufficient. I think superhuman isn’t quite the right abstraction (as I think it’s pretty plausible we can control moderately superhuman AIs, particularly if they are superhuman only in certain domains), but that’s a minor point. I think Plan A attempts to be robust to the worlds where this doesn’t work by just pivoting back to human intelligence augmentation or whatever.
Is there a decent chance an AI takeover is relatively nice? Giving the humans the Earth/solar system is just incredibly cheap from a percentage-of-resources standpoint. This does require the AI to genuinely care about and respect our agency in a sort of complete way. But, it only has to care about us a pretty teeny amount.
This is an existential catastrophe IMO and should be desperately avoided, even if they do leave us a solar system or w/e.
And then, the usual “how doomed are current alignment plans?”. My impression is “Plan A” advocates are usually expecting a pretty good chance things go pretty well if humanity is making like a reasonably good faith attempt at controlled takeoff, whereas Controlled Takeoff skeptics are typically imagining “by default this just goes really poorly, you can tell because everyone seems to keep sliding off understanding or caring about the hard parts of the problem”.
I think the thing that matters here is the curve of “likelihood of alignment success” vs “years of lead time burned at takeoff”. We are attempting to do a survey of this among thinkers in this space who we most respect on this question, and I do think that there’s substantial win equity moving from no lead time to years or decades of lead time. Of course, I’d rather have higher assurance, but I think that you really need to believe the very strong version of “current plans are doomed” to forego Plan A. I’m very much on board with “by default this goes really poorly”.
Three cruxes I still just don’t really buy as decision-relevant:
“We wouldn’t want to pause 30 years, and then do a takeoff very quickly – it’s probably better to do a smoother takeoff.” Yep, I agree. But, if you’re in a position to decide-on-purpose how smooth your takeoff is, you can still just do the slower one later. (Modulo “the gameboard could change in 30 years”, which makes more sense to me as a crux). I don’t see this as really arguing at all against what I imagined the Treaty to be about.
huh, this one seems kinda relevant to me.
“We need some kind of exit plan, the MIRI Treaty doesn’t have one.” I currently don’t really buy that Plan A has more of one than the MIRI Treaty. The MIRI treaty establishes an international governing body that makes decisions about how to change the regulations, and it’s pretty straightforward for such an org to make judgment calls once people have started producing credible safety cases. I think imagining anything more specific than this feels pretty fake to me – that’s a decision that makes more sense to punt to people who are more informed than us.
If the international governing body starts approving AI development, then aren’t we basically just back in the Plan A regime? Ofc I only think that scaling should happen once people have credible safety cases. I just think control-based safety cases are sufficient. I think that we can make some speculations about what sorts of safety cases might work and which ones don’t. And I think that the fact that the MIRI treaty isn’t trying to accelerate prosaic safety, and in fact substantially slows it down, is a major point against it, which is reasonable to summarize as them not having a good exit plan.
I’m very sympathetic to pausing until we have uploads / human intelligence augmentation, that seems good, and I’d like to do that in a good world.
Shutdown is more politically intractable than Controlled Takeoff. I don’t currently buy that this is true in practice. I don’t think anyone is expecting to immediately jump to either a full-fledged version of Plan A, or a Global Shutdown. Obviously, for the near future, you try for whatever level of national and international cooperation you can get, build momentum, do the easy sells first, etc. I don’t expect, in practice, Shutdown to be different from “you did all of Plan A, and, then, took like 2-3 more steps, and by the time you’ve implemented Plan A in its entirety, it seems crazy to me to assume the next 2-3 steps are particularly intractable.”
I totally buy “we won’t even get to a fully fledged version of Plan A”, but, that’s not an argument for Plan A over Shutdown.
It feels like people are imagining “naive, poorly politically executed version of Shutdown, vs some savvily executed version of Plan A.” I think there are reasonable reasons to think the people advocating Shutdown will not be savvy. But, those reasons don’t extend to “insofar as you thought you could savvily advocate for Plan A, you shouldn’t be setting your sights on Shutdown.”
This one isn’t a crux for me I think. I do probably think it’s a bit more politically intractable, but even that’s not obvious because I think shutdown would play better with the generic anti-tech audience, while Plan A (as currently written) involves automating large fractions of the economy before handoff.
I think I mostly am on board with this comment. Some thoughts:
Before I did a rapid-growth of capabilities, I would want a globally set target of “we are able to make some kind of interpretability strides or evals that make us better able to predict the outcome of the next training run.”
This feels a bit overly binary to me. I think that understanding-based safety cases will be necessary for ASI. But behavioral methods seem like they might be sufficient beforehand.
I don’t know what you mean by “rapid growth”. It seems like you might be imagining the “shut it all down → solve alignment during pause → rapidly scale after you’ve solved alignment” plan. I think we probably should never do a “rapid scaleup”.
Another reaction I have is that a constraint to coordination will probably be “is the other guy doing a blacksite which will screw us over”. So I think there’s a viability bump at the point of “allow legal capabilities scaling at least as fast as the max size blacksite that you would have a hard time detecting”.
I would want to do at least some early global pause on large training runs, to check if you are actually capable of doing that at all. (in conjunction with some efforts attempting to build international goodwill about it)
So I think this paragraph isn’t really right, because ‘slowdown’ != ‘pause’, and slowdowns might still be really really helpful and enough to get you a long way.
One of the more important things to do as soon as it’s viable, is to stop production of more compute in an uncontrolled fashion. (I’m guessing this plays out with some kind of pork deals for nVidia and other leaders[2], where the early steps are ‘consolidate compute’, and then them producing the chips that are more monitorable, and which they get to make money from, but also are sort of nationalized). This prevents a big overhang.
I actually currently think that you want to accelerate compute production, because hardware scaling seems safer than software scaling. I’m not sure exactly what you mean by “in an uncontrolled fashion”. If you mean “have a bunch of inspectors making sure the flow of new chips isn’t being smuggled to illegal projects”, then I agree with this. On my initial read I thought you meant something like “pause chip production until they start producing GPUs with HEMs in them”, which I think is probably bad.
In other words I think that you want to create a big compute overhang during a pause. The downside is obvious, but the upsides are:
compute is controllable, far more than software, and so differentially advances legal projects.
more compute for safety. We want to be able to pay a big safety tax, more compute straightforwardly helps.
extra compute progress funges against software progress, which is scarier.
compute is destroyable (e.g. we can reverse and destroy compute, if we want to eat an overhang), but software progress mostly isn’t (you can’t unpublish research).
(this comment might be confusing because I typed it quickly, happy to clarify if you want)
One framing that I think might be helpful for thinking about “Plan A” vs “shut it all down” is: “Suppose that you have the political will for an n-year slowdown, i.e. after n years, you are forced to handoff trust to superhuman AI systems (e.g. for n = 5, 10, 30). What should the capability progression throughout the slowdown be?” This framing forces a focus on the exit condition / plan to do handoff, which I think is an underdiscussed weakness of the “shut it all down” plan.
I think my gut reaction is that the most important considerations are: (i) there are a lot of useful things you can do with the AIs, so I want more time with the smarter AIs, and (ii) I want to scale through the dangerous capability range slowly and with slack (as opposed to at the end of the slowdown).
this makes me think that particularly for a shorter slowdown (e.g. 5 years), you want to go fast at the beginning (e.g. scale to ~max controllable AI over the first year or two), and then elicit lots of work out of those AIs for the rest of the time period.
A key concern for the above plan is that govts/labs botch the measurement of “max controllable AI”, and scale too far.
But it’s not clear to me how a further delay helps with this, unless you have a plan for making the institutions better over time, or pursuing a less risky path (e.g. ignoring ML and doing human intelligence augmentation).
Going slower, on the other hand, definitely does help, but requires not shutting it all down.
More generally, it seems good to do something like “extend takeoff evenly by a factor of n”, as opposed to something like “pause for n-1 years, and then do a 1 year takeoff”.
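A minimal toy sketch of this comparison, under loudly stated assumptions: capability is a single number between 0 and 1, the baseline takeoff is linear and takes 1 year, and the "dangerous capability range" is an arbitrary band chosen for illustration. None of these numbers come from the discussion above; the point is only to show how the two schedules differ in time spent inside a given capability band.

```python
def stretched(t, n):
    """Capability (0..1) at year t when a 1-year linear takeoff is stretched evenly over n years."""
    return min(max(t / n, 0.0), 1.0)


def pause_then_takeoff(t, n):
    """Capability at year t when we pause for n-1 years, then do the 1-year takeoff."""
    return min(max(t - (n - 1), 0.0), 1.0)


def years_in_band(schedule, n, lo, hi, steps=100_000):
    """Approximate the years spent with capability in [lo, hi) over the n-year window."""
    dt = n / steps
    # Midpoint sampling: count each small time slice whose capability falls in the band.
    return sum(dt for i in range(steps) if lo <= schedule((i + 0.5) * dt, n) < hi)


n = 10               # hypothetical slowdown length in years
lo, hi = 0.6, 0.9    # hypothetical "dangerous capability range" (assumption)

stretched_years = years_in_band(stretched, n, lo, hi)
paused_years = years_in_band(pause_then_takeoff, n, lo, hi)
print(f"stretched takeoff: {stretched_years:.2f} years in band")
print(f"pause then takeoff: {paused_years:.2f} years in band")
```

In this toy model the stretched schedule spends n times as long in any capability band as the pause-then-takeoff schedule, which is the "slack" being argued for. A real capability progression would not be linear or one-dimensional, so this is an illustration of the framing rather than a forecast.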
I am sympathetic to shut it all down and go for human augmentation: I do think this reduces AI takeover risk a lot, but this requires a very long pause, and it requires our institutions to bet big on a very unpopular technology. I think that convincing governments to “shut it all down” without an exit strategy at all seems quite difficult as well.
Ofc this framing also ignores some important considerations, e.g. choices about the capability progression affect both the difficulty of enforcement/verification (in both directions: AI lie detectors / AI verification are helpful, while making AIs closer to the edge is a downside) and willingness to pay over time (e.g. scary demos or AI for epistemics might help increase WTP).
However, I also think that open agency approaches to transparency face two key difficulties: competitiveness and safety-of-the-components.[18]
I think a third key difficulty with this class of approaches is something like “emergent agency”, i.e. that each of the individual components seems to be doing something safe, but when you combine several of the agents, you get a scary agent. Intuition pump: each of the weights in a NN is very understandable (it’s just a number) and not doing dangerous scheming, but if you compose them it might be scary. Analogously, each of the subagents in the open agency AI might not be scheming, but a collection of these agents might be scheming.
Understanding the communications between the components seems like it may or may not be sufficient to mitigate this failure mode. If the understanding is “local”, i.e. looking at a particular chain of reasoning and verifying that it is valid, this is probably not sufficient to mitigate the problem, as scary reasoning might be made up of a bunch of small chains of local valid reasoning that looks safe. So I think you want something like a reasonable global picture of the reasoning that the open agent is doing in order to mitigate “emergent agency”.
I think this is kind of related to types of the “safety of the components” failure mode you talk about, particularly in the analogue to the corporation passing memos around, but the memos not corresponding to the “real reasoning” going on. However, it could be that the “real reasoning” emerges on a higher level of abstraction than the individual agents.
This sort of threat model leads me to think that if we’re aiming for this sort of open agency, we shouldn’t do end-to-end training of the whole system, lest we incentivize “emergent agency”, even if we don’t make the individual components less safe.
One upside of shut it all down is that it does in fact buy more time: in Plan A it is difficult to secure algorithmic secrets without extremely aggressive security measures, hence any rogue projects (e.g. nation state blacksites) can just coast off the algos developed by the verified projects. Then, a few years in, they fire up their cluster and try to do an intelligence explosion with the extra algorithmic progress.
>superintelligence
Small detail: My understanding of the IABIED scenario is that their AI was only moderately superhuman, not superintelligent
This post seems systematically too slow to me, and to underrate the capabilities of superintelligence. One particular point of disagreement:
It seems reasonable to use days or weeks as an upper bound on how fast robot doublings could become, based on biological analogies. This is very fast indeed.
When I read this, I thought this would say “lower bound”. Why would you expect evolution to find globally optimal doubling times? This reads to me a bit like saying that the speed of a cheetah or the size of a blue whale will be an upper bound on the speed/size of a robot. Why???
The case for lower bound seems clear: biology did it, probably a superintelligence could design a more functional robot than biology.
Small typo: Alog(B) = log(B^A), not log(A^B)
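A quick numerical sanity check of the identity, using arbitrary illustrative values for A and B (they are assumptions, not values from the post being corrected):

```python
import math

A, B = 3.0, 7.0  # arbitrary illustrative values; B must be positive

lhs = A * math.log(B)
correct = math.log(B ** A)  # log(B^A)
wrong = math.log(A ** B)    # log(A^B)

print(math.isclose(lhs, correct))  # True
print(math.isclose(lhs, wrong))    # False
```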
Also there’s a good chance AI gov won’t work, and labs will just have a very limited safety budget to implement their best guess mitigations. And even if AI gov does work and we get a large budget, we still need to actually solve alignment.
Thanks for writing this!
For what it’s worth, my view is that we’re very likely to be wrong about the specific details in both of the endings; they are obviously super conjunctive. I don’t think that there’s any way around this because we can be confident AGI is going to cause some ex-ante surprising things to happen.
Also, this scenario is around 20th percentile timelines for me; my median is early 2030s (though other authors disagree with me). I also feel much more confident about the pre-2027 scenario than about the post-2027 scenario.
Is your disagreement that you think AGI will happen later, or that you think the effects of AGI on the world will look very different, or both? If it’s just the timelines, we might have fairly similar views.
This wasn’t intended to be humor. In the scenario, we write:
(To avoid singling out any one existing company, we’re going to describe a fictional artificial general intelligence company, which we’ll call OpenBrain. We imagine the others to be 3–9 months behind OpenBrain.)
I think that OpenAI, GDM, and Anthropic are in the lead and are the most likely to be ahead, with similar probability.
Thank you! We actually tried to write one that was much closer to a vision we endorse! The TLDR overview was something like:
Both the US and Chinese leading AGI projects stop in response to evidence of egregious misalignment.
Sign a treaty to pause smarter-than-human AI development, with compute based enforcement similar to ones described in our live scenario, except this time with humans driving the treaty instead of the AI.
Take time to solve alignment (potentially with the help of the AIs). This period could last anywhere between 1-20 years. Or maybe even longer! The best experts at this would all be brought in to the leading project, many different paths would be pursued (e.g. full mechinterp, Davidad moonshots, worst case ELK, uploads, etc).
Somehow, do a bunch of good governance interventions on the AGI project (e.g. transparency on use of the AGIs, no helpful-only access for any one party, a formal governance structure where a large number of diverse parties are all represented).
This culminates with aligning an AI “in the best interests of humanity” whatever that means, using a process where a large fraction of humanity is engaged and has some power to vote. This process might look something like giving each human some of the total resources of space and then doing lots of bargaining to find all the positive sum trades, with some rules against blackmail / using your resources to cause immense harm.
Unfortunately, it was hard to write this out in a way that felt realistic.
The next major project I focus on will likely be thinking through the right governance interventions here to make that happen. I’m probably not going to do this in scenario format (and instead something closer to normal papers and blog posts), but would be curious for thoughts.
AI 2027: What Superintelligence Looks Like
Thanks—I see, I was misunderstanding.
Proposal part 1: Shoggoth/Face Distinction: Instead of having one model undergo agency training, we have two copies of the base model work together, specializing in different parts of the job, and undergo training together. Specifically we have the “shoggoth” copy responsible for generating all the ‘reasoning’ or ‘internal’ CoT, and then we have the “face” copy responsible for the ‘actions’ or ‘external’ outputs. So e.g. in a conversation with the user, the Shoggoth would see the prompt and output a bunch of reasoning token CoT; the Face would see the prompt + all that CoT and then output a message to pass to the user, and then the process would repeat.
Proposal part 2: Blind the Evaluation Process to the Internal Reasoning: Remember that reward process that evaluates overall performance? Hide from it all the outputs of the Shoggoth. Just show it the outputs from the Face.
I don’t see how this proposal could get you AGI-level capabilities. You need the “shoggoth” part of the model to gain general purpose capabilities somehow, and if you are completely shielding the shoggoth from RL, all of the capabilities have to come from pre-training, which seems unlikely. My guess is that a more realistic proposal involves having separate reward models for the two parts:
The shoggoth RM, which only rewards capabilities
The face RM, which also includes alignment-relevant stuff
But it isn’t clear to me how to separate out the alignment/capabilities. IIUC the hope is that the capabilities training would naturally incentivize the shoggoth to have bad goals, and perhaps instrumentally deceive the humans, but to do this deception in the CoT in English so we can see.
You later write:
How it all (hopefully) fits together: First, because of the paraphraser, the Shoggoth will not learn to speak in some optimized alien language, or even to speak in jargon or euphemisms—because semantics not recognized by the paraphraser will be scrambled by it. So the Shoggoth will continue to ‘think in English’ even as it undergoes lots of agency training and learns substantial new skills.
Second, because the training process is blind to the outputs of the Shoggoth (and instead only sees what the Face outputs), there should be no pressure on the Shoggoth to learn to deceive, conceal, dissemble, doublethink, etc. Instead the Shoggoth will be ‘free’ to put all its important thoughts in plain English.
Which suggests we are training the shoggoth with agency training. But I don’t understand how this is consistent with “the training process is blind to the outputs of the Shoggoth”. Did you mean just some “alignment” subset of the training process?
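To make the question concrete, here is a minimal sketch of the data flow in the quoted proposal as I read it. Every name in it (`run_episode`, `toy_shoggoth`, `toy_face`, `toy_reward_model`) is a hypothetical stand-in: the real components would be language models and a learned or scripted evaluator. The sketch only shows which component sees which text; it deliberately leaves open which model's weights get updated from the reward, since that is exactly the part that is unclear.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]  # hypothetical stand-in for a language model

def run_episode(
    shoggoth: Model,
    face: Model,
    reward_model: Callable[[str, str], float],
    prompt: str,
) -> Tuple[str, str, float]:
    cot = shoggoth(prompt)                 # Shoggoth sees the prompt, writes the internal CoT
    answer = face(prompt + "\n" + cot)     # Face sees prompt + CoT, writes the external output
    reward = reward_model(prompt, answer)  # evaluator sees only the Face output, never the CoT
    return cot, answer, reward

# Toy stand-ins (assumptions for illustration only).
seen_by_reward_model: List[Tuple[str, str]] = []

def toy_shoggoth(prompt: str) -> str:
    return "THINK: " + prompt.upper()

def toy_face(context: str) -> str:
    return "Final: " + str(len(context))

def toy_reward_model(prompt: str, answer: str) -> float:
    seen_by_reward_model.append((prompt, answer))  # record what the evaluator was shown
    return float(len(answer))

cot, answer, reward = run_episode(toy_shoggoth, toy_face, toy_reward_model, "hello")
print(answer, reward)
```

Note that even in this sketch the reward depends on the CoT indirectly, because the Face conditions on it. Whether that scalar is then used to update the Shoggoth, the Face, or both is the design choice the question above is asking about.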
I think a problem with all the proposed terms is that they are all binaries, and one bit of information is far too little to characterize takeoff:
One person’s “slow” is >10 years, another’s is >6 months.
The beginning and end points are super unclear; some people might want to put the end point near the limits of intelligence, some people might want to put the beginning point at >2x AI R&D speed, some at 10x, etc.
In general, a good description of takeoff should characterize capabilities at each point on the curve.
So I don’t really think that any of the binaries are all that useful for thinking or communicating about takeoff. I don’t have a great ontology for thinking about takeoff myself to suggest instead, but in communication I generally try to just define a start and end point and then say quantitatively how long this might take. One of the central ones I really care about is the time between wakeup and takeover capable AIs.
wakeup = “the first period in time when AIs are sufficiently capable that senior government people wake up to incoming AGI and ASI”
takeover capable AIs = “the first time there is a set of AI systems that are coordinating together and could take over the world if they wanted to”
The reason to think about this period is that (kind of by construction) it’s the time where unprecedented government actions that matter could happen. And so when planning for that sort of thing this length of time really matters.
Of course, the start and end times I think about are both fairly vague. They also aren’t purely a function of AI capabilities, and they care about stuff like “who is in government” and “how capable our institutions are at fighting a rogue AGI”. Also, many people believe that we never will get takeover capable AIs even at superintelligence.
Perhaps that was overstated. I think there is maybe a 2-5% chance that Anthropic directly causes an existential catastrophe (e.g. by building a misaligned AGI). Some reasoning for that:
I doubt Anthropic will continue to be in the lead because they are behind OAI/GDM in capital. They do seem around the frontier of AI models now, though, which might translate to increased returns, but it seems like they do best on very short timelines worlds.
I think that if they could cause an intelligence explosion, it is more likely than not that they would pause for at least long enough to allow other labs into the lead. This is especially true in short timelines worlds because the gap between labs is smaller.
I think they have much better AGI safety culture than other labs (though still far from perfect), which will probably result in better adherence to voluntary commitments.
On the other hand, they haven’t been very transparent, and we haven’t seen their ASL-4 commitments. So these commitments might amount to nothing, or Anthropic might just walk them back at a critical juncture.
2-5% is still wildly high in an absolute sense! However, risk from other labs seems even higher to me, and I think that Anthropic could reduce this risk by advocating for reasonable regulations (e.g. transparency into frontier AI projects so no one can build ASI without the government noticing).
I think a central consideration should be what can actually be enforced. For example, if it were true that ASI can be built with 1e20 FLOPs / 27 H100-hours (which I think is possible), this would be very impractical to enforce, and so we should build our plan around not having to enforce that.
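For reference, a back-of-the-envelope check of the conversion between those two figures. The per-GPU throughput is an assumption (roughly 1e15 FLOP/s, in the ballpark of an H100's peak dense BF16 rate) and the calculation assumes perfect utilization, so treat it as a rough sketch rather than a precise figure.

```python
# ASSUMPTION: ~1e15 FLOP/s per H100 (approximate peak dense BF16 throughput).
H100_FLOP_PER_SEC = 1e15
TOTAL_FLOP = 1e20


def gpu_hours(total_flop, flop_per_sec, utilization=1.0):
    """GPU-hours needed to perform total_flop at the given throughput and utilization."""
    seconds = total_flop / (flop_per_sec * utilization)
    return seconds / 3600


hours = gpu_hours(TOTAL_FLOP, H100_FLOP_PER_SEC)
print(f"{hours:.1f} H100-hours")  # 27.8 H100-hours
```

At more realistic utilization the number of GPU-hours goes up proportionally (e.g. `gpu_hours(TOTAL_FLOP, H100_FLOP_PER_SEC, utilization=0.4)` gives roughly 70), but it stays within reach of a single commodity GPU over a few days, which is what makes a threshold at that level so hard to enforce.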