I’m broadly interested in AI strategy and want to figure out the most effective interventions to get good AI outcomes.
Thomas Larsen
Also, the above isn’t even mentioning bio x-risk mediated by humans, or by trailing AIs during the chaos of takeoff. My guess is that those risks are substantially lower, e.g. maybe 1% and 2% respectively; again, I don’t feel confident.
Note: these are all rough numbers; I’d expect to shift substantially on all of this with further debate.
Suppose we made humanity completely robust to biorisk, i.e. we did sufficient preparation such that the risk of bio catastrophe (including AI-mediated biocatastrophe) was basically 0.[1] How much would this reduce total x-risk?
The basic story for any specific takeover path not mattering much is that the AIs, conditional on wanting to take over, will self-improve until they find the next easiest takeover path and do that instead. I think this is persuasive but not fully convincing, because:
AIs need to worry about their own alignment problem, meaning that they may not be able to self improve in an unconstrained fashion. We can break down the possibilities into (i) the AIs are aligned with their successors (either by default or via alignment being pretty easy), (ii) the AIs are misaligned with their successors but they execute a values handshake, or (iii) the AIs are misaligned with their successors (and they don’t solve this problem or do a values handshake). At the point of full automation of the AI R&D process (which I currently think of as the point at which AIs become more useful than humans for making AI progress, i.e., if we remove all AIs, progress slows by more than if we remove the humans), conditional on the AIs being misaligned, I currently think the relative likelihood of (i), (ii) and (iii) is 1:1:3, and the probability flows from (iii) into (ii) and (i) as the AIs get smarter.[2]
Raising the capability threshold that the AIs need in order to take over gives humanity more time. During this time, we might make alignment progress or notice that the AIs are misaligned and slow down / pause / add mitigations.
So it might be important for misaligned AIs to attempt a takeover early in the intelligence explosion. Specifically, we can ask “how much x-risk is averted if the probability of misaligned AI takeover before TED AI goes to 0?”, which attempts to capture all the worlds in which AIs attempt to take over before TED AI. I think my overall risk reduction is something like 1/4. In other words, there’s something like a 3/4 chance the AIs lurk (and can goal-guard or align successors), or can backdoor successors, or control their successors.
Now, conditional on the pre-TED AIs attempting a takeover, what are the different routes they might use? The most salient options to me are:
Bioweapon, then outcompete rest of humanity (e.g. via robots or human cultists)
Political operation / persuasion to quickly capture labs / govts (sufficient to prevent them from building other AIs while retaining your ability to foom in the future, once you’ve solved alignment, and to prevent safety research from happening that could expose you).
Macro-scale robot uprising (e.g. with drones / humanoid robots).
Nanotech
Other exotic tech / unknown unknowns.
The bio path seems to me to be the most compelling option here by a fair amount; maybe it gets 1/3 of the probability mass here. So, just from the risk of pre-TED AIs attempting to take over, we have something like 1/3 * 1/4 = 1/12 probability. If you multiply that by my likelihood of AI takeover, which is around 70%, you get ~6% risk flowing through this route. Then I update up to ~8% from other AIs, e.g. post-TED AIs relying on bio as a route to takeover.
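As a minimal sketch of that arithmetic (all inputs are the rough guesses above, not measurements):

```python
# Rough-guess inputs from the reasoning above.
p_takeover = 0.70        # my overall p(misaligned AI takeover)
p_early_attempt = 1 / 4  # p(takeover attempted before TED AI | takeover)
p_bio_route = 1 / 3      # p(bio is the route | early attempt)

risk_pre_ted_bio = p_takeover * p_early_attempt * p_bio_route
print(f"risk via pre-TED bio takeover: {risk_pre_ted_bio:.1%}")  # ~5.8%

# Rough upward update for post-TED AIs also relying on bio as a takeover route.
risk_total_bio = risk_pre_ted_bio + 0.02
print(f"total bio-mediated takeover risk: {risk_total_bio:.1%}")  # ~7.8%
```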
So my overall view on how much x-risk flows through bio-catastrophe is around 8%.
[1] Note that what exactly counts as a bio x-risk is slightly unclear, e.g. at some point the AIs can build drones / nanotech to get into the bio-bunkers.
[2] This breakdown isn’t exhaustive, because another salient possibility is that the AIs are clueless, e.g., they are misaligned with their successors but don’t realize it, similar to Agent 3 in AI 2027.
I think that people overrate bayesian reasoning and underrate “figure out the right ontology”.
Most of the way good thinking happens IMO is by finding and using a good ontology for thinking about some situation, not by probabilistic calculation. When I learned calculus, for example, it wasn’t mostly that I had uncertainty over a bunch of logical statements, which I then strongly updated upon learning the new theorems; it was instead that I learned a bunch of new concepts, which I then applied to reason about the world.
I think AI safety generally has much better concepts for thinking about the future of AI than other fields do, and this is a key source of alpha we have. But there are obviously still a huge number of disagreements remaining within AI safety. I would guess that debates would be more productive if we more explicitly focused on the ontology/framing that each person is using to reason about the situation, and then discussed to what extent that framing captures the dynamics we think are important.
I think it would be good if more people say things like “I think that’s a bad concept, because it obscures consideration X, which is important for thinking about the situation”.
Here are some widely used concepts I think are bad and I wish became less load bearing in AI safety discourse:
“Fast” and “slow” takeoff; takeoff speeds in general. I think these concepts are very unclear and not super natural. There are various operationalizations (e.g. Paul’s “slow takeoff” = a GDP doubling over the course of the 4 years before the first single year in which GDP doubles; see the sketch after this list). This is obviously arbitrary, and I don’t see why worlds that meet this definition are worth reasoning about separately from worlds that don’t. I also think it’s easy to smuggle in lots of other correlations here, like slow takeoff = people are woken up, or slow takeoff = alignment is much easier, etc.
I feel more excited about talking about things like “I think milestone X will happen at date Y”, or “My median is that milestone X and Z are Y time apart”. For example, I think the concepts of automating coding, automating all AI research, automating ~the whole economy, and increasing earth’s energy output 1000x are all useful capability milestones, and it’s useful to talk about these.
“scheming”. I think the definition of scheming is pretty unclear, and changes a bunch depending on the context.
Under some definitions I believe the AIs are always going to be scheming; under others, it seems kind of narrow and unnatural.
I somewhat prefer the concepts from the “alignment over time” box in AI 2027.
“gradual disempowerment”
I think this conflates a bunch of scenarios / threat models, some of which don’t make sense and some of which do, and I don’t think the solutions to them are very related.
Gradual disempowerment is often presented as not requiring any misalignment on the part of the AIs. But if there are AIs that are aligned with their human principals, we get a situation where, even without any coordination, AIs compete on behalf of their principals, and then give the principals whatever surplus they are able to produce.
Given that framing, we can now talk about specific threat models. For example, maybe there is no surplus: warfare/competition eats up all the additional resources, and space is consumed purely by the optimal self-replicators. Alternatively, maybe the AIs weren’t actually acting in the interests of humanity. Finally, maybe the process of competing hard was existentially catastrophic early on, e.g. maybe it resulted in the oceans being boiled (and humans didn’t take appropriate countermeasures), resulting in extinction.
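Returning to the takeoff-speeds bullet above, here’s the sketch I referenced: a toy checker for Paul’s operationalization (the function and its structure are just my illustration of the definition), which I think makes the arbitrariness vivid.

```python
def is_slow_takeoff(gdp):
    """Toy version of Paul's criterion: was there a complete 4-year interval in
    which world output doubled before the first single year in which it doubled?
    gdp is a list of annual world-output values."""
    for i in range(1, len(gdp)):
        if gdp[i] >= 2 * gdp[i - 1]:  # first 1-year doubling
            return any(gdp[j] >= 2 * gdp[j - 4] for j in range(4, i))
    return None  # no 1-year doubling yet
```

Nothing about the constants 1, 2, and 4 is privileged, and worlds just on either side of the boundary can look nearly identical.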
Note: I’m not saying that all concepts I think are bad can be saved or crystallized into good concepts; often the best idea is to just forget about the original concept and use better/crisper ones.
I should also maybe give some concepts I think are generally good: p(doom) from AI takeover, timelines, x-risk, s-risk, recursive self improvement, internal/external deployment, AI control, AI alignment. I don’t think any of these concepts are fully crisp, all have somewhat important edge cases, but I think they are good enough abstractions to be very useful for thinking about the future.
>Relatedly, I’m also interested in the simple method of extrapolating AI revenue growth trends until AI revenue is most of the world economy. That seems like a decent proxy for when AGI will be achieved. I trust this method less than our model for obvious reasons, but I still put some weight on it. What does it say? Well, it says “Early 2030s.” OK.
I’m curious why you trust revenue extrapolation less than the model. Intuitively, revenue seems like a better thing to extrapolate than benchmarks or flops or whatever, because it’s much less gameable and there’s a much clearer threshold for AGI (revenue similar in size to GDP).
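For reference, the extrapolation itself is simple. A sketch with placeholder numbers (the starting revenue and growth rate are illustrative assumptions, not the actual figures):

```python
import math

ai_revenue = 1e10    # current annual AI revenue, $ (placeholder)
world_gdp = 1e14     # ~$100T world GDP (rough)
annual_growth = 3.0  # revenue triples per year (placeholder trend)

years = math.log(world_gdp / ai_revenue) / math.log(annual_growth)
print(f"AI revenue ~ world GDP in about {years:.0f} years")  # ~8 on these numbers
```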
Agree with what Habryka said. Also, Daniel, I, and other AIFP people would update on feedback and care about being cooperative. If anyone is interested in giving feedback on our new scenario about a positive vision post-AGI (about either the content or the name/branding), please email me.
Also, to reiterate: AI 2027 was obviously not a confident prediction of AGI in 2027; it was a scenario in which AGI arrives in 2027, which seems like a plausible and IMO ~modal timeline, and we clearly stated this on the website.
I think that to ban ASI you’d need to ban something like AGI, because of intelligence explosion dynamics, so it’s not clear it makes a big difference.
>proper global UBI is *enormously* expensive (h/t @yelizarovanna)
This seems wrong. There will be huge amounts of wealth post-ASI. Even a relatively small UBI (e.g. a 1% claim on AI companies) would be enough to support a way better QOL for everyone on earth. Moreover, everything will become way cheaper because of efficiency gains downstream of AI. Even just at AGI, I think it’s plausible that physical labour is something like 10x cheaper and cognitive labour is something like 1000x cheaper.
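Toy arithmetic (all numbers illustrative): a 1% claim on an AI sector the size of today’s entire world economy would be small per person, but the whole point is that the post-ASI economy is much bigger.

```python
world_gdp = 1e14  # ~$100T/year (rough)
population = 8e9
ubi_share = 0.01  # a 1% claim on AI-sector output (illustrative)

for growth in (1, 10, 100):  # post-ASI economy as a multiple of today's (assumption)
    per_person = world_gdp * growth * ubi_share / population
    print(f"{growth:>3}x economy -> ${per_person:,.0f} per person per year")
# 1x -> $125, 10x -> $1,250, 100x -> $12,500
```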
>So, Korea and Japan are building over 100x as much ship per worker-year as the US
Can you spell out why you think this? Do Korea and Japan produce ~100x more ships than the US?
I think a central consideration should be what can actually be enforced. E.g., if it were true that ASI can be built with 1e20 FLOP / ~27 H100-hours (which I think is possible), this would be very impractical to enforce, and so we should build our plan around not having to enforce that.
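Sanity check on that figure (assuming roughly 1e15 FLOP/s of usable throughput per H100, an order-of-magnitude number):

```python
flops = 1e20               # hypothesized ASI training compute
h100_flops_per_sec = 1e15  # order-of-magnitude usable throughput per H100
hours = flops / h100_flops_per_sec / 3600
print(f"{hours:.0f} H100-hours")  # ~28
```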
Thanks for writing this paper.
>Why do we need to halt for so long? In short, AI alignment is probably a difficult technical problem, and it is hard to be confident about solutions. Pausing for a substantial period gives humanity time to be careful in this domain rather than rushing. Pausing for a shorter amount of time (e.g., 5 years) might reduce risk substantially compared to the current race, but it also might not be enough. In general, world leaders should weigh the likelihood and consequence of different risks and benefits against each other for different lengths of a pause. Section 2 discusses some of the reasons why the AI alignment problem may be difficult. Generally, experts vary in their estimates of the difficulty of this problem and the likelihood of catastrophe, with some expecting the problem to be very hard [Grace et al., 2025, ControlAI, 2025, Wikipedia, 2025]. Given this uncertainty about how difficult this problem is, we should prepare to pause for a long time, in case more effort is needed. Our agreement would allow for a long halt, even if world leaders later came to believe a shorter one was acceptable. We also contend that there are other problems which need to be addressed during a halt even if one presumes that alignment can be quickly solved, and these problems are also of an uncertain difficulty. These include risks of power concentration, human misuse of AIs, mass-unemployment, and many more. World leaders will likely want at least years to understand and address these problems. The international agreement proposed in this paper is primarily motivated by risks from AI misalignment, but there are numerous other risks that it would also help reduce.
I agree with a lot of this, but I do think this paper equivocates a bit between “we need to halt for decades” and “we might need to halt for decades”. I agree with the latter but not the former.
I also think that if alignment is solvable sooner, it might matter a lot that we accelerated alignment in the meantime.
I get that it’s scary to have to try to bifurcate alignment and capabilities progress because governments are bad at stuff, but I think it’s a mistake to ban AI research wholesale, because it will have very negative consequences for the rate of AI alignment research. I think that we should try hard to figure out what can be done safely (e.g. via things like control evals), and then do alignment work on models that we can empirically study that are as capable as possible while incurring minimal risk.
Serial time isn’t the only input that matters: smarter AIs are helpful as research assistants and as subjects to experiment on directly, lots of compute for alignment experiments is nice, and lots of money and talent going into AI alignment is helpful. I think you guys should think more clearly about the function you are trying to maximize (i.e., how much do you really care about marginal serial time vs. marginal serial time with smart AIs to do experiments on?).
Thanks, I thought this was a helpful comment. Putting my responses inline in case it’s helpful for people. I’ll flag that I’m a bit worried about confirmation bias / digging my heels in: would love to recognize it if I’m wrong.
>How bad is Chinese Superintelligence? For some people, it’s a serious crux whether a China-run superintelligence would be dramatically worse in outcome than a democratic country.
This isn’t a central crux for me, I think. I would say that it’s worse, but I’m willing to make concessions here in order to make alignment more likely to go well.
>“The gameboard could change in all kinds of bad ways over 30 years.” Nations or companies could suddenly pull out in a disastrous way. If things go down in the near future there’s fewer actors to make deals with and it’s easier to plan things out.
This is the main thing for me. We’ve done a number of wargames of this sort of regime, and the regime often breaks down (though there are things that can be done to make it harder to leave the regime, which I’m strongly in favor of).
>Can we leverage useful work out of significantly-more-powerful-but-nonsuperhuman AIs? Especially since “the gameboard might change a lot”, it’s useful to get lots of safety research done quickly, and it’s easier to do that with more powerful AIs. So, it’s useful to continue to scale up until we’ve got the most powerful AIs we can confidently control. (Whereas Controlled Takeoff skeptics tend to think AI that is capable of taking on the hard parts of AI safety research will already be too dangerous and untrustworthy.)
Yep, I think we plausibly can leverage controlled AIs to do existentially useful work. But I’m not confident, and I am not saying that control is probably sufficient. I think superhuman isn’t quite the right abstraction (as I think it’s pretty plausible we can control moderately superhuman AIs, particularly ones that are superhuman only in certain domains), but that’s a minor point. I think Plan A attempts to be robust to the worlds where this doesn’t work by pivoting back to human intelligence augmentation or whatever.
>Is there a decent chance an AI takeover is relatively nice? Giving the humans the Earth/solar system is just incredibly cheap from a percentage-of-resources standpoint. This does require the AI to genuinely care about and respect our agency in a sort of complete way. But it only has to care about us a pretty teeny amount.
This is an existential catastrophe IMO and should be desperately avoided, even if they do leave us a solar system or w/e.
>And then, the usual “how doomed are current alignment plans?”. My impression is “Plan A” advocates are usually expecting a pretty good chance things go pretty well if humanity is making a reasonably good-faith attempt at controlled takeoff, whereas Controlled Takeoff skeptics are typically imagining “by default this just goes really poorly, you can tell because everyone seems to keep sliding off understanding or caring about the hard parts of the problem”.
I think the thing that matters here is the curve of “likelihood of alignment success” vs “years of lead time burned at takeoff”. We are attempting to survey this among the thinkers in this space whom we most respect on this question, and I do think there’s substantial win equity in moving from no lead time to years or decades of lead time. Of course, I’d rather have higher assurance, but I think you really need to believe the very strong version of “current plans are doomed” to forgo Plan A. I’m very much on board with “by default this goes really poorly”.
>Three cruxes I still just don’t really buy as decision-relevant:
>“We wouldn’t want to pause 30 years, and then do a takeoff very quickly – it’s probably better to do a smoother takeoff.” Yep, I agree. But, if you’re in a position to decide-on-purpose how smooth your takeoff is, you can still just do the slower one later. (Modulo “the gameboard could change in 30 years”, which makes more sense to me as a crux). I don’t see this as really arguing at all against what I imagined the Treaty to be about.
huh, this one seems kinda relevant to me.
>“We need some kind of exit plan, the MIRI Treaty doesn’t have one.” I currently don’t really buy that Plan A has more of one than the MIRI Treaty. The MIRI treaty establishes an international governing body that makes decisions about how to change the regulations, and it’s pretty straightforward for such an org to make judgment calls once people have started producing credible safety cases. I think imagining anything more specific than this feels pretty fake to me – that’s a decision that makes more sense to punt to people who are more informed than us.
If the international governing body starts approving AI development, then aren’t we basically just back in the Plan A regime? Ofc I only think that scaling should happen once people have credible safety cases; I just think control-based safety cases are sufficient. I think we can make some speculations about what sorts of safety cases might work and which ones don’t. And I think the fact that the MIRI treaty isn’t trying to accelerate prosaic safety / substantially slows it down is a major point against it, which is reasonable to summarize as them not having a good exit plan.
I’m very sympathetic to pausing until we have uploads / human intelligence augmentation; that seems good, and I’d like to do that in a good world.
>Shutdown is more politically intractable than Controlled Takeoff. I don’t currently buy that this is true in practice. I don’t think anyone is expecting to immediately jump to either a full-fledged version of Plan A, or a Global Shutdown. Obviously, for the near future, you try for whatever level of national and international cooperation you can get, build momentum, do the easy sells first, etc. I don’t expect, in practice, Shutdown to be different from “you did all of Plan A, and, then, took like 2-3 more steps”, and by the time you’ve implemented Plan A in its entirety, it seems crazy to me to assume the next 2-3 steps are particularly intractable.
>I totally buy “we won’t even get to a fully fledged version of Plan A”, but, that’s not an argument for Plan A over Shutdown.
>It feels like people are imagining “naive, poorly politically executed version of Shutdown, vs some savvily executed version of Plan A.” I think there are reasonable reasons to think the people advocating Shutdown will not be savvy. But those reasons don’t extend to “insofar as you thought you could savvily advocate for Plan A, you shouldn’t be setting your sights on Shutdown.”
This one isn’t a crux for me I think. I do probably think it’s a bit more politically intractable, but even that’s not obvious because I think shutdown would play better with the generic anti-tech audience, while Plan A (as currently written) involves automating large fractions of the economy before handoff.
I think I mostly am on board with this comment. Some thoughts:
>Before I did a rapid-growth of capabilities, I would want a globally set target of “we are able to make some kind of interpretability strides or evals that make us better able to predict the outcome of the next training run.”
this feels a bit overly binary to me. I think that understanding-based safety cases will be necessary for ASI, but behavioral methods seem like they might be sufficient beforehand.
I don’t know what you mean by “rapid growth”. It seems like you might be imagining the “shut it all down → solve alignment during pause → rapidly scale after you’ve solved alignment” plan. I think we probably should never do a “rapid scaleup”.
Another reaction I have is that a constraint on coordination will probably be “is the other guy running a blacksite that will screw us over”. So I think there’s a viability bump at the point of “allow legal capabilities scaling at least as fast as the max-size blacksite that you would have a hard time detecting”.
>I would want to do at least some early global pause on large training runs, to check if you are actually capable of doing that at all. (in conjunction with some efforts attempting to build international goodwill about it)
So I think this paragraph isn’t really right, because “slowdown” != “pause”, and slowdowns might still be really, really helpful and enough to get you a long way.
>One of the more important things to do as soon as it’s viable, is to stop production of more compute in an uncontrolled fashion. (I’m guessing this plays out with some kind of pork deals for nVidia and other leaders[2], where the early steps are ‘consolidate compute’, and then them producing the chips that are more monitorable, and which they get to make money from, but also are sort of nationalized). This prevents a big overhang.
I actually currently think that you want to accelerate compute production, because hardware scaling seems safer than software scaling. I’m not sure exactly what you mean by “in an uncontrolled fashion”. If you mean “have a bunch of inspectors making sure the flow of new chips isn’t being smuggled to illegal projects”, then I agree with this; on my initial read I thought you meant something like “pause chip production until they start producing GPUs with HEMs in them”, which I think is probably bad.
In other words I think that you want to create a big compute overhang during a pause. The downside is obvious, but the upsides are:
compute is controllable, far more than software, and so differentially advances legal projects.
more compute for safety. We want to be able to pay a big safety tax, and more compute straightforwardly helps.
extra compute progress funges against software progress, which is scarier.
compute is destroyable (e.g. we can reverse course and destroy compute if we want to eat an overhang), but software progress mostly isn’t (you can’t unpublish research).
(this comment might be confusing because I typed it quickly, happy to clarify if you want)
One framing that I think might be helpful for thinking about “Plan A” vs “shut it all down” is: “Suppose that you have the political will for an n-year slowdown, i.e. after n years, you are forced to handoff trust to superhuman AI systems (e.g. for n = 5, 10, 30). What should the capability progression throughout the slowdown be?” This framing forces a focus on the exit condition / plan to do handoff, which I think is an underdiscussed weakness of the “shut it all down” plan.
I think my gut reaction is that the most important considerations are: (i) there are a lot of useful things you can do with the AIs, so I want more time with the smarter AIs, and (ii) I want to scale through the dangerous capability range slowly and with slack (as opposed to at the end of the slowdown).
this makes me think that particularly for a shorter slowdown (e.g. 5 years), you want to go fast at the beginning (e.g. scale to ~max controllable AI over the first year or two), and then elicit lots of work out of those AIs for the rest of the time period.
A key concern for the above plan is that govts/labs botch the measurement of “max controllable AI”, and scale too far.
But it’s not clear to me how a further delay helps with this, unless you have a plan for making the institutions better over time, or pursuing a less risky path (e.g. ignoring ML and doing human intelligence augmentation).
Going slower, on the other hand, definitely does help, but requires not shutting it all down.
More generally, it seems good to do something like “extend takeoff evenly by a factor of n”, as opposed to something like “pause for n-1 years, and then do a 1-year takeoff” (see the toy comparison at the end of this comment).
I am sympathetic to “shut it all down and go for human augmentation”: I do think this reduces AI takeover risk a lot, but it requires a very long pause, and it requires our institutions to bet big on a very unpopular technology. I also think that convincing governments to “shut it all down” without any exit strategy seems quite difficult.
Ofc this framing also ignores some important considerations, e.g. choices about the capability progression affect both the difficulty of enforcement/verification (in both directions: AI lie detectors / AI verification are helpful, while having AIs closer to the edge is a downside) and willingness to pay over time (e.g. scary demos or AI for epistemics might help increase WTP).
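And here is the toy comparison referenced above. The “dangerous range” of capability (0.6-0.9 here) and both schedules are made up purely for illustration; the point is just that an even slowdown spends far more calendar time in that range than pause-then-sprint does.

```python
n = 10  # years of slowdown (illustrative)

def years_in_danger_zone(schedule, lo=0.6, hi=0.9, steps=10_000):
    """Calendar time the capability trajectory spends in the dangerous range."""
    dt = n / steps
    return sum(dt for i in range(steps) if lo <= schedule(i * dt) <= hi)

even = lambda t: t / n  # extend takeoff evenly over n years
pause_then_sprint = lambda t: max(0.0, t - (n - 1))  # pause n-1 years, 1-year takeoff

print(years_in_danger_zone(even))               # ~3.0 years in the range
print(years_in_danger_zone(pause_then_sprint))  # ~0.3 years
```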
>However, I also think that open agency approaches to transparency face two key difficulties: competitiveness and safety-of-the-components.[18]
I think a third key difficulty with this class of approaches is something like “emergent agency”, i.e. each of the individual components seems to be doing something safe, but when you combine several of the agents, you get a scary agent. Intuition pump: each of the weights in a NN is very understandable (it’s just a number) and not doing dangerous scheming, but if you compose them it might be scary. Analogously, each of the subagents in the open agency AI might not be scheming, but a collection of these agents might be scheming.
Understanding the communications between the components seems like it may or may not be sufficient to mitigate this failure mode. If the understanding is “local”, i.e. looking at a particular chain of reasoning and verifying that it is valid, this is probably not sufficient to mitigate the problem, as scary reasoning might be made up of a bunch of small chains of locally valid reasoning that look safe. So I think you want something like a reasonable global picture of the reasoning that the open agent is doing in order to mitigate “emergent agency”.
I think this is related to the “safety of the components” failure mode you talk about, particularly in the analogy to the corporation passing memos around where the memos don’t correspond to the “real reasoning” going on. However, it could be that the “real reasoning” emerges at a higher level of abstraction than the individual agents.
This sort of threat model leads me to think that if we’re aiming for this sort of open agency, we shouldn’t do end-to-end training of the whole system, lest we incentivize “emergent agency”, even if we don’t make the individual components less safe.
One upside of “shut it all down” is that it does in fact buy more time: in Plan A it is difficult to secure algorithmic secrets without extremely aggressive security measures, hence any rogue projects (e.g. nation-state blacksites) can just coast off the algos developed by the verified projects. Then, a few years in, they fire up their cluster and try to do an intelligence explosion with the extra algorithmic progress.
>superintelligence
Small detail: My understanding of the IABIED scenario is that their AI was only moderately superhuman, not superintelligent
This post’s forecasts seem systematically too slow to me, and to underrate the capabilities of superintelligence. One particular point of disagreement:
>It seems reasonable to use days or weeks as an upper bound on how fast robot doublings could become, based on biological analogies. This is very fast indeed.[20]
When I read this, I thought it would say “lower bound”. Why would you expect evolution to find globally optimal doubling times? This reads to me a bit like saying that the speed of a cheetah or the size of a blue whale is an upper bound on the speed/size of a robot. Why???
The case for a lower bound seems clear: biology did it, and a superintelligence could probably design a more functional robot than biology.
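For a sense of scale (simple compounding, not a forecast):

```python
doubling_time_days = 7  # the "weeks" end of the biological-analogy range
doublings_per_year = 365 / doubling_time_days
print(f"~{2 ** doublings_per_year:.0e}x growth per year")  # ~5e15x with weekly doublings
```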
Small typo: A·log(B) = log(B^A), not log(A^B)
Also, there’s a good chance AI gov won’t work, and labs will just have a very limited safety budget to implement their best-guess mitigations. Or maybe AI gov does work and we get a large budget; we still need to actually solve alignment.
I agree that AI successor-alignment is probably easier than the human-AI alignment problem.
One additional difficulty for the AIs is that they need to solve the alignment problem in a way that humans won’t notice/understand (or else the humans could take the alignment solution and use it for themselves / shut down the AIs). During the regime before human obsolescence, if we do a reasonable job at control, I think it’ll be hard for them to pull that off.