Why do we need to halt for so long? In short, AI alignment is probably a difficult technical problem, and it is hard to be confident about solutions. Pausing for a substantial period gives humanity time to be careful in this domain rather than rushing. Pausing for a shorter amount of time (e.g., 5 years) might reduce risk substantially compared to the current race, but it also might not be enough. In general, world leaders should weigh the likelihood and consequence of different risks and benefits against each other for different lengths of a pause. Section 2 discusses some of the reasons why the AI alignment problem may be difficult. Generally, experts vary in their estimates of the difficulty of this problem and the likelihood of catastrophe, with some expecting the problem to be very hard [Grace et al., 2025, ControlAI, 2025, Wikipedia, 2025]. Given this uncertainty about how difficult this problem is, we should prepare to pause for a long time, in case more effort is needed. Our agreement would allow for a long halt, even if world leaders later came to believe a shorter one was acceptable. We also contend that there are other problems which need to be addressed during a halt even if one presumes that alignment can be quickly solved, and these problems are also of uncertain difficulty. These include risks of power concentration, human misuse of AIs, mass unemployment, and many more. World leaders will likely want at least years to understand and address these problems. The international agreement proposed in this paper is primarily motivated by risks from AI misalignment, but there are numerous other risks that it would also help reduce.
Thanks for writing this paper.
I agree with a lot of this, but I do think this paper equivocates a bit between “we need to halt for decades” and “we might need to halt for decades”. I agree with the latter but not the former.
I also think that, in the cases where alignment is solvable sooner, it might matter a lot that we accelerated alignment work in the meantime.
I get that it’s scary to have to try to bifurcate alignment and capabilities progress because governments are bad at stuff, but I think it’s a mistake to ban AI research, because it will have very negative consequences for the rate of AI alignment research. I think we should try hard to figure out what can be done safely (e.g., via things like control evals), and then do alignment work on models we can empirically study that are as capable as possible while incurring minimal risk.
Serial time isn’t the only input that matters: having smarter AIs is helpful both as research assistants and as subjects to run experiments on directly; having lots of compute for alignment experiments is nice; and having lots of money and talent going into AI alignment is helpful. I think you should think more clearly about, and emphasize, the function you are trying to maximize (i.e., how much do you really care about marginal serial time versus marginal serial time with smart AIs to do experiments on?).
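To make the “what function are you maximizing” point concrete, here is a deliberately toy sketch. Everything in it (the functional forms, the constants, the variable names) is invented for illustration and is not drawn from the paper; it just makes the trade-off between marginal serial time, assistant capability, and compute discussable in one place.

```python
# Toy model (illustrative only): treat alignment research progress as a function
# of serial calendar time, the capability of AI assistants available to
# experiment on, and compute for alignment experiments. All functional forms
# and constants are hypothetical assumptions, not estimates.

import math

def alignment_progress(serial_years: float,
                       assistant_capability: float,  # 0 = none, 1 = ~current frontier
                       compute_exaflop_years: float,
                       researchers: int) -> float:
    """Return a unitless 'progress' score under toy assumptions."""
    # Assumption: smarter assistants multiply the productivity of each serial
    # year, with diminishing returns.
    assistant_multiplier = 1.0 + 2.0 * math.log1p(assistant_capability)
    # Assumption: compute and researcher count also have diminishing returns.
    resource_factor = math.log1p(compute_exaflop_years) * math.sqrt(researchers)
    return serial_years * assistant_multiplier * resource_factor

# Compare a pure-serial-time world with a world that trades some serial time
# for stronger assistants and more compute.
slow_and_long = alignment_progress(serial_years=30, assistant_capability=0.2,
                                   compute_exaflop_years=1, researchers=300)
faster_with_ai = alignment_progress(serial_years=10, assistant_capability=1.0,
                                    compute_exaflop_years=10, researchers=1000)
print(f"long halt, weak assistants:      {slow_and_long:.1f}")
print(f"shorter halt, strong assistants: {faster_with_ai:.1f}")
```

The numbers mean nothing; the point is that which scenario comes out ahead depends entirely on the assumed multipliers, and that disagreement about those multipliers is exactly the crux being pointed at here.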
I’m struck by how many of your cruxes seem like things that would actually just be in the hands of the international governing body to control. My guess is: if DARPA has a team of safety researchers, and they go to the international body and say ‘we’re blocked by this set of experiments* that takes a large amount of compute; can we please have more compute?’, then the international body gets some panel of independent researchers to confirm that this is true and that the only solution is more compute for that particular group of researchers, and it commissions a datacenter or something so that the research can continue.
Like, it seems obviously true to me that people (especially in government/military) will continue working on the problem at all, and that access to larger amounts of resources for doing that work is a matter of petitioning the body. It feels like your plan is built around facilitating this kind of carveout, and the MIRI plan is built around treating it as the exception that it is (and prioritizing gaining some centralized control over AI as a field over guaranteeing to-me-implausible rapid progress toward the best possible outcomes).
*which maybe is ‘building automated alignment researchers’, but better specified and less terrifying
Subcruxes of mine here:
I think that by the time any kind of international deal goes through, we will basically have already reached the frontier of what is safe, so discussing whether the regime should want more capabilities in the immediate future feels like splitting hairs.
(surely it’s going to take at least a year, which is a pretty long time; it’s probably not going to happen at all; and 3 years to even get started is more like what I imagine for a very successful Overton-smashing campaign)
I think there’s tons of research augmentation you can do with 1-3-year-from-now AI that is more about leveraging existing capabilities than getting fundamentally smarter.
I don’t buy that there’s a way to get end-to-end research, or “fundamentally smarter” research assistants, that aren’t unacceptably dangerous at scale.
(i.e. I believe you can train on more specific ). (man, I have no idea what I meant by that sentence fragment. sorry, person who reacted with ”?”)
Do those feel like subcruxes for you, or are there other ones?