Why I Think Pause is Impossible
Note: I deeply believe in trying to figure out how to make AI go well for humanity. If you’ve read other things I’ve written you’ll notice I am too dumb to figure out solutions[1], but on occasion I think I see gaps in proposed solutions from others. I am not writing this on the basis of some weirdo motivation other than to try to encourage debate and rigorous thinking on the very popular idea of pausing AI development because I think it has major vulnerabilities.
Why Pause?
The idea of a global pause feels natural: we can see that AI is changing everything, and no one can deny that there is a chance things could go very poorly, so wouldn’t it be good to pause to give us more time to figure things out? Depending on how you view the stakes, the idea could even be viewed as a moral obligation. Unfortunately, I think pause is almost certainly an impossibility due to the structure of the game being played. Pausing ends up as a robustly dominated strategy that no rational self-preserving actor can choose regardless of how dangerous they think superintelligence might be.
For pause to work, four things would need to be true, none of which are: the game would need to continue indefinitely, compliance would need to be verifiable, pausing would need to be decision-theoretically rational, and uncertainty would need to favor caution. I will work through each of these in turn.
The Game Does Not Continue Indefinitely
Every call for a coordinated pause, whether a voluntary moratorium, a binding treaty, or a formal non-proliferation agreement[2], implicitly relies on the Folk Theorem of repeated games. The concept is simple: cooperation is maintained because the discounted value of continued future play exceeds the temptation to cheat. This is the basic idea that explains why we didn’t all die in a nuclear apocalypse during the Cold War and, fingers crossed if you are reading this, still haven’t. Everyone understands that this game doesn’t end.
The race to superintelligence is different because it has an “absorbing” state, a state that once entered ends the game forever. In this case, the game could end forever because the first to superintelligence could secure omnipotent, unipolar control. However, we don’t actually need a condition this strict; we only need a weaker threshold property: that there exists some capability level θ* past which the leader’s advantage compounds. Any actor with AI like this can improve faster, beat their competitors more thoroughly, etc. in a way that trailing actors can’t keep up with, eventually causing the trailing actors to discontinuously shift from “compete” to “capitulate”. When that happens, the game is over. This violates the Folk Theorem, which requires the cumulative probability of the game continuing forever to be strictly positive.[3]
This is immensely problematic because in any given period a rational actor can reach θ* by choosing to defect, and the one-time prize of crossing the threshold first can outweigh any discounted stream of cooperative payoffs.
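Footnote 3 states the condition formally; here is a toy numerical sketch of it (the 2% per-period figure is my own illustrative number, not from the essay). If the per-period probability that someone reaches the absorbing state is bounded away from zero, the probability that the game survives long enough for the Folk Theorem to bite decays geometrically toward zero.

```python
# Toy illustration of footnote 3: the probability the race "never ends" is the
# infinite product of per-period survival probabilities (1 - p_t).  With p_t
# bounded away from zero, that product decays geometrically toward 0.

def survival_probability(p_t: float, periods: int) -> float:
    """P(no one reaches the absorbing state in the first `periods` rounds),
    assuming a constant per-period breakout probability p_t."""
    prob = 1.0
    for _ in range(periods):
        prob *= (1.0 - p_t)
    return prob

# Even a modest 2% per-period breakout chance makes indefinite play
# vanishingly likely over policy-relevant horizons.
for horizon in (10, 50, 200):
    print(horizon, survival_probability(0.02, horizon))
```

The Folk Theorem's cooperation condition needs this survival probability to stay strictly positive as the horizon grows; with any constant positive p_t it does not.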
Compliance is Not Verifiable
Think about this: merely having a nuclear weapon doesn’t win the game; using it does. And if an agent uses a nuclear bomb, all the other agents would know in minutes and launch retaliatory strikes. In other words, no one can secretly win the war. This is different from the path to superintelligence because crossing θ* produces no observable signal: the decisive step can happen silently inside a datacenter.
As I mentioned in my essay on canonical probabilities[4], general purpose software natively contains “splice programs”: research programs that look exactly like normal narrow AI development or safety research until the moment they cross θ*, by which point it is too late for anyone to respond.
Pausing is Not Decision-Theoretically Rational
Let’s backtrack for a moment and assume the two points above were wrong (the game continued indefinitely and pausing was verifiable). Even then, pause is robustly dominated. If you erroneously use infinite payoffs[5] in your payoff matrix, it’s easy to get stuck in Pascal’s Mugging situations, so I don’t want to use anything with infinity here.
The probabilities that would allow an agent to figure out if pause is rational are stuck in an uncertain range (an “identification region”). Because the agent has no way of knowing the exact chances, they must assume the worst and make the best decision they can (Minimax Regret). To show this, let’s define prior-free identification regions for the key parameters:
- p: partially identified probability an adversary secretly defects.
- q: partially identified probability defection successfully yields θ*.
- V: value of unipolar ASI (astronomically large, but strictly finite; no need for it to be ∞).
- L: finite loss from getting caught defecting (sanctions, kinetic strikes, etc.).
- S: status quo value.
Now we will make only one overall assumption: that we cannot canonically exclude the possibility of secret defection (p > 0), that we can’t exclude the possibility of success (q > 0), and that the unipolar threshold advantage is more valuable than the penalty for getting caught defecting plus the status quo (V ≫ L + S).
Next we apply minimax regret, where the regret of an action is the difference between the payoff you actually received and the payoff you would have received had you chosen optimally:
The maximum regret of pausing is that an adversary defects and achieves θ*, which forces you into unrecoverable strategic subordination. At points in the identification region that cannot be excluded, this expected regret is at least on the order of pqV. The maximum regret of defecting is being caught before you reach θ* and suffering L, which is a finite, canonically bounded geopolitical penalty.

Because evidential screening ensures the identification region always contains points where pqV vastly exceeds L, the minimax-regret choice is to defect; pausing is robustly dominated.
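To make the comparison concrete, here is a minimal numerical sketch. All values below are hypothetical placeholders of my own choosing (the essay deliberately leaves p and q only partially identified, so we evaluate regret at the worst corner of assumed identification intervals):

```python
# Toy minimax-regret comparison of "pause" vs. "defect".  Every number here is
# an illustrative assumption, not a claim about real-world magnitudes.

V = 1e9   # prize for reaching the unipolar threshold first (large but finite)
L = 1e3   # bounded geopolitical penalty for being caught defecting
S = 1.0   # status-quo payoff

p_region = (0.0, 0.3)  # identification region: adversary's defection probability
q_region = (0.0, 0.5)  # identification region: probability defection succeeds

# Worst-case regret of pausing: the adversary defects and succeeds at the
# corner of the region we cannot rule out, costing roughly p * q * V.
max_regret_pause = max(p * q * V for p in p_region for q in q_region)

# Worst-case regret of defecting: caught short of the threshold, suffering the
# bounded penalty L plus the forgone status quo S.
max_regret_defect = L + S

# Minimax regret selects the action with the smaller worst-case regret.
best = "defect" if max_regret_defect < max_regret_pause else "pause"
print(max_regret_pause, max_regret_defect, best)
```

The point of the sketch is structural, not numerical: as long as the identification region cannot exclude positive p and q, the worst-case regret of pausing scales with the unbounded-in-practice V while the worst-case regret of defecting stays pinned at the finite L + S.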
Uncertainty Does Not Favor Caution
This section feels bad to write because it is very counter-intuitive. Most of the time when you are uncertain about risks, it makes all the sense in the world to be cautious and pause. If you are in the mountains and want to cross a slope but are not sure if it is going to avalanche, it’s totally reasonable to not cross it or wait and get more data.
But in the AI development game, irreducible uncertainty is anti-cautious. What I mean is that for pause to be a rational strategy, a pausing actor must have canonical confidence that their adversary’s defection probability is very close to zero. But the non-canonical framework shows that an actor literally can’t pin down this probability on any policy-relevant timeline. So pausing isn’t safe in any traditional sense; it asks nation-states to make a sovereignty-level bet on a parameter that is provably unknowable.
Even worse, this has a sort of ratchet effect: once a nation tastes the forbidden sweetness of defecting, each increment of progress lowers the remaining cost to get to θ*, making every subsequent decision to keep defecting easier than the last.
Conclusion
In an attempt to offer a constructive solution, one counterintuitive idea I had (one that I only hold loosely) is that more heavily entangling AI development among nations, rather than trying to segregate it under a racing development paradigm, may be beneficial. In other words, if the US opened the floodgates to Chinese firms and they jointly worked together, then it is the world that approaches θ* collectively, rather than any single actor crossing it alone.
In conclusion, strategies ported from the 20th century are appealing because they are familiar and have served humanity well. It would be awesome if superintelligence had the same qualities that made the last century’s crises so tractable. It doesn’t. It would be better for humanity if we stopped pretending it did.
- ^
I tried to think of one solution at the conclusion, but it may be a really bad idea.
- ^
I wrote a more narrow critique on this topic here: The Jackpot Jinx (or why “Superintelligence Strategy” is wrong).
- ^
The cumulative probability that the game continues forever, ∏_t (1 − p_t), must be strictly positive. p_t is the probability that someone reaches the absorbing state in period t. The infinite product is positive iff Σ_t p_t < ∞. As global compute scales, more actors enter the race, and AI becomes better, p_t is not decaying; rather, it is likely bounded away from 0 or increasing, so the sum diverges and the product converges to zero. Importantly, unlike the exogenous termination risks that the Folk Theorem can accommodate, reaching θ* is an endogenous choice that rewards the actor who makes it, meaning the game wouldn’t end randomly; it ends because a player chose to win.
- ^
Unprecedented Catastrophes Have Non-Canonical Probabilities.
- ^
As I stupidly did in my previous paper The Jackpot Jinx.
- ^
Note: this could be contestable if L is a nuclear war or something terrible.
There is a risk that AIs are the winners, and no human countries or human individuals win (this could take the form of extinction or permanent disempowerment). And this risk plausibly changes, reduces over time (rather than the uncertainty about a fixed risk getting resolved), as better understanding of how to manage the technical and social problems becomes available. What a ban/pause buys humanity is this improvement in the chances that there are any winners for humanity at all, once sufficiently capable AIs are eventually built.
(This can make unilaterally defecting undesirable: the defector loses too, getting a worse outcome than if they hadn’t defected, so there is no temptation to cheat sooner rather than later. And there is a simple if painful way of ensuring verification: nobody builds fabs that are too advanced, or giant datacenters. There are less painful ways to ensure verification that maintain a lot of the effect.)
An analysis that ignores this factor could still be useful: it explores the possibility of agreement on policy with people who categorically disbelieve literal extinction or permanent disempowerment (where AIs end up controlling almost all resources and the future of humanity is left with scraps, with no possibility of ever changing this), or who disbelieve that these risks can get better if humanity takes time and doesn’t build powerful AIs immediately, as soon as technologically possible.
Also, as more understanding accumulates and the world gets closer to building sufficiently capable AIs safely, the same issues would emerge as in a world that doesn’t significantly risk extinction or permanent disempowerment for all of humanity (or where this risk is constant and isn’t reduced by a longer or better-coordinated ban/pause). So safely coordinating a ban/pause towards its end is also an important class of problems, even in a ban/pause world that solves most of the other problems.
Thanks for the comment, I think you have some great points. I am going to try to explain my responses, even though I suspect they may be as unpopular as the post already is. One quick thing at the outset: my essay is not based on disbelieving extinction or terrible outcomes risk, my claim is pause is dominated even if you take that risk very seriously (which I do.)
My model does assume that V > 0, and if the winner might suffer catastrophic devastation then V is a lottery rather than a clear reward. That said, I think adding alignment risks makes the argument harder to address. The verification problem is the same because you can’t tell whether a rival’s pause deploys compute to do alignment or to defect. Restricting fabs or datacenters is definitely stronger than software monitoring (it’s much easier to check), but θ∗ is a capability threshold rather than a compute threshold. This seems unpredictable to estimate as well; I wonder if a verification plan built around compute ceilings is just betting that compute-to-capability holds (I don’t mean to argue against the bitter lesson, just that there are presumably software breakthroughs that could and have occurred).
I also think your point on risk changing over time vs. the uncertainty about a fixed risk getting resolved is important. My perspective is that even granting that the risk decreases with time, it is still non-canonical in the sense that I discuss in my other essay Unprecedented Catastrophes Have Non-Canonical Probabilities. The splice model means that data comes in that looks equally reasonable for “evidence of misalignment” and “evidence of alignment”. I don’t think pause buys time to resolve this in a convergence to safety, I think it buys more time for genuine disagreement about whether the pause is working.
My major concern is that pause structurally encourages defectors who are adversely selected for recklessness. The actors who defect are presumably the ones least likely to want to build safe superintelligence, because they believe they should cut every corner to pass the threshold first.
Overall, thank you again for the thoughtful comment. For me the biggest question is whether alignment progress is measurable and canonically verifiable.
(Also thanks to those who read and are giving me any feedback. I know this is contrarian, but I am always hopeful of receiving a fair shake here!)
Good post, I’ve thought a lot along similar lines. Except I lean further left than most of LW, and don’t trust governments and corporations at all, so my preferred solution would be entangling AI training among people, across borders and class lines as much as possible.
It does lead to some surprising conclusions. For example, someone who works on alignment at BigCo is making the world a worse place (because BigCo’s owners will just use the alignment work to race harder, as happened with RLHF). While someone who works on capability but in an open, GPL-like way is making the world a better place (by removing that capability from the race, making the race less winner-take-all). Counterintuitive, but I think I stand by it.
Thank you so much for the support and read through! Truly made my day.
Have you written about your thoughts on entangling AI among people broadly? I’d love to read more about that idea to think more about it.