There is a risk that AIs are the winners, and no human countries or human individuals win (this could take the form of extinction or permanent disempowerment). And this risk plausibly decreases over time (rather than being a fixed risk whose uncertainty merely gets resolved), as better understanding of how to manage the technical and social problems becomes available. What a ban/pause buys humanity is this improvement in the chance that there are any winners for humanity at all, once sufficiently capable AIs are eventually built.
(This can make unilaterally defecting undesirable: the defector loses too, getting a worse outcome than if they hadn't defected, so there is no temptation to cheat sooner rather than later. And there is a simple if painful way of ensuring verification: nobody builds fabs that are too advanced, or giant datacenters. There are less painful ways to ensure verification that preserve much of the effect.)
An analysis that ignores this factor could still be useful: it explores the possibility of agreeing on policy with people who categorically disbelieve either literal extinction or permanent disempowerment (where AIs end up controlling almost all resources, the future of humanity is left with scraps, and no possibility remains of ever changing this), or the claim that these risks can improve if humanity takes time rather than building powerful AIs as soon as technologically possible.
Also, as more understanding accumulates and the world gets closer to building sufficiently capable AIs safely, the same issues would emerge as in a world that doesn't significantly risk extinction or permanent disempowerment for all of humanity (or where this risk is constant and isn't being reduced by a longer or better-coordinated ban/pause). So safely coordinating a ban/pause toward its end is also an important class of problems, even in a ban/pause world that solves most of the other problems.
Thanks for the comment; I think you make some great points. I am going to try to explain my responses, even though I suspect they may be as unpopular as the post already is. One quick thing at the outset: my essay is not based on disbelieving the risk of extinction or other terrible outcomes. My claim is that a pause is dominated even if you take that risk very seriously (which I do).
My model does assume that V > 0, and if winning might come with catastrophic devastation then V is a lottery rather than a clear reward. That said, I think adding alignment risk makes the argument harder to address, not easier. The verification problem is the same because you can't tell whether a rival's pause deploys compute to do alignment or to defect. Restricting fabs or datacenters is definitely stronger than software monitoring (it's much easier to check), but θ∗ is a capability threshold rather than a compute threshold. That seems hard to estimate as well; I wonder if a verification plan built around compute ceilings is just betting that the compute-to-capability relationship holds (I don't mean to argue against the bitter lesson, just that there are presumably software breakthroughs that could occur, and have occurred).
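To make the "V is a lottery" point concrete, here is a toy sketch (my own illustrative numbers and function, not the formal model from the post): once the race's prize is a lottery between a good outcome V and a catastrophe of size C, the expected value of defecting depends heavily on the probability that the winner's system is safe.

```python
def defect_payoff(p_win, p_safe, V, C):
    """Expected value of defecting from a pause (toy model):
    win the race with probability p_win; the 'prize' is V with
    probability p_safe and a catastrophe of -C otherwise."""
    return p_win * (p_safe * V + (1 - p_safe) * (-C))

# With a large enough catastrophe, defecting has negative expected
# value even for a likely winner -- the "no temptation to cheat" point.
print(defect_payoff(p_win=0.9, p_safe=0.5, V=100, C=1000))  # negative

# If alignment were solved (p_safe = 1), the lottery collapses back
# to a clear reward and the temptation to defect returns.
print(defect_payoff(p_win=0.9, p_safe=1.0, V=100, C=1000))  # positive
```

The numbers are arbitrary; the structural point is just that p_safe, not V alone, determines whether defection is tempting.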
I also think your point about risk changing over time vs. uncertainty about a fixed risk getting resolved is important. My perspective is that even granting that the risk decreases with time, it is still non-canonical in the sense I discuss in my other essay, Unprecedented Catastrophes Have Non-Canonical Probabilities. The splice model means that data comes in that looks equally consistent with "evidence of misalignment" and "evidence of alignment". I don't think a pause buys time to resolve this into a convergence on safety; I think it buys more time for genuine disagreement about whether the pause is working.
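A minimal Bayesian illustration of this worry (my framing, not the formal splice model): if each observation is equally likely under "aligned" and "misaligned", the likelihood ratio is 1 and the posterior never moves, no matter how long the pause runs.

```python
def posterior(prior, lik_aligned, lik_misaligned):
    """P(aligned | data) via Bayes' rule, given a prior and the
    likelihood of the observed data under each hypothesis."""
    num = prior * lik_aligned
    return num / (num + (1 - prior) * lik_misaligned)

p = 0.5
for _ in range(10):  # ten rounds of ambiguous evidence
    p = posterior(p, lik_aligned=0.7, lik_misaligned=0.7)
print(p)  # still 0.5: extra time alone doesn't resolve the disagreement
```

By contrast, evidence that discriminates between the hypotheses (unequal likelihoods) would move the posterior, which is exactly what "measurable and canonically verifiable" alignment progress would require.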
My major concern is that a pause structurally encourages defectors who are adversely selected for recklessness. The actors who defect are presumably the ones least likely to want to build safe superintelligence, because they believe they should cut every corner to cross the threshold first.
Overall, thank you again for the thoughtful comment. For me the biggest question is whether alignment progress is measurable and canonically verifiable.
(Also thanks to those who read and are giving me any feedback. I know this is contrarian, but I am always hopeful of receiving a fair shake here!)