Thanks for the comment; I think you make some great points. I'll try to explain my responses, even though I suspect they may be as unpopular as the post already is. One quick thing at the outset: my essay is not premised on disbelieving extinction or other terrible-outcome risks. My claim is that a pause is dominated even if you take that risk very seriously (which I do).
My model does assume that V>0, and if the winner might itself suffer catastrophic devastation, then V is a lottery rather than a sure reward. That said, I think adding alignment risk makes the argument harder to escape, not easier. The verification problem is unchanged, because you can't tell whether a rival that claims to be pausing is using its compute for alignment work or for defection. Restricting fabs or datacenters is definitely stronger than software monitoring (it's much easier to check), but θ∗ is a capability threshold rather than a compute threshold. That seems hard to estimate as well; I wonder whether a verification plan built around compute ceilings is just betting that the compute-to-capability mapping holds (I don't mean to argue against the bitter lesson, only that there are presumably software-side breakthroughs that could occur, and have occurred).
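To make the lottery point concrete, here is a toy sketch (my illustration, not a formula from the essay): if the race winner gets a good outcome worth V with probability p_align and a catastrophe costing D otherwise, the "prize" is an expected value over a lottery, and it can remain positive even under serious catastrophe risk.

```python
# Toy sketch: the winner's prize as a lottery rather than a sure reward.
# V, D, and p_align are hypothetical parameters for illustration only.
def lottery_value(V: float, D: float, p_align: float) -> float:
    """Expected value of winning when the prize itself is a lottery:
    a good outcome worth V with probability p_align, a catastrophe
    costing D otherwise."""
    return p_align * V - (1 - p_align) * D

# Even with a 20% chance of catastrophe, the expected prize stays positive,
# which is why adding alignment risk alone doesn't dissolve the race.
print(lottery_value(V=100.0, D=50.0, p_align=0.8))  # ≈ 70.0
```

The point of the sketch is only that catastrophe risk rescales V; it doesn't change the sign of the incentive unless the expected loss outweighs the expected prize.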
I also think your distinction between risk changing over time and uncertainty about a fixed risk getting resolved is important. My view is that even granting that the risk decreases with time, it is still non-canonical in the sense I discuss in my other essay, Unprecedented Catastrophes Have Non-Canonical Probabilities. The splice model means that incoming data can look equally reasonable as "evidence of misalignment" and as "evidence of alignment". I don't think a pause buys time that converges to safety; I think it buys more time for genuine disagreement about whether the pause is working.
My major concern is that a pause structurally encourages defectors who are adversely selected for recklessness. The actors who defect are presumably the ones least likely to want to build safe superintelligence, because they believe they should cut every corner to cross the threshold first.
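The adverse-selection claim can be sketched as a toy decision rule (again my illustration, not the essay's model): suppose each actor holds a subjective catastrophe probability p and defects from the pause whenever the expected prize (1 − p)·V − p·D is positive. Then the defectors are exactly the actors with the lowest risk estimates.

```python
# Toy sketch of adverse selection under a pause. V and D are hypothetical
# stakes; each actor defects iff their subjective expected prize is positive.
def defectors(risk_estimates, V=100.0, D=1000.0):
    """Return the subjective catastrophe probabilities of actors who defect:
    those for whom (1 - p) * V - p * D > 0."""
    return [p for p in risk_estimates if (1 - p) * V - p * D > 0]

# Hypothetical spread of subjective p(catastrophe) across actors.
actors = [0.01, 0.05, 0.10, 0.30, 0.60]
print(defectors(actors))  # only the lowest-risk believers race: [0.01, 0.05]
```

With these stakes the defection cutoff is p < V/(V + D) ≈ 0.09, so the set of defectors is precisely the most risk-dismissive tail of the distribution, which is the selection effect I'm worried about.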
Overall, thank you again for the thoughtful comment. For me the biggest open question is whether alignment progress is measurable and canonically verifiable.
(Also, thanks to everyone reading and giving feedback. I know this is contrarian, but I'm always hopeful of getting a fair shake here!)