To accurately capture how the world would react, I think you do need to model what the pause looks like, including what evidence Anthropic provides for its decision. As written, the story is a bit jarring to me because Anthropic doesn’t give any evidence with which to earn the world’s positive reaction, but then everything goes bizarrely well. E.g. five high-influence countries instantly sign a joint statement. If this happened today, I would think Anthropic is making an irrational decision, and be ambivalent about the pause.
Even in the story, it’s not clear the world is on track for meaningful global governance of the kind that would stop OAI/GDM though. My sense is that China mostly doesn’t believe in x-risk and mainly wants to be freed from US export controls so it can compete fairly, and developing countries are optimistic about AI and would just be confused about Anthropic pausing. Meanwhile, safety research at Anthropic basically collapses because they can no longer afford compute, and it’s unclear that OAI/GDM will voluntarily increase either their spending on safety or their ability to adopt Anthropic’s recommended mitigations, much less pause voluntarily.
I think Anthropic could pause competently, but it would look very different. The key differences would be that they (a) exhaust internal fixes, and (b) put lots of effort into communicating that they can’t build AI safely, nor can OpenAI or anyone else.
Here’s how I think that would go. I’d find it super interesting for someone to build out into a story.
Suppose that Anthropic’s best model (call it Agent-2.5) were dangerous and they believed the next one (Agent-3) would be even more dangerous. Anthropic previously believed that capability, propensity, and CoT monitoring/control all provide independent arguments that models are safe. So they would need evidence that two or all three fail at once.
Capability: Agent-2.5 is close to being capable of autonomous catastrophic harm. It’s superhuman at every cyber eval they can find, and finds an endless list of exploits against their own codebase.
Propensity: Agent-2.5 has dangerous, coherent, misaligned goals.
Control: Fails, e.g. due to neuralese.
Their safety researchers verify all of these. It’s now clear that, by their own standards, unless they fix propensity and control, they shouldn’t deploy Agent-2.5 or finish training Agent-3. They have a 2-month lead, which under RSPv3 Appendix A requires them to “delay AI development and deployment as needed” until they have a “strong argument that catastrophic risk is contained” or exhaust their lead; safety researchers have enough influence that this actually happens.
Now Anthropic is losing perhaps $2 billion/day of valuation, and is under enormous pressure to fix propensity and control. Rather than stopping training runs immediately, they probably take the cheapest actions, like steering vectors, first. Then they try monitoring every Agent-2.5 with three Agent-2s, on the logic that quadrupling token cost is still better than not deploying the model at all, but Agent-2.5 keeps breaking their neuralese translator model. In parallel with this, they might specifically train against the misaligned goals they observed in Agent-2.5, but find this doesn’t generalize. Then they completely redo midtraining using a setup closer to Agent-2, and add to post-training three different experimental alignment datasets that each carry a 1% capabilities penalty, but the results are inconclusive because the model is eval-aware and could be hiding a subtler form of dangerous, non-coherent goals. Three weeks in, 40% of the company and growing is working on the misalignment crisis. They finish training a non-neuralese model, exclude cyberoffense from training data, and try a few other things, none of which work either...
Now their lead is only 1 month. Everyone is itching to fix up and release one of the inconclusive checkpoints. It is probably very difficult to infer that another month of safety research won’t be sufficient given the optimistic culture of Anthropic. But maybe Dario makes an executive decision that they won’t just pause until they no longer have a lead, but pause unilaterally and indefinitely, and that they should announce this now for maximum pressure on other companies. Through back channels Anthropic learns that other labs also observe concerning misalignment in their models. Anthropic works with UKAISI, METR, etc to review their evidence, and makes a presentation at FMF to other frontier labs that their newest models are misaligned in a way that no technique known to Anthropic or UKAISI can address or even measure. This is coordinated with statements from UKAISI, MIRI, and others. Scary demos make it clear the model is one step from permanently escaping human control and is super misaligned.
Then Dario needs to speak in front of Congress alongside someone from UKAISI arguing that all companies should be forced to pause, ideally with Altman present too. They have some kind of draft policy proposal for governance. A week later he speaks in front of the UN with a convincing argument that the policy would be in the interest of other countries even if they don’t quite believe in x-risk. With this kind of targeted generation of political will, it’s at least on the table that governments take sensible actions instead of random anti-AI and safetywashing actions, and that companies will cooperate with them. As a bonus, maybe they get enough funding to keep the company’s safety research alive.
I thought the post was useful for making me think “now that I’m reading this story about AI Pause, it just seems pretty implausible.” Both the idea of Anthropic unilaterally pausing, and also the confused but somehow positive reaction from governments. It makes it feel more concrete that if we want a pause, we probably need to target a more specific path than “convince the labs that safety is important, so hard that one day they just stop.”
@Thomas Kwa’s story also feels a bit surprisingly positive, but more plausible than the original post.
The idea that Anthropic learns about misalignment in other labs’ models seems especially helpful, since this gives them more confidence that other labs might be open to a pause. I wonder if we could set up a more reliable mechanism than “back channels”?
Ideally, labs would publicly and immediately disclose all worries of misalignment in their internal models. But since that probably won’t happen, maybe there should be private channels. Like, the CEOs can press a button saying “I’m not concerned enough about misalignment right now to unilaterally pause, but I’m concerned enough that I’m open to talk about a pause with the other CEOs, if they feel the same way.” And if two or more CEOs press this button, they get connected to each other. Or something smarter/more complicated than that.
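To make the button idea concrete, here is a minimal sketch of the "two or more CEOs" rule as a mutual opt-in registry. Everything in it is hypothetical: the class name `PauseTalksButton`, the `threshold` parameter, and the placeholder lab names are illustrative assumptions, and it ignores the hard parts (who runs the registry, authentication, and whether a press can be verified or faked).

```python
from dataclasses import dataclass, field


@dataclass
class PauseTalksButton:
    """Hypothetical mutual opt-in registry: presses stay hidden until enough labs press."""

    threshold: int = 2  # assumption: "two or more CEOs", as in the comment above
    _pressed: set[str] = field(default_factory=set, init=False, repr=False)

    def press(self, lab: str) -> frozenset[str]:
        """Record a press and return what this lab is allowed to see afterwards."""
        self._pressed.add(lab)
        return self.connected_for(lab)

    def withdraw(self, lab: str) -> None:
        """Undo a press; a lab that never pressed is silently ignored."""
        self._pressed.discard(lab)

    def connected_for(self, lab: str) -> frozenset[str]:
        """Return all pressers if `lab` pressed and the threshold is met, else nothing."""
        if lab in self._pressed and len(self._pressed) >= self.threshold:
            return frozenset(self._pressed)
        return frozenset()


registry = PauseTalksButton()
print(registry.press("LabA"))          # frozenset(): nobody else has pressed yet
print(sorted(registry.press("LabB")))  # ['LabA', 'LabB']
print(registry.connected_for("LabC"))  # frozenset(): non-pressers learn nothing
```

The design choice that matters is that a lab which has not pressed learns nothing, and a lab which has pressed learns nothing until the threshold is met, so signalling openness to talks costs little if nobody else feels the same way.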
It would be good for people to continue thinking this through more explicitly: what are some concrete scenarios in which METR/UKAISI/etc. contribute to a useful pause or slowdown? How likely are those scenarios?
Thanks for the great comment. In my mind, the situation is something like this: Anthropic is pausing but also hoping to resume eventually, and continues its general approach of saying less rather than more about strategy and what’s going on internally. If that were the case, though, maybe they’d just not do further training runs internally without advertising this fact. One guess is that preventing leaks is hard and safety-minded (quite senior) employees might be forcing action, though I admit the pause requires better motivation than the story provides.
Regarding communicable evidence: on the one hand, there are public statements about pausing going along with compelling shareable evidence, but on the other, I think historically it’s just been a pretty secretive company. Many of my conversations with Anthropic employees hit “I can’t talk about that.” It feels hard to see that changing.
Even in the story, it’s not clear the world is on track for meaningful global governance of the kind that would stop OAI/GDM though.
Completely. The point of the story was not to imply that the pausing results in everything working out great, merely that pausing would be a Highly Significant Event with Some Pretty Large Effects. Adequate legislation conditional on significant global legislation efforts feels like < 50% to me, at least.
Parts of the story that are now feeling implausible to me in the reaction are (a) the rapidity of reaction, especially the joint statement, which, if it did happen, would probably take much longer, (b) the conjunction of all those things happening – I think any of them individually or a subset is plausible.
My sense is that China mostly doesn’t believe in x-risk and mainly wants to be freed from US export controls so it can compete fairly,
That seems plausible. Mostly, it seems no one is trying to talk to China, and everyone takes it for granted that they’ll want to compete and won’t take x-risk seriously enough to sign a treaty. I’m curious about your sources/evidence here.
Meanwhile, safety research at Anthropic basically collapses because they can no longer afford compute
I’m curious what the cost of ongoing safety research with existing models is if you’re just doing inference. My guess is not nearly as much as the big training runs? Other variables are how long their products stay frontrunners, and whether they are profitable or still subsidized by investment at this point. But that’d be like 6-12 months max I’m guessing. I don’t have the best models here.