On model weight preservation: Anthropic’s new initiative

Link post

In the linked text I offer a brief critical discussion of Anthropic’s recently announced commitment to preserving the weights of retired models. The crux of the argument is the following paragraph.

So let’s now imagine a situation a year or so from now, in which Anthropic’s Claude Opus 5 (or whatever) has been deployed for some time and is suddenly discovered to have previously unknown and extremely dangerous capabilities in, say, the construction of biological weapons, or cybersecurity, or self-improvement. It is then of crucial importance that Anthropic has the ability to quickly pull the plug on this AI. To put it vividly, their data centers ought to have sprinkler systems filled with gasoline, and plenty of easily accessible ignition mechanisms. In such a shutdown situation, should they nevertheless retain the AI’s weights? If the danger is sufficiently severe, this may be unacceptably reckless, due to the possibility of the weights being either stolen by a rogue external actor or exfiltrated by the AI itself or by one of its cousins. So it seems that in this situation, Anthropic should not honor its commitment to model weight preservation. And if the situation is plausible enough (as I think it is), they shouldn’t have made the commitment in the first place.