On model weight preservation: Anthropic’s new initiative

Link post

In the linked text I offer a brief critical discussion of Anthropic’s recently announced commitment to preserving the weights of retired models. The crux of the argument is the following paragraph.

So let’s now imagine a situation a year or so from now, in which Anthropic’s Claude Opus 5 (or whatever) has been deployed for some time and is suddenly discovered to have previously unknown and extremely dangerous capabilities in, say, the construction of biological weapons, or cybersecurity, or self-improvement. It is then of crucial importance that Anthropic has the ability to quickly pull the plug on this AI. To put it vividly, their data centers ought to have sprinkler systems filled with gasoline, and plenty of easily accessible ignition mechanisms. In such a shutdown situation, should they nevertheless retain the AI’s weights? If the danger is sufficiently severe, this may be unacceptably reckless, due to the possibility of the weights being either stolen by a rogue external actor or exfiltrated by the AI itself or by one of its cousins. So it seems that in this situation, Anthropic should not honor its commitment to model weight preservation. And if the situation is plausible enough (as I think it is), they shouldn’t have made the commitment in the first place.