There are a number of things you say here that don’t seem right to me and/or aren’t capturing the intent of what I said. I prefer not to get into all of it, but just a couple of notes:
My current impression is that we are “highly protected against most attackers’ attempts at stealing model weights,” specifically highly protected against the groups listed as “in scope” (which I think of as including employees at partner orgs who have physical access to machines but not authorized access to weights), and broadly in line with the letter and spirit of the ASL-3 Security Standard. This isn’t my call and I am not up on all of the details of how we’ve vetted the security controls for partners, but it’s my impression.
An attacker being out of scope for the ASL-3 Security Standard does not mean “Anthropic wouldn’t consider it a security incident” if they stole (i.e., exfiltrated/improperly used) important assets.
In particular the first bullet point seems important and clear. I currently think this is unlikely to be true (assuming that e.g. most people in datacenter management and executives at these companies do not have authorized access to weights), but I don’t really know how to progress from here. I might write more if I happen to talk to more people in the field about it.
An attacker being out of scope for the ASL-3 Security Standard does not mean “Anthropic wouldn’t consider it a security incident” if they stole (i.e., exfiltrated/improperly used) important assets.
That makes sense, though to be clear, I was not trying to equate those two. I was saying “Anthropic wouldn’t consider it a security incident if someone with authorized model access were to use those weights how they see fit.” I.e., I was equating authorized access with Anthropic not considering it a security incident if they did stuff with the weights.
But thinking more about it, it does seem like there is a natural difference between “authorized access to model weights” and “authorized to transfer model weights to new machines” or “authorized to perform operations on model weights without extensive logging”, and it makes sense to treat the latter as a security breach even if someone is authorized to access model weights in some sense.
This still leaves me in a somewhat confused spot with regard to the security model here. From my perspective, this still leaves hundreds of people[1] in the world who have both the opportunity and the motive to gain access to Anthropic model weights, with a number of people clearly outside of Anthropic, and with interests misaligned with Anthropic’s, being labeled “sophisticated insiders” and therefore excluded from the threat model in a way that really isn’t obvious from reading the RSP.
And it’s not like I have no sympathy for the difficulty of getting this all right, but the attack surface here feels very different than the one I was expecting to be covered when reading the RSP.
Overall, again, thanks for taking the time to clarify things here. Given the first point, it does seem like we have a disagreement about whether Anthropic is currently meeting its commitments, but it’s not super clear whether it’s worth either of our time to dig into it more.
Maybe only tens, since I don’t actually know who you currently consider to have authorized access to model weights at these other companies. That would be less concerning, though it doesn’t change things much if, e.g., it includes all the top-level executives at these other companies, who have the biggest motive.
Nothing more to add for now, thanks for the response!