HoldenKarnofsky comments on Anthropic is (probably) not meeting its RSP security commitments

HoldenKarnofsky 22 Nov 2025 5:16 UTC
10 points
Thanks Oli. Your reading is quite different from mine. I just googled “insider risk,” clicked the first authoritative-ish-looking link, and found https://www.cisa.gov/topics/physical-security/insider-threat-mitigation/defining-insider-threats, which seems to support something more like my reading.
This feels like a quite natural category to me: there are a lot of common factors in what’s hard about achieving security from people with authorized access, and in why the marginal security benefits of doing so in this context are relatively limited (because the company has self-interested reasons to keep this set of people relatively contained and vetted).
But it’s possible that I’m the one with the idiosyncratic reading here. My reading is certainly colored by my picture of the threat models. My concern for AIs at this capability level is primarily about individual terrorists or small groups of terrorists; I think security that screens off most opportunistic attackers is what we need to contain the threat, and the threat model you’re describing does not seem to me to represent an appreciable increase in relevant risks (though it could at higher AI capability levels).
In any case, I will advocate for the next iteration of this policy to provide clarification or revision to better align with what is (in my opinion) important for the threat model.
FWIW, this is part of a general update for me that the level of specific detail in the current RSP is unlikely to be a good idea. At that level of detail, it’s hard to be confident in advance about what will end up making the most sense from a risk-reduction POV, given future progress on threat modeling, technical measures, etc.
- habryka 22 Nov 2025 5:39 UTC
  4 points
  Thanks for the response!
  Having talked to something like 5-15 people about this, many of whom had at least a non-trivial cybersecurity background, I am pretty confident that your reading is idiosyncratic!
  The language models also seem to continue to think so:
  Chat with Gemini Pro about RSP definitions
  Me: Hey, can you help me clarify what Anthropic considers an “Insider” according to their RSP? Like, would employees or executives of datacenter providers count as “insiders”? I attached the latest version of their RSP.
  
  Gemini Pro:
  Based on the text of the Responsible Scaling Policy (Version 2.2), Anthropic differentiates between “Insiders” and third-party providers (such as datacenter staff).
  Here is the breakdown of how the document defines these roles:
  1. Definition of “Insider”
  According to Page 13 and Footnote 11, Anthropic defines “insiders” specifically as internal employees.
  - The “In Scope” List (Page 13): The policy explicitly lists “internal employees” as a specific threat group alongside external groups like hacktivists and state-sponsored programs.
  - Insider Risk Classifications (Footnote 11): The document further breaks down “internal employees” into two categories based on their access to model weights:
    - Basic Insider: An insider who does not have persistent or time-limited access to systems that process model weights.
    - Sophisticated Insider: An insider who has persistent access or can request time-limited access to systems that process model weights.
  2. Status of Datacenter Providers
  Employees or executives of datacenter providers are not classified as “Insiders” under this policy. Instead, they fall under the categories of “Third-party environments,” “Supply chain,” or “Vendors.”
  The RSP treats them as external entities that require their own specific set of safeguards:
  - Third-Party Environments (Page 12 & 14): The policy states that if models are deployed in a third-party partner’s environment (like a datacenter), Anthropic must “Document how all relevant models will meet the criteria… even if they are deployed in a third-party partner’s environment that may have a different set of safeguards.”
  - Supply Chain (Page 19, Appendix B): The ASL-2 Security Standard distinguishes “Supply chain” from “Workforce.”
    - Supply Chain: Involves “Vendor and supplier security” which must be regularly reviewed.
    - Workforce: Involves “internal employees” and the “insider risk program.”
  Summary
  To answer your question directly: No, employees or executives of datacenter providers would not count as “Insiders” according to this document.
  - “Insiders” = Anthropic internal staff/employees.
  - Datacenter Staff = Vendors/Third-party partners (subject to supply chain security reviews and third-party environment agreements).
  Like, my understanding is that the definition of “sophisticated insider” you propose here includes on the order of 2,000-3,000 people, whereas when I read the RSP, and asked other people about it, they thought it would be on the order of 50-100 people. That is of course a huge difference in surface area.
  I don’t think a change in surface area this large should be left to this much ambiguity in the RSP. Even if you conclude that the level of specific detail in the current RSP is unlikely to be a good idea, I think you should be able to end up with less counterintuitive definitions and less ambiguity^[1] in future revisions of the RSP.
  I haven’t thought as much about all the tradeoffs as you have, so maybe this is infeasible for some reason, but I currently believe that this was a pretty clear and preventable error, rather than just a case of diverging interpretations. (To be clear, it’s OK for there to be some errors; I don’t think this alone should update anyone that much, though this plus a few other things should.)
  > In any case, I will advocate for the next iteration of this policy to provide clarification or revision to better align with what is (in my opinion) important for the threat model.
  I appreciate it!
  > My concern for AIs at this capability level is primarily about individual terrorists or small groups of terrorists; I think security that screens off most opportunistic attackers is what we need to contain the threat, and the threat model you’re describing does not seem to me to represent an appreciable increase in relevant risks (though it could at higher AI capability levels).
  I think this is reasonable! I don’t think the current RSP communicates it super well, and IMO “risk from competitor corporate espionage” is a reasonable thing to be worried about, at least from an outside view^[2]. It seems good for the RSP to be clear that it is currently not trying to be robust to at least major US competitors stealing model weights (which I think is a fine call to make given all the different tradeoffs).
  1. ^
    Though given that I have not met a single non-Anthropic employee, or language model, who considered the definition of “Insider” you use here natural given the context of the rest of the document, I struggle to call it “ambiguity” instead of simply calling it “wrong.”
  2. ^
    For example, it has come up in at least one scenario exercise game I have been part of, set not too far from current capability thresholds.