Anthropic is (probably) not meeting its RSP security commitments
TLDR: An AI company’s model weight security is at most as good as its compute providers’ security. Anthropic has committed (with a bit of ambiguity, but IMO not that much) to being robust to attacks from corporate espionage teams at the companies where it hosts its weights. Anthropic seems unlikely to be robust to those attacks. Hence it is probably in violation of its RSP.
Note: Jason Clinton, Deputy CISO of Anthropic, left a note on this post. I am not quite sure what to make of it, but it’s probably valuable context.
Context: I shared (approximately) this report with Anthropic 1-2 months ago. Zach Stein-Perlman also made a public comment about it in his shortform 2+ months ago.
Anthropic is committed to being robust to attacks from corporate espionage teams (which includes corporate espionage teams at Google, Microsoft and Amazon)
From the Anthropic RSP:
When a model must meet the ASL-3 Security Standard, we will evaluate whether the measures we have implemented make us highly protected against most attackers’ attempts at stealing model weights.
We consider the following groups in scope: hacktivists, criminal hacker groups, organized cybercrime groups, terrorist organizations, corporate espionage teams, internal employees, and state-sponsored programs that use broad-based and non-targeted techniques (i.e., not novel attack chains).
[...]
We will implement robust controls to mitigate basic insider risk, but consider mitigating risks from sophisticated or state-compromised insiders to be out of scope for ASL-3. We define “basic insider risk” as risk from an insider who does not have persistent or time-limited access to systems that process model weights. We define “sophisticated insider risk” as risk from an insider who has persistent access or can request time-limited access to systems that process model weights.
My best understanding is that the RSP commits Anthropic to being robust to attackers from corporate espionage teams (as included in the list above).
The RSP mentions “insiders” as a class of attacker against whom Anthropic promises less robustness, but doesn’t fully define the term. Everyone I have talked to about the RSP interpreted “insiders” to mean “people who work at Anthropic”. I think it would be a big stretch for “insiders” to include “anyone working at any organization we work with that has persistent access to systems that process model weights”. As such, I think it’s pretty clear that “insiders” is not intended to include e.g. Amazon AWS employees or Google employees.
Claude agrees with this interpretation: https://claude.ai/share/b7860f42-bef1-4b28-bf88-8ca82722ce82
One could potentially make the argument that Google, Microsoft and Amazon should be excluded on the basis of the “highly sophisticated attacker” carve-out:
The following groups are out of scope for the ASL-3 Security Standard because further testing (as discussed below) should confirm that the model would not meaningfully increase their ability to do harm: state-sponsored programs that specifically target us (e.g., through novel attack chains or insider compromise) and a small number (~10) of non-state actors with state-level resourcing or backing that are capable of developing novel attack chains that utilize 0-day attacks.
Multiple people I talked to thought this carve-out was highly unlikely to apply. These attacks would not be nation-state backed, would not require developing novel attack chains that use 0-day attacks, and if you include Amazon and Google in this list of non-state actors, it seems very hard to limit the total list of organizations with that amount of cybersecurity offense capacity (or more) to “~10”.
Again, Claude agrees: https://claude.ai/share/a6068000-0a82-4841-98e9-457c05379cc2
Claude weights that are covered by ASL-3 security requirements are shipped to many Amazon, Google, and Microsoft data centers
Based on the availability of Claude in a wide variety of AWS regions, it appears that Claude weights, for the purpose of inference, are shipped to a large number of Amazon data centers.
Similarly, based on the availability of Claude in a wide variety of Google Cloud regions, the weights are shipped to a large number of Google data centers.
Furthermore, based on the just-announced availability of Claude on Microsoft Foundry, the weights are shipped to a large number of Microsoft data centers.
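As a rough illustration of how broad this footprint is, one can enumerate which regions advertise Anthropic models via the Bedrock control-plane API. This is only a sketch, assuming boto3 is installed and AWS credentials are configured; the region list is an illustrative subset, and advertised availability does not by itself prove exactly where weights physically sit:

```python
# Rough sketch: list which AWS regions advertise Anthropic models on Bedrock.
# Assumes boto3 is installed and AWS credentials are configured.
import boto3

REGIONS = ["us-east-1", "us-west-2", "eu-central-1", "eu-west-3",
           "ap-northeast-1", "ap-southeast-2"]  # illustrative subset, not exhaustive

for region in REGIONS:
    client = boto3.client("bedrock", region_name=region)
    try:
        resp = client.list_foundation_models(byProvider="Anthropic")
        model_ids = [m["modelId"] for m in resp["modelSummaries"]]
        print(f"{region}: {len(model_ids)} Anthropic models advertised")
    except Exception as exc:  # region may not offer Bedrock, or access may be denied
        print(f"{region}: unavailable ({exc})")
```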
This strongly suggests that Claude weights are being processed in ordinary Google, Amazon, and Microsoft data centers, without the kind of extensive precautions that are part of, e.g., high-security government clouds.
As an example, I think it’s quite unlikely that Anthropic has access to cameras, or direct access to verifiably tamper-proof logs of who physically accessed every inference machine that hosts Claude weights (and even if it does, it would have little ability to confirm the routing and physical setup of the machines, to make sure it is not being lied to about the physical locations of the servers, the camera coverage, or the accuracy of the access logs).
This means that, with buy-in from a high-level Amazon, Microsoft, or Google executive, their corporate espionage team would have virtually unlimited physical access to Claude inference machines that host copies of the weights
I think the above implies that if a corporate espionage team were ordered by a high-level executive (which is usually how corporate espionage teams end up doing things) to extract Claude’s weights, they would have virtually unlimited physical access to machines that host them, e.g. by making minor modifications to access logs, or redirecting the network traffic for provisioning a new instance to a different machine, in at least one data center that Amazon or Google runs.
With unlimited physical access, a competent corporate espionage team at Amazon, Microsoft, or Google could extract weights from an inference machine without too much difficulty
Protecting a machine against privilege escalation, or at least against dumps of its memory, by someone with unlimited physical access to it is extremely difficult. Anthropic has written some about what would need to be done to make that closer to impossible in this report: https://assets.anthropic.com/m/c52125297b85a42/original/Confidential_Inference_Paper.pdf
Most of the measures in the report are not currently standard practice in data centers, and the framing of the report (as well as its timing) reads to me as indicating that the data centers Anthropic uses are not fully compliant with the recommendations Anthropic is asking for.
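To give a sense of the kind of control that report is pointing at (this is emphatically not Anthropic’s actual design), confidential inference schemes typically keep weights encrypted at rest and only release the decryption key to a host that passes hardware attestation. A toy sketch follows, with the attestation check stubbed out as a hypothetical `verify_attestation` helper and the `cryptography` library standing in for real key-management machinery:

```python
# Toy illustration of attestation-gated weight decryption (NOT Anthropic's
# actual design): a key service releases the weight-decryption key only to a
# host whose attestation quote verifies. `verify_attestation` is a hypothetical
# stand-in for a real TEE quote-verification step.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def verify_attestation(quote: bytes, expected_measurement: bytes) -> bool:
    # Hypothetical: a real check would validate the quote's signature chain
    # against the hardware vendor's roots and compare launch measurements.
    return quote == expected_measurement  # placeholder logic only

class KeyService:
    def __init__(self, weight_key: bytes, expected_measurement: bytes):
        self._weight_key = weight_key
        self._expected = expected_measurement

    def release_key(self, quote: bytes) -> bytes:
        if not verify_attestation(quote, self._expected):
            raise PermissionError("attestation failed: key not released")
        return self._weight_key

# "Weights" stay encrypted at rest; a host only obtains the key (and thus
# plaintext weights) if it can present a valid attestation quote.
weight_key = AESGCM.generate_key(bit_length=256)
nonce = os.urandom(12)
encrypted_weights = AESGCM(weight_key).encrypt(nonce, b"model weights", None)

service = KeyService(weight_key, expected_measurement=b"trusted-measurement")
key = service.release_key(quote=b"trusted-measurement")  # attested host
plaintext_weights = AESGCM(key).decrypt(nonce, encrypted_weights, None)
```

Without something like this attestation gate, anyone with physical control of the host can read the weights out of memory once they are loaded for inference, which is the gap this post is pointing at.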
Given all of the above, this means Anthropic is in violation of its most recent RSP
Given that the RSP commits Anthropic to being robust to attacks from corporate espionage teams that are not part of a very small number of nation-state-backed or otherwise extremely sophisticated hacking teams, the fact that Amazon and Google corporate espionage teams could currently extract Claude weights without too much difficulty (if at great business and reputational risk) would put Anthropic in violation of its RSP.
To be clear, I don’t think this is that reckless of a choice (I don’t even know whether I think Anthropic having good security is good or bad for the world). It merely seems to me that the RSP as written commits Anthropic to a policy that is incompatible with what Anthropic is actually doing.
Postscript
My best guess is that, starting today with the announcement of Anthropic’s partnership with Microsoft and Nvidia to host Claude in Microsoft data centers, Anthropic is in much more severe violation of its RSP: not only could a team at Google or Amazon get access to Claude weights with executive buy-in, but my guess is that many actors without nation-state capabilities are now able to get access to Claude’s weights.
This is because, as I understand it, Microsoft data center security is generally known to be substantially worse than Google’s or Amazon’s, which at least to me creates substantial suspicion that a broader set of attackers is capable of breaking those defenses.
I am really not confident of this, which is why it’s here in the postscript. Miles Brundage expressing concern about this is what caused me to clean up the above (originally a message I sent to an Anthropic employee) for public consumption. It appears the default trajectory over the past few months has been for Anthropic to weaken, not strengthen, its robustness to external attackers, so it seemed more urgent to publish.
Previous discussion on this topic can be found on this quick take by Zach.
Hi Habryka, thank you for holding us accountable. We do extend ASL-3 protections to all of our deployment environments and cloud environments are no different. We haven’t made exceptions to ASL-3 requirements for any of the named deployments, nor have we said we would treat them differently. If we had, I’d agree that we would have been in violation. But we haven’t. Eventually, we will do so for ASL-4+. I hope that you appreciate that I cannot say anything about specific partnerships.
(I made an account just for this so please don’t expect me to respond here. Feel free to reach out to me via email if you have any follow-up questions.)
Thanks for responding! I understand you to be saying that you feel confident that, even with high-level executive buy-in at Google, Microsoft, or Amazon, none of the data center providers you use would be able to extract the weights of your models. Is that correct?
If so, I totally agree that that would put you in compliance with your ASL-3 commitments.[1] I understand that you can’t provide details about how you claim to be achieving that, and so I am not going to ask further questions about the details (but would appreciate more information nevertheless).
I do find myself skeptical given just your word, but with cybersecurity issues like this it can often be tricky to balance the tradeoff between providing verifiable information and opening up more attack surface.
(Am sending the above as an email and will update this thread if Jason responds with something I can share)
At least on the issues I am trying to point out in this post, outside of the postscript
Jason Clinton very briefly responded saying that the May 14th update, which excluded sophisticated insiders from the threat model, addresses the concerns in this post, plus some short off-the-record comments.
Based on this and Holden’s comment, my best interpretation of Anthropic’s position on this issue is that they currently consider employees at compute providers to be “insiders” and executives to be “sophisticated insiders”, with the latter thereby excluded from Anthropic’s security commitments. They likely also think that compute providers do not have the capacity to execute attacks like this without a very high chance of getting caught, so this is not a threat model they are concerned about.
As I argue for in the post, defining basically all employees at compute providers to be “insiders” feels like an extreme stretch of the word, and I think has all kinds of other tricky implications for the RSP, but it’s not wholly inconsistent!
To bring Anthropic back into compliance with what I think is a common-sense reading, I would suggest (in descending order of goodness):
updating the RSP with a carve-out for major compute providers,
using a different word than “insider” for the class that Anthropic means to exclude in its May update,
or at the very least providing a clear definition of “insider” within the RSP (though largely I would advocate against using the word “insider” at all here, since I don’t think people expect it to include compute provider employees).
I separately seem to disagree with Anthropic about whether this is something executives at Google/Amazon/Microsoft could be motivated to do, and something they would succeed at if they tried, but given the broad definition of “insider” this doesn’t appear to be load-bearing for the thesis in the OP. I have written a bit more about it anyway in this comment thread.
Hi Oli, the threat model you’re describing is out of scope for our RSP, as I think the May 14 update (last page) makes clear. This point is separate from Jason’s point about security levels at cloud partners.
(Less importantly, I will register confusion about your threat model here—I don’t think there are teams at these companies whose job is to steal from partners with executive buy-in? Nor do I think this is likely for executives to buy into in general, at least until/unless AI capabilities are far beyond today’s.)
I disagree. I address that section explicitly here:
The Claude transcripts are all using PDFs of the latest version of the RSP. I also ran this by multiple other people, and none of them thought it was reasonable for a definition of “insiders” to apply to employees at major data center providers. “Insiders”, I think, pretty clearly means “employees, or at most contractors, of Anthropic”. If you include employees at organizations Anthropic is working with, you quickly run into a bunch of absurdities and contradictions within the RSP that I think clearly show it must have a definition as narrow as this.
Therefore, I do not see how “This update excludes both sophisticated insiders and state-compromised insiders from the ASL-3 Security Standard.” could exclude employees and executives at Microsoft, Google, or Amazon, unless you define “Insider” to mean “anyone with persistent physical access to machines holding model weights” in which case I would dispute that that is a reasonable definition of “insider”. If Anthropic ships their model weights to another organization, clearly employees of that organization do not by default become “insiders” and executives do not become “sophisticated insiders”. If Anthropic ships their models to another organization and then one of their executives steals the weights, I think Anthropic violated their RSP as stated.
If you are clarifying that Anthropic, according to its RSP, could send unencrypted weights to the CEOs of arbitrary competing tech companies, with a promise to please not actually use them for anything, and that this would not constitute a breach of its RSP because competing tech companies’ CEOs are “high-level insiders”, then I think this would be really good to clarify! I really don’t think that is a natural interpretation of the current RSP (and Claude and multiple people I’ve talked to about this, e.g. @Buck and @Zach Stein-Perlman, agree with me here).
I don’t think this is a particularly likely threat model, but also not an implausible one. My position for a long time has been that Anthropic’s RSP security commitments have been unrealistically aggressive (but people around Anthropic have been pushing back on that and saying that security is really important and so Anthropic should make commitments as aggressive as this).
I think it would be a scandal at roughly the scale of the Volkswagen emissions scandal if a major lab decided to do something like this, i.e. a really big deal, but not unprecedented. My current guess is that it’s about 50% likely that at least one of Google, Amazon, or Microsoft has a corporate espionage team capable of doing this kind of work, and something like 6% likely that any of Microsoft, Google, or Amazon would consider it worth the risk to attempt exfiltrating the model weights of a competing organization via some mechanism like this within the next year.
I have written a bit about this in a Twitter discussion:
Thanks Oli. Your reading is quite different from mine. I just googled “insider risk,” clicked the first authoritative-ish-looking link, and found https://www.cisa.gov/topics/physical-security/insider-threat-mitigation/defining-insider-threats which seems to support something more like my reading.
This feels like a quite natural category to me: there are a lot of common factors in what’s hard about achieving security from people with authorized access, and in why the marginal security benefits of doing so in this context are relatively limited (because the company has self-interested reasons to keep this set of people relatively contained and vetted).
But it’s possible that I’m the one with the idiosyncratic reading here. My reading is certainly colored by my picture of the threat models. My concern for AIs at this capability level is primarily about individual or small groups of terrorists, I think security that screens off most opportunistic attackers is what we need to contain the threat, and the threat model you’re describing does not seem to me like it represents an appreciable increase in relevant risks (though it could at higher AI capability levels).
In any case, I will advocate for the next iteration of this policy to provide clarification or revision to better align with what is (in my opinion) important for the threat model.
FWIW, this is part of a general update for me that the level of specific detail in the current RSP is unlikely to be a good idea. It’s hard to be confident in advance of what will end up making the most sense from a risk reduction POV, following future progress on threat modeling, technical measures, etc., at the level of detail the current RSP has.
Thanks for the response!
Having talked to something like 5-15 people about this, many of whom had at least a non-trivial cybersecurity background, I am pretty confident that your reading is idiosyncratic!
The language models also seem to continue to think so:
Chat with Gemini Pro about RSP definitions
Me: Hey, can you help me clarify what Anthropic considers an “Insider” according to their RSP? Like, would employees or executives of datacenter providers count as “insiders”? I attached the latest version of their RSP.
Gemini Pro:
Based on the text of the Responsible Scaling Policy (Version 2.2), Anthropic differentiates between “Insiders” and third-party providers (such as datacenter staff).
Here is the breakdown of how the document defines these roles:
1. Definition of “Insider”
According to Page 13 and Footnote 11, Anthropic defines “insiders” specifically as internal employees.
The “In Scope” List (Page 13): The policy explicitly lists “internal employees” as a specific threat group alongside external groups like hacktivists and state-sponsored programs.
Insider Risk Classifications (Footnote 11): The document further breaks down “internal employees” into two categories based on their access to model weights:
Basic Insider: An insider who does not have persistent or time-limited access to systems that process model weights.
Sophisticated Insider: An insider who has persistent access or can request time-limited access to systems that process model weights.
2. Status of Datacenter Providers
Employees or executives of datacenter providers are not classified as “Insiders” under this policy. Instead, they fall under the categories of “Third-party environments,” “Supply chain,” or “Vendors.”
The RSP treats them as external entities that require their own specific set of safeguards:
Third-Party Environments (Page 12 & 14): The policy states that if models are deployed in a third-party partner’s environment (like a datacenter), Anthropic must “Document how all relevant models will meet the criteria… even if they are deployed in a third-party partner’s environment that may have a different set of safeguards.”
Supply Chain (Page 19, Appendix B): The ASL-2 Security Standard distinguishes “Supply chain” from “Workforce.”
Supply Chain: Involves “Vendor and supplier security” which must be regularly reviewed.
Workforce: Involves “internal employees” and the “insider risk program.”
Summary
To answer your question directly: No, employees or executives of datacenter providers would not count as “Insiders” according to this document.
“Insiders” = Anthropic internal staff/employees.
Datacenter Staff = Vendors/Third-party partners (subject to supply chain security reviews and third-party environment agreements).
Like, my understanding is that the definition of “sophisticated insider” you propose here includes on the order of 2,000-3,000 people, whereas when I read the RSP, and asked other people about it, they thought it would be on the order of 50-100 people. That is of course a huge difference in surface area.
I don’t think this change in surface area is the kind of change that should be left up to this much ambiguity in the RSP. Even if you update toward the level of specific detail in the current RSP being a bad idea, I think you should be able to end up with less counterintuitive definitions and less ambiguity[1] in future revisions of the RSP.
I haven’t thought as much about all the tradeoffs as you have, so maybe this is infeasible for some reason, but I currently believe that this was a pretty clear and preventable error, instead of just a case of diverging interpretations (and to be clear, it’s OK for there to be some errors, I don’t think this thing alone should update anyone that much, though this plus a few other things should).
I appreciate it!
I think this is reasonable! I don’t think the current RSP communicates that super well, and “risk from competitor corporate espionage” is IMO a reasonable thing to be worried about, at least from an outside view[2]. It seems good for the RSP to be clear that it is currently not trying to be robust to at least major US competitors stealing model weights (which I think is a fine call to make given all the different tradeoffs).
Though given that I have not met a single non-Anthropic employee, or language model, who considered the definition of “insider” you use here natural given the context of the rest of the document, I struggle to call it “ambiguity” instead of simply calling it “wrong”.
It is, for example, a thing that has come up in at least one scenario exercise game I have been part of, not too far from where current capability thresholds are.
Reading the May 14 update, it looks like it describes adding the last paragraph of Habryka’s opening blockquote. If that’s right, he goes on to describe why this exclusion wouldn’t trigger here.
I agree but don’t feel very strongly. On Anthropic security, I feel even more sad about this.
As I say at the end, I don’t particularly care about Anthropic’s security commitments here; what I do care about is the RSP meaning anything at all!
And to be clear, my belief for a long time has been that the RSP was unlikely to have much predictive power over Anthropic’s priorities, so part of my motivation here is establishing common knowledge about that, so people can push on other governance approaches that aren’t relying on companies holding themselves to their RSPs.
Could you share the TL;DR for why this might be bad for the world?
It’s plausible to me that Anthropic (and other frontier labs) having bad security is good because it deflates race dynamics (if your competitors can just steal your weights after you invest $100b into a training run, you will probably think twice). Bad cybersecurity means you can’t capture as much of the economic value provided by a model.
Furthermore, “bad cybersecurity is the poor man’s auditing agreement”. If I am worried about a lab developing frontier models behind closed doors, then them having bad cybersecurity means other actors can use the stolen weights to check whether a model poses a national security risk to them, and intervene before it is too late.
Is this a terrible solution to auditing? Yes! Are we going to get something better by default? I really don’t know, I think not any time soon?