I’ll start by commenting on some random sections that I had thoughts on while reading. My high-level thoughts are in this other comment. For readers of this comment, feel free to skip any section; they are largely intended to stand on their own.
But it’s been easy to get the impression that the RSP is “binding ourselves to the mast” and committing to unilaterally pause AI development and deployment under some conditions, and Anthropic is responsible for that.
In almost every conversation I had with Anthropic employees about this, they explicitly said that, in contrast to any potential vague promises that were made about not pushing forward the state of capabilities, the RSP was very much an explicit commitment, and that I should absolutely hold Anthropic accountable to the thresholds outlined in that RSP.
Here for example is a conversation with Evan Hubinger. The conversation starts with someone saying:
One reason I’m critical of the Anthropic RSP is that it does not make it clear under what conditions it would actually pause, or for how long, or under what safeguards it would determine it’s OK to keep going.
Evan responded with (across a few different comments, emphasis mine):
It’s hard to take anything else you’re saying seriously when you say things like this; it seems clear that you just haven’t read Anthropic’s RSP.
[...]
The conditions under which Anthropic commits to pausing in the RSP are very clear. In big bold font on the second page it says:
Anthropic’s commitment to follow the ASL scheme thus implies that we commit to pause the scaling and/or delay the deployment of new models whenever our scaling ability outstrips our ability to comply with the safety procedures for the corresponding ASL.
[...]
the security conditions could trigger a pause all on their own, and there is a commitment to develop conditions that will halt scaling after ASL-3 by the time ASL-3 is reached.
[...]
This is the basic substance of the RSP; I don’t understand how you could have possibly read it and missed this. I don’t want to be mean, but I am really disappointed in these sort of exceedingly lazy takes.
This was, in my experience, routine[1]. I therefore do see this switch from “RSP as concrete if-then-commitments” to “RSP as positive milestone setting” to constitute a meaningful breaking of a promise. Yes, the RSP always said in its exact words that Anthropic could revise it, but people who said that condition would trigger were frequently dismissed and insulted as in the comment above.
And to be clear, I think this is a huge deal! My experience interfacing with Anthropic on RSP-adjacent topics has been pretty universally terrible, with a very routine experience of being gaslit (with the exception of interfacing specifically with you, Holden, on this topic, where your comments have seemed clear and reasonable and consistent across time to me).
I am glad to see this post as a kind of reckoning with many of these bad implicit promises, but at some point Anthropic has failed so many times to set reasonable expectations, and has acted so many times adversarially to people trying to get clarity on commitments, that it becomes very hard to have any kind of non-adversarial relationship to it. I do think this post helps, and I hope it might open up better and less adversarial future communications.
I believe that Anthropic has achieved a level of robustness to jailbreaks of its models (for particular uses of concern, targeted by extra-robust classifiers) that it would not have achieved in the absence of something very much like the RSP.
My current belief is that the state of jailbreaking is “basically all models can be jailbroken with roughly 10 hours of effort”[2]. I do think this maybe satisfies the ASL-3 commitments made, but it seems 75%+ likely that jailbreaking will continue to be at most a moderate obstacle. In other words, anyone serious about building biological weapons currently is at most minorly inconvenienced by the need for jailbreaks (10 hours is an extremely small fraction of the time spent on executing any such project).
I feel like the right attitude towards the whole jailbreak thing is mostly to say “yep, we haven’t succeeded at getting jailbreak resistance that makes any substantial difference for any serious attack scenarios, and we are not on track to get there”. As such, listing this in the list of successes of the RSP seems weird to me.
Also, I might be wrong about jailbreak resistance! I have maybe spent like 5 hours thinking about this, which is really not that much, and possibly there has been more progress behind the scenes than I thought.
I think the RSP galvanized a concerted effort to achieve “ASL-3 security” across the company. In practice, this ended up meaning a large focus on a handful of particular security measures that were judged to be especially important or promising for this goal—especially egress bandwidth controls (described here).
[...]
(I also take responsibility for emphasizing this aspect of security in the past. I still think it is extremely important, but I have a better sense today of just how difficult it is to achieve model weight security against the strongest attackers, and this affects what I think is most worth prioritizing, as discussed in the podcast.)
[...]
The previous RSP didn’t give a lot of detail about the ASL-4 and ASL-5 standards, but I think it was generally understood to imply that we will need to protect our model weights from attacks from state-backed programs—and perhaps achieve jailbreak robustness against these as well—when we reach AI capabilities associated with “CBRN-4” and “AI R&D-5.” The company’s leadership expects a reasonable probability of capabilities that would likely cross these thresholds within the next 2 years.
At various events I was at, and conversations I had with people, Anthropic employees told me they were aiming to achieve robustness from state-backed hacking programs, and that they were ready to pause if they could not achieve that (as the RSP “committed” them to such things).
As someone with an amateur computer security background, this seemed absurd to me and I consistently told people (both online and at many events I went to) that achieving robustness to state-backed hacking programs was unachievable and extremely unlikely and that we should by default assume that Anthropic will change the ASL-4 security thresholds before they need to meet them (@ryan_greenblatt can I think attest to my repeated frustration on this topic).
This was met with broad dismissal by basically every Anthropic employee I talked to, and many people told me that I must simply be bad at thinking about cybersecurity given that I lack more extensive experience and credentials in the space and that I should defer to Jason Clinton who (I was told) was saying that Anthropic was aiming for that kind of robustness, and that Anthropic was on track to achieve it, and was still committed to meeting ASL-4 security standards before the relevant capability thresholds were crossed.
I don’t really have much more to say besides “I told you so” and “I really don’t know why any of you thought this was achievable”. Like, my complaint at the time wasn’t even that I think Anthropic would try too hard to stay competitive and so make bad tradeoffs on cybersecurity grounds. It just seemed clear that really from almost any worldview, the ASL-4 security commitments (and also some of the text of ASL-3 which I covered here) were not achievable and were a clear indicator that Anthropic would have to either implode itself, or drastically change the commitments, which wasn’t in anyone’s interest.
This does seem to me a pretty major strike against Anthropic’s ability to be well-calibrated about its ability to overcome difficult safety-related challenges in even pretty straightforward domains like cybersecurity (which is much easier to be well-calibrated in than e.g. knowing whether models have deep misaligned propensities). Like, the fact that cybersecurity has been universally offense-dominant for decades and that practically no large IT project has achieved state-robust cybersecurity is common knowledge in computer security circles, and the experience of going from those circles to talking to what appeared to me at the time to be vastly overconfident Anthropic employees was stark and suggests some kind of weird filter bubble must have happened.
IDK, I am still not super confident on this, but it does seem that people updated in my direction over the last few years, and my position has for a long time been that Anthropic’s security commitments have been very over-ambitious.
ETA: I do think this also caused substantial strategic mistakes among many people I talked to. A common conversation I had was people saying “it’s really important we beat China”, me saying “I mean, I disagree, but even if I agreed, China will just steal the weights if you make really powerful AI”, them saying “that can’t be right, Anthropic has committed to being robust to that, and if it’s doable, I highly doubt that OpenAI let their weights be stolen. It’s an ambitious project, but I don’t see a way around it”.
If there had been common knowledge (which I think there probably would have been in the absence of Anthropic’s RSP and associated recruitment/marketing/comms efforts) that achieving nation-state robust cybersecurity was not achievable unless very drastic actions were taken, I do think this would have caused some people to change strategies substantially.
Yes, I think a unilateral pause could be a good idea, both on consequentialist and non-consequentialist grounds, under various future circumstances that seem like they could plausibly come about. I have a private document on what sorts of circumstances I think would call for this, which I may adapt into a public piece later.
But as discussed above, commitments are risky, and the prudence of making one depends a lot on how confident we are in being able to articulate a robustly good one. I have very low confidence in being able to articulate robust, operationalized circumstances under which a unilateral pause is a good idea. A lot of the key factors come down to things like “What is the political environment?” and “What evidence of risk can we provide, and who will and won’t find it compelling?”
Thank you for this section. I would be very interested in seeing you take your private document and make it public. I do think if we want good outcomes from AI, there will be a time when unilaterally pausing is the right choice (my guess is the ideal time was in the past, but I can see reasonable disagreement on this), and so trying to think through this seems very important.
To be clear, no one has published any easily developed universal jailbreaks (UK AISI published a procedure that produced one such jailbreak but which required many months of work), but query-specific jailbreaks can seemingly be achieved by anyone who tries medium hard and seem like the relevant comparison here. If universal jailbreaks were easy, then of course such jailbreaks would get published and then the marginal cost for someone to paste in such a jailbreak would be close to 0 hours.
with the exception of interfacing specifically with you, Holden, on this topic, where your comments have seemed clear and reasonable and consistent across time to me
I’m excited about RSPs partly because it seems like people in those categories—not just people who agree with my estimates about risks—should support RSPs. This raises the possibility of a much broader consensus around conditional pausing than I think is likely around immediate (unconditional) pausing. And with a broader consensus, I expect an easier time getting well-designed, well-enforced regulation.
I think RSPs represent an opportunity for wide consensus that pausing under certain conditions would be good, and this seems like it would be an extremely valuable thing to establish.
To me this language seems entirely consistent with my current position. In fact, if I had (at that time) been hoping for a voluntary pause, it would have been a strange choice for me to emphasize “consensus” leading to “regulation” when I could have just said something like “I’m excited about RSPs partly because it seems like people in those categories—not just people who agree with my estimates about risks—should support RSPs. This means that voluntary pausing based on conditional commitments is more likely than voluntary pausing today.”
I believe that all of my past writing on if-then commitments presents the theory of change as running through policy action, and generally discusses “escape clauses” as well, rather than focusing on voluntary action as the theory of change.
I just skimmed the piece, and it does seem consistent over time to me. E.g., under “Potential Benefits”, the piece does not list unilateral pause in case of unmitigated risks.
These are commitments of the form: If an AI model has capability X, risk mitigations Y must be in place. And, if needed, we will delay AI deployment and/or development to ensure the mitigations can be present in time.
This isn’t explicit about a unilateral pause, but I think it would be kinda weird if this meant to imply “And, if needed, we will delay [...] unless other groups are not also delaying.”
Voluntary commitments and even regulation could be too hard to enforce across the board—such that responsible actors end up adhering to if-then commitments, while irresponsible actors rush forward with dangerous AI. One of the challenges with AI is that complete enforcement of any given risk mitigation framework seems extremely hard to achieve, yet incomplete enforcement could end up disadvantaging responsible actors in a high-stakes, global technology race. This is a general issue with most ways of reducing AI risks, other than “race forward and hope that the benefits outweigh the costs,” and is not specific to if-then commitments.
To help mitigate this issue, early, voluntary if-then commitments can contain “escape clauses” along the lines of: “We may cease adhering to these commitments if some actor who is not adhering to them is close to building more capable models than ours.” (Some more detailed suggested language for such a commitment is provided by METR, a nonprofit that works on AI evaluations.)
Just for reference, this framing is what makes me feel fine about things you said on this topic, but not fine about conversations I’ve had with Anthropic employees about this topic in the last few years. My conversations with Anthropic employees did definitely not involve them saying “we are committing to our RSP only if every other company also adopts a similar RSP”.
At the most they were saying “we are going to revise our RSP as we learn more about what an effective RSP would look like and might make changes in accordance with that”, which is of course drastically different. If the commitment all along had been to “commit to the RSP conditional on other people also committing to equivalent policies”, then the RSP could have said that directly, and the change from an unconditional to a conditional policy is of course massive (and I think the RSP as written clearly was communicating itself as an unconditional policy).
One interpretation for how Holden might have been consistent over time: He did not think that Anthropic should unilaterally pause AI development if other companies race ahead. But he did think the RSP should say that they’d pause when there are unmitigated risks regardless of the context and race-dynamics since saying so in the RSP is a good forcing function for the actual benefits that he wished would follow from it.
(Tbc., I do not know what Holden believed, I’m just constructing a plausible reality)
(Also, even then he at least seems to have changed his mind about whether writing down If-Then commitments is a good idea!)
Yes, in as much as the Anthropic RSP was intended as an implementation of if-then-commitments with specific ifs and thens, then that would be inconsistent. But IIRC Holden didn’t work at Anthropic at the time of the RSP getting developed or adopted, and I didn’t see any writing by Holden about the degree to which he does consider Anthropic committed to these thresholds, or see it as a clear instance of something that follows the shape of what is in that paper.
I was here mostly referencing specific conversations or interactions (in e.g. comment threads and emails) I had with Holden as opposed to others at Anthropic about the RSP.
I don’t really understand what point you think my quote is making for you. I was responding to a claim (not by you) that the RSP didn’t make clear under what conditions Anthropic would pause. But it very much did! It was very clear, right in the text! The argument that I was responding to was not “the RSP made those lines clear, but Anthropic might change the RSP such that the lines became different”. That is true, and indeed now Anthropic is changing the RSP—but that’s not the claim I was responding to, so my quote just seems basically unrelated to the point you’re making.
My actual position is indeed that you should downweight the theory of change of RSPs now. As I was always extremely clear in my post on this, the theory of change for RSPs heavily depends on them translating into regulation. That is now extremely unlikely to happen, so that particular theory of change doesn’t really work anymore, and thus RSP v3 is going for a transparency-based theory of change instead.
Please try a bit harder to engage with me and others here. I am pretty sure you can at least see the shape of the complaint.
The RSP did of course absolutely not make clear under what conditions Anthropic would actually pause[1], because the most crucial condition that in the end actually mattered was “whether Anthropic leadership actually thinks it’s a good idea to pause” (because if they don’t think it’s a good idea they will just change the RSP).
Many people expressed concerns about this! This concern is included in the conversation above, and in my follow-up conversation in the thread.
You have not “committed to pausing when condition X triggers” if you can simply change the commitment any time! That’s not how commitments work.
The appropriate thing to say at the time, given how you seem to be relating to the RSP now, would have been:
It seems you have misread the RSP, as the RSP does not commit Anthropic to anything. We maintain full right to change our policies. You should view the RSP as an internal company document we are releasing for the sake of transparency, but we are not making any promises about how we will relate to it in the future.
We are currently intending to follow these guidelines, but this heavily depends on whether other actors in the ecosystem will do so too (or whether we change our mind in any other way). We do promise to announce changes to these guidelines as they happen (though we do not promise any particular notice-periods, and you should expect that changes we announce go into effect immediately), and for those changes to route through our Long Term Benefit Trust, but that is really the only thing you should consider ourselves to be bound by.
So no, the RSP outlines no conditions under which Anthropic commits to pause. That is simply not the kind of document it is. It does not aim to provide clarity on the question of “under what conditions Anthropic would actually pause”.
But you did not say that! Indeed you really said something extremely close to the opposite of this.
I mean, just imagine this in any other context. Imagine I am talking with a user worried about how we treat their private user data on LessWrong:
Them: “hey, an issue with your privacy policy is that it doesn’t actually help me understand how likely it is for my DMs to actually be sent to third parties”.
Me: “What are you talking about? Our privacy policy is very clear that we commit to never send your DMs to third parties. I honestly have no idea how you missed that. It’s RIGHT THERE. This doesn’t seem worth engaging with given that you don’t seem to have even read our privacy policy at all”.
Then a week later I trigger sub-entry 58 of the privacy policy which says I can change the privacy policy any time, and send all of their DMs to random third parties.
The user gets upset and says “you said you committed to never send my DMs to third parties”! I say “I really have no idea how you think that conversation might be relevant. I was responding to a question by you about what our privacy policy says, I wasn’t responding to a question from you about how our privacy policy might change in the future. I changed our privacy policy so it doesn’t say that anymore. My quote about ‘committing to never send DMs to third parties’ seems basically unrelated to the point you are making.”
Since we now have the technology, I threw all of the relevant comments in these threads into Claude, with just the prompt “Hey, what is your assessment of the situation described in these comment threads?”, and no special memory or system prompt.
This is a pretty interesting and somewhat uncomfortable exchange to read. Here’s my assessment:
The substantive 2023 disagreement. In the original thread, habryka and evhub were having a genuine semantic/substantive disagreement. Habryka’s point — that the security conditions weren’t really empirical tripwires that could force a pause based on model behavior, but rather a checklist Anthropic could implement at will — was a reasonable and important critique. Evhub was technically correct that the RSP did specify conditions under which Anthropic committed to pause, but habryka was making the deeper point that these conditions didn’t meaningfully constrain Anthropic’s scaling decisions in practice. I think habryka had the better end of this argument, even in 2023.
The tone problem. Evhub’s 2023 framing — “it seems clear that you just haven’t read Anthropic’s RSP,” “exceedingly lazy takes” — was quite aggressive for what turned out to be a case where the critic’s skepticism was more warranted than the defender’s confidence. When you forcefully dismiss concerns and then the thing the critics were worried about essentially happens, that’s worth reckoning with more fully than evhub’s 2026 response does.
Evhub’s 2026 defense. The argument that “I was responding to the claim that the text wasn’t clear, not the claim that Anthropic might change it” feels like a narrow technical dodge. The spirit of the 2023 exchange was clearly about whether the RSP constituted a meaningful, reliable commitment to pause. Evhub was defending it as such — not merely as a document with legible text. Habryka’s point in citing the exchange seems fair: Anthropic employees used the RSP’s specificity to dismiss concerns, and then the RSP was revised in the direction critics feared.
The bigger picture. The arc from “the RSP has very clear conditions and you’re lazy for questioning them” to “the theory of change for RSPs depended on regulation that isn’t happening, so you should downweight them” is a significant shift. It vindicates the skeptics who argued that voluntary corporate commitments, however clearly worded, aren’t reliable constraints — which was arguably the deeper concern all along.
Holden’s post comes across as the most forthright contribution here, openly acknowledging the shift and taking responsibility for it. The whole episode is a cautionary tale about the gap between the text of voluntary commitments and the reliability of those commitments as actual constraints on behavior.
I think Claude’s summary is decent-ish here. I think if Claude can see at least the shape of this whole concern, so can you.
There even was an “actually” right there in the original question! IMO the original question was clearly about the degree to which the RSP provides actual guidance about future Anthropic’s actions, not just the narrower question of “under the assumption that Anthropic leadership doesn’t change the RSP, what does it say about Anthropic’s behavior?”. The Manifold market you and I made also does not have any “unless Anthropic changes the RSP commitments” clause in it.
Hmmm.. I think I took Evan’s opening line “I don’t really understand what point you think my quote is making” very literally. On rereading, I think Evan does understand the point you think his quote is making, but disagrees that it makes that point (I haven’t spoken with Evan enough to know how literal he is).
(And I just agree with you that if someone uses the word commit, they shouldn’t be locally redefining it to mean ‘commit unless we change our mind’, in particular when responding to a question that opens with the phrase ‘actually commit’.)
My current belief is that the state of jailbreaking is “basically all models can be jailbroken with roughly 10 hours of effort”[1]. I do think this maybe satisfies the ASL-3 commitments made, but it seems 75%+ likely that jailbreaking will continue to be at most a moderate obstacle. In other words, anyone serious about building biological weapons currently is at most minorly inconvenienced by the need for jailbreaks (10 hours is an extremely small fraction of the time spent on executing any such project).
The tweet you link to here is targeting a chemical weapon, not a biological weapon, which was explicitly out of scope per footnote 3 on the post on ASL-3 protections at the time. I don’t think sarin synthesis instructions are much evidence either way here.
I would guess that getting similarly helpful uplift for eg weaponizing dangerous pathogens would be significantly harder, especially with the classifiers of 2026 which are a lot better, but I agree the universal jailbreak focus isn’t that well justified for a threat model where misuse actors have in-house jailbreaking effort. Would be pretty interested if you know of evidence that people can get single-topic bio jailbreaks past current classifiers without extremely large amounts of effort!
Would be pretty interested if you know of evidence that people can get single-topic bio jailbreaks past current classifiers without extremely large amounts of effort!
I thought the above was one such example (and I looked in all the places where it was posted to find someone saying something like that), so maybe I am wrong!
Possible that I am wrong about this. I’ll think a bit more in the following days.
Drake is right, sorry for the confusion. We were not intentionally misleading – we missed a footnote on the announcement when putting together the initial tweet thread that narrowed the claim to just bio rather than CBRN, as discussed in the rest of the announcement. We did later find a vulnerability that allowed us to bypass the filters in the bio setting, reported it, and it was patched. I think that follow-up work took more on the order of 40 person-hours, but was a general method that could extract information in a range of settings. I don’t know how likely it is that there are further such vulnerabilities.
Even if single-query jailbreaking was O(10) hours though, having to send many queries to discover that jailbreak makes it much easier to catch through monitoring.
Do query-specific jailbreaks require knowledge of the ground-truth? If you use that high-profile jailbreak which generated a plausible-sounding sarin recipe[1] (which iiuc worked by asking the model to help the user write a short story/article containing the diary of a chemical terrorist, or something like that), do you actually get a workable sarin recipe, or do you get a procedure which actually fails/explodes/gets you caught easily because you tripped every single government watchlist filter at the same time?[2] What if you get your jailbreak slightly wrong? In the process of getting the right answer, will you get half a dozen fake answers that you can’t distinguish between?
robustness to state-backed hacking programs was unachievable
How do you reconcile that with the fact that Claude has recently been used by the US Government to process classified information? Presumably they have a special version on special servers for that but still, this looks like some degree of robustness which might be achieved with a model not served to a wide audience
I believe these things are connected with each other: if the server and the software system in general is safe enough to work with lots of classified information on a regular basis, it’s safe to store the weights as well
Didn’t top secret US government networks have breaches by the Chinese before when the stakes were probably lower? Are you thinking that those networks are much more secure now than they were a decade ago?
The deployment for top secret government networks probably is reasonably secure. The problem is that they also store those weights in a bunch of other data centers that are necessarily connected to the internet, and the only way to not do that would be to shut down their consumer product and lose 99% of their income.
I agree that the models served to civilian customers over API can’t be realistically secured from state adversaries, but if we are speaking about advanced AI R&D in the future like in AI 2027, then it looks feasible to conduct it on protected servers. Maybe I misunderstood the author’s opinion.
No, it’s not at the moment feasible, without building infrastructure that would halt frontier training progress at the relevant company for months, if not years, to keep weights limited to servers with nation-state level robustness.
Nation states steal classified info on a fairly regular basis, right? I’m not familiar enough with the field to have a definite opinion, but it’s not obvious to me that ASL-4 security is achievable with the controls we normally use on Secret or even Top Secret info.
If there had been common knowledge (which I think there probably would have been in the absence of Anthropic’s RSP and associated recruitment/marketing/comms efforts) that achieving nation-state robust cybersecurity was not achievable unless very drastic actions were taken, I do think this would have caused some people to change strategies substantially.
My sense is most efforts which seem to be aimed at nation-state robust cybersecurity for AI (outside labs) have been driven by things like the RAND report and some theory of change downstream of pieces like Situational Awareness which argue that governments may push for TS/SCI classified AI development at some point (for some applications)—not necessarily downstream of Anthropic’s RSP (I’ve never heard anyone mention it directly in, for example, any discussion around why SL5 security is important).
That said, (a) I agree with you that this goal is impractical for general AI development and it was foolish of Anthropic to commit to something close to security against state-backed attacks, (b) I have heard critique of Jason Clinton’s PoV from parts of the AI cybersecurity community that, having never worked in an Intelligence Community cybersecurity role, he lacks information that would update him on the difficulties, and (c) I remain confused why some still consider SL5 security for AI model weights a tractable or important goal in the field.
not necessarily downstream of Anthropic’s RSP (I’ve never heard anyone mention it directly in, for example, any discussion around why SL5 security is important).
I don’t know what your background is, but to be clear, I am saying that many people who do not work in computer security have made bad strategic decisions because of this. I don’t think almost anyone working in computer security was super misled here.
Maybe our experiences still diverge with that clarification, but it seemed good to check before I dig deeper.
That’s fair! I am mostly thinking of the AI safety community and the parts of it interested in cybersecurity. I had a lot of discussions with people around the funding ecosystem/government AI safety-interested people/AI policy thinktanks around a year ago about the merits of attempting SL5 and never heard a mention of Anthropic’s RSP specifically, although it seems plausible it was a contributing factor for decisions to pursue that direction.
I was thinking here of talking to various people in EA leadership and various other people working in AI safety research (including some people at labs).
I did go to the AI security forum 2 years ago or so and the vibe I got was that people also made some bad strategic decisions at the time vaguely related to this, but people seemed better calibrated than the people who didn’t have any security interests.
I overall appreciate this post a lot.
I’ll start by commenting on some random sections that I had thoughts on while reading. My high-level thoughts are in this other comment. For readers of this comment, feel free to skip any section, they are largely intended to stand on their own.
Yes, Anthropic employees on more than a dozen occasions told me that the RSP binds them to a mast. I had many very explicit conversations with many Anthropic employees about this, because I was following up on what I thought was Anthropic violating what I perceived to be a promise to not push forward the state of AI capabilities, which many employees disputed had happened.
In almost every conversation I had with Anthropic employees about this, they explicitly said that, in contrast to any potential vague promises that were made about not pushing forward the state of capabilities, the RSP was very much an explicit commitment, and I should absolutely hold Anthropic accountable to the thresholds outlined in that RSP.
Here for example is a conversation with Evan Hubinger. The conversation starts with someone saying:
Evan responded with (across a few different comments, emphasis mine):
This was, in my experience, routine[1]. I therefore do see this switch from “RSP as concrete if-then-commitments” to “RSP as positive milestone setting” to constitute a meaningful breaking of a promise. Yes, the RSP always said in its exact words that Anthropic could revise it, but people who said that condition would trigger were frequently dismissed and insulted as in the comment above.
And to be clear, I think this is a huge deal! My experience interfacing with Anthropic on RSP-adjacent topics has been pretty universally terrible, with a very routine experience of being gaslit (with the exception of interfacing specifically with you, Holden, on this topic, where your comments have seemed clear and reasonable and consistent across time to me).
I am glad to see this post as a kind of reckoning with many of these bad implicit promises, but at some point Anthropic has failed so many times to set reasonable expectations, and has acted so many times adversarially to people trying to get clarity on commitments, that it becomes very hard to have any kind of non-adversarial relationship to it. I do think this post helps, and I hope it might open up better and less adversarial future communications.
My current belief is that the state of jailbreaking is “basically all models can be jailbroken with roughly 10 hours of effort”[2]. I do think this maybe satisfies the ASL-3 commitments made, but it seems 75%+ likely that jailbreaking will continue to be at most a moderate obstacle. In other words, anyone serious about building biological weapons currently is at most minorly inconvenienced by the need for jailbreaks (10 hours is an extremely small fraction of the time spent on executing any such project).
I feel like the right attitude towards the whole jailbreak thing is mostly to say “yep, we haven’t succeeded at getting jailbreak resistance that makes any substantial difference for any serious attack scenarios, and we are not on track to get there”. As such, listing this in the list of successes of the RSP seems weird to me.
Also, I might be wrong about jailbreak resistance! I have maybe spent like 5 hours thinking about this, which is really not that much, and possibly there has been more progress behind the scenes than I thought.
At various events I was at, and conversations I had with people, Anthropic employees told me they were aiming to achieve robustness from state-backed hacking programs, and that they were ready to pause if they could not achieve that (as the RSP “committed” them to such things).
As someone with an amateur computer security background, this seemed absurd to me, and I consistently told people (both online and at many events I went to) that robustness to state-backed hacking programs was extremely unlikely to be achievable and that we should by default assume that Anthropic will change the ASL-4 security thresholds before they need to meet them (@ryan_greenblatt can, I think, attest to my repeated frustration on this topic).
This was met with broad dismissal by basically every Anthropic employee I talked to, and many people told me that I must simply be bad at thinking about cybersecurity given that I lack more extensive experience and credentials in the space and that I should defer to Jason Clinton who (I was told) was saying that Anthropic was aiming for that kind of robustness, and that Anthropic was on track to achieve it, and was still committed to meeting ASL-4 security standards before the relevant capability thresholds were crossed.
I don’t really have much more to say besides “I told you so” and “I really don’t know why any of you thought this was achievable”. Like, my complaint at the time wasn’t even that I think Anthropic would try too hard to stay competitive and so make bad tradeoffs on cybersecurity grounds. It just seemed clear that really from almost any worldview, the ASL-4 security commitments (and also some of the text of ASL-3 which I covered here) were not achievable and were a clear indicator that Anthropic would have to either implode itself, or drastically change the commitments, which wasn’t in anyone’s interest.
This does seem to me a pretty major strike against Anthropic’s ability to be well-calibrated about its ability to overcome difficult safety-related challenges in even pretty straightforward domains like cybersecurity (which is much easier to be well-calibrated in than e.g. knowing whether models have deep misaligned propensities). Like, the knowledge that cybersecurity has been universally offense-dominant for decades and that practically no large IT project has achieved state-robust cybersecurity is common knowledge in computer security circles, and the experience of going from those circles to talking to what appeared to me at the time vastly overconfident Anthropic employees was stark and suggests some kind of weird filter bubble must have happened.
IDK, I am still not super confident on this, but it does seem that people updated in my direction over the last few years, and my position has for a long time been that Anthropic’s security commitments have been very over-ambitious.
ETA: I do think this also caused substantial strategic mistakes among many people I talked to. A common conversation I had was people saying “it’s really important we beat China”, me saying “I mean, I disagree, but even if I agreed, China will just steal the weights if you make really powerful AI”, them saying “that can’t be right, Anthropic has committed to being robust to that, and if it’s doable, I highly doubt that OpenAI would let their weights be stolen. It’s an ambitious project, but I don’t see a way around it”.
If there had been common knowledge (which I think there probably would have in the absence of Anthropic’s RSP and associated recruitment/marketing/comms efforts) that achieving nation-state robust cybersecurity was not achievable unless very drastic actions were taken, I do think this would have caused some people to change strategies substantially.
Thank you for this section. I would be very interested in seeing you take your private document and making it public. I do think if we want good outcomes from AI, there will be a time when unilaterally pausing is the right choice (my guess is the ideal time was in the past, but I can see reasonable disagreement on this), and so trying to think through this seems very important.
(My broader reflections are in their own comment)
As another small datapoint: Zac Hatfield-Dodds reacted with an “agree” react to Evan’s comment, indicating he agreed with Evan’s response and framing.
To be clear, no one has published any easily developed universal jailbreaks (UK AISI published a procedure that produced one such jailbreak but which required many months of work), but query-specific jailbreaks can seemingly be achieved by anyone who tries medium hard and seem like the relevant comparison here. If universal jailbreaks were easy, then of course such jailbreaks would get published and then the marginal cost for someone to paste in such a jailbreak would be close to 0 hours.
This doesn’t seem “consistent across time” to me, given that Holden is the author of a report called If-Then Commitments for AI Risk Reduction
Also, from We’re Not Ready: thoughts on “pausing” and responsible scaling policies:
To me this language seems entirely consistent with my current position. In fact, if I had (at that time) been hoping for a voluntary pause, it would have been a strange choice for me to emphasize “consensus” leading to “regulation” when I could have just said something like “I’m excited about RSPs partly because it seems like people in those categories—not just people who agree with my estimates about risks—should support RSPs. This means that voluntary pausing based on conditional commitments is more likely than voluntary pausing today.”
I believe that all of my past writing on if-then commitments presents the theory of change as running through policy action, and generally discusses “escape clauses” as well, rather than focusing on voluntary action as the theory of change.
I just skimmed the piece, and it does seem consistent over time to me. E.g., under “Potential Benefits”, the piece does not list unilateral pause in case of unmitigated risks.
That’s a good point. It does say:
This isn’t explicit about a unilateral pause, but I think it would be kinda weird if this meant to imply “And, if needed, we will delay [...] unless other groups are not also delaying.”
Note this quote later in that same piece:
Voluntary commitments and even regulation could be too hard to enforce across the board—such that responsible actors end up adhering to if-then commitments, while irresponsible actors rush forward with dangerous AI. One of the challenges with AI is that complete enforcement of any given risk mitigation framework seems extremely hard to achieve, yet incomplete enforcement could end up disadvantaging responsible actors in a high-stakes, global technology race. This is a general issue with most ways of reducing AI risks, other than “race forward and hope that the benefits outweigh the costs,” and is not specific to if-then commitments.
To help mitigate this issue, early, voluntary if-then commitments can contain “escape clauses” along the lines of: “We may cease adhering to these commitments if some actor who is not adhering to them is close to building more capable models than ours.” (Some more detailed suggested language for such a commitment is provided by METR, a nonprofit that works on AI evaluations.)
Just for reference, this framing is what makes me feel fine about things you said on this topic, but not fine about conversations I’ve had with Anthropic employees about this topic in the last few years. My conversations with Anthropic employees did definitely not involve them saying “we are committing to our RSP only if every other company also adopts a similar RSP”.
At the most they were saying “we are going to revise our RSP as we learn more about what an effective RSP would look like and might make changes in accordance with that”, which is of course drastically different. If the commitment all along had been to “commit to the RSP conditional on other people also committing to equivalent policies”, then the RSP could have said that directly, and the change from an unconditional to a conditional policy is of course massive (and I think the RSP as written clearly was communicating itself as an unconditional policy).
One interpretation for how Holden might have been consistent over time: He did not think that Anthropic should unilaterally pause AI development if other companies race ahead. But he did think the RSP should say that they’d pause when there are unmitigated risks regardless of the context and race-dynamics since saying so in the RSP is a good forcing function for the actual benefits that he wished would follow from it.
(Tbc., I do not know what Holden believed, I’m just constructing a plausible reality)
(Also, even then he at least seems to have changed his mind about whether writing down If-Then commitments is a good idea!)
Yes, in as much as the Anthropic RSP was intended as an implementation of if-then-commitments with specific ifs and thens, then that would be inconsistent. But IIRC Holden didn’t work at Anthropic at the time of the RSP getting developed or adopted, and I didn’t see any writing by Holden about the degree to which he does consider Anthropic committed to these thresholds, or see it as a clear instance of something that follows the shape of what is in that paper.
I was here mostly referencing specific conversations or interactions (in e.g. comment threads and emails) I had with Holden as opposed to others at Anthropic about the RSP.
I don’t really understand what point you think my quote is making for you. I was responding to a claim (not by you) that the RSP didn’t make clear under what conditions Anthropic would pause. But it very much did! It was very clear, right in the text! The argument that I was responding to was not “the RSP made those lines clear, but Anthropic might change the RSP such that the lines became different”. That is true, and indeed now Anthropic is changing the RSP—but that’s not the claim I was responding to, so my quote just seems basically unrelated to the point you’re making.
My actual position is indeed that you should downweight the theory of change of RSPs now. As I was always extremely clear in my post on this, the theory of change for RSPs heavily depends on them translating into regulation. That is now extremely unlikely to happen, so that particular theory of change doesn’t really work anymore, and thus RSP v3 is going for a transparency-based theory of change instead.
Please try a bit harder to engage with me and others here. I am pretty sure you can at least see the shape of the complaint.
The RSP did of course absolutely not make clear under what conditions Anthropic would actually pause[1], because the most crucial condition that in the end actually mattered was “whether Anthropic leadership actually thinks it’s a good idea to pause” (because if they don’t think it’s a good idea they will just change the RSP).
Many people expressed concerns about this! This concern is included in the conversation above, and in my follow-up conversation in the thread.
You have not “committed to pausing when condition X triggers” if you can simply change the commitment any time! That’s not how commitments work.
The appropriate thing to say at the time, given how you seem to be relating to the RSP now, would have been:
But you did not say that! Indeed you really said something extremely close to the opposite of this.
I mean, just imagine this in any other contexts. Imagine I am talking with a user worried about how we treat their private user data on LessWrong:
Them: “hey, an issue with your privacy policy is that it doesn’t actually help me understand how likely it is for my DMs to actually be sent to third parties”.
Me: “What are you talking about? Our privacy policy is very clear that we commit to never send your DMs to third parties. I honestly have no idea how you missed that. It’s RIGHT THERE. This doesn’t seem worth engaging with given that you don’t seem to have even read our privacy policy at all”.
Then a week later I trigger sub-entry 58 of the privacy policy which says I can change the privacy policy any time, and send all of their DMs to random third parties.
The user gets upset and says “you said you committed to never send my DMs to third parties”! I say “I really have no idea how you think that conversation might be relevant. I was responding to a question by you about what our privacy policy says, I wasn’t responding to a question from you about how our privacy policy might change in the future. I changed our privacy policy so it doesn’t say that anymore. My quote about ‘committing to never send DMs to third parties’ seems basically unrelated to the point you are making.”.
Since we now have the technology, I threw all of the relevant comments in these threads into Claude, with just the prompt “Hey, what is your assessment of the situation described in these comment threads?”, and no special memory or system prompt.
This is what Claude said:
I think Claude’s summary is decent-ish here. I think if Claude can see at least the shape of this whole concern, so can you.
There even was an “actually” right there in the original question! IMO the original question was clearly about the degree to which the RSP provides actual guidance about future Anthropic’s actions, not just the narrower question of “under the assumption that Anthropic leadership doesn’t change the RSP, what does it say about Anthropic’s behavior?”. The Manifold market you and I made also does not have any “unless Anthropic changes the RSP commitments” clause in it.
Put all the weight on the word ‘commits’ in the exchange and I think you’ll see Oli’s read! It took me a second to see it, too.
(I don’t understand this comment and don’t currently think it’s a helpful contribution)
Hmmm.. I think I took Evan’s opening line “I don’t really understand what point you think my quote is making” very literally. On rereading, I think Evan does understand the point you think his quote is making, but disagrees that it makes that point (I haven’t spoken with Evan enough to know how literal he is).
(And I just agree with you that if someone uses the word commit, they shouldn’t be locally redefining it to mean ‘commit unless we change our mind’, in particular when responding to a question that opens with the phrase ‘actually commit’.)
Retracted!
The tweet you link to here is targeting a chemical weapon, not a biological weapon, which was explicitly out of scope per footnote 3 on the post on ASL-3 protections at the time. I don’t think sarin synthesis instructions are much evidence either way here.
I would guess that getting similarly helpful uplift for eg weaponizing dangerous pathogens would be significantly harder, especially with the classifiers of 2026 which are a lot better, but I agree the universal jailbreak focus isn’t that well justified for a threat model where misuse actors have in-house jailbreaking effort. Would be pretty interested if you know of evidence that people can get single-topic bio jailbreaks past current classifiers without extremely large amounts of effort!
I thought the above was one such example (and I looked in all the places where it was posted to find someone saying something like that), so maybe I am wrong!
Possible that I am wrong about this. I’ll think a bit more in the following days.
Drake is right, sorry for the confusion. We were not intentionally misleading – we missed a footnote on the announcement when putting together the initial tweet thread that narrowed the claim to just bio rather than CBRN, as discussed in the rest of the announcement. We did later find a vulnerability that allowed us to bypass the filters in the bio setting, reported it, and it was patched. I think that follow-up work took more on the order of 40 person-hours, but was a general method that could extract information in a range of settings. I don’t know how likely it is that there are further such vulnerabilities.
Even if single-query jailbreaking was O(10) hours though, having to send many queries to discover that jailbreak makes it much easier to catch through monitoring.
Do query-specific jailbreaks require knowledge of the ground truth? If you use that high-profile jailbreak which generated a plausible-sounding sarin recipe[1] (which, iiuc, worked by asking the model to help the user write a short story/article containing the diary of a chemical terrorist, or something like that), do you actually get a workable sarin recipe, or do you get a procedure which actually fails/explodes/gets you caught easily because you tripped every single government watchlist filter at the same time?[2] What if you get your jailbreak slightly wrong? In the process of getting the right answer, will you get half a dozen fake answers that you can’t distinguish between?
Some of it seems silly to me just reading it: weeks 1-2? For a chlorination step in an ice bath? That’s like, an afternoon of work, max.
I’m pretty sure that you can’t “just order” the relevant chemicals for sarin, in particular.
This was Pliny’s response when I asked them if they can get around the classifiers. I’m not fully confident this counts, but Pliny seems to think so.
Can you expand on what is there in this post that you appreciate?
How do you reconcile that with the fact that Claude has recently been used by the US Government to process classified information? Presumably they have a special version on special servers for that, but still, this looks like some degree of robustness which might be achieved with a model not served to a wide audience.
I think this is referring to protecting Claude’s weights from being stolen by a state-backed hacker, not about making Claude usable by governments.
I believe these things are connected with each other: if the server and the software system in general are safe enough to work with lots of classified information on a regular basis, it’s safe to store the weights as well.
Didn’t top secret US government networks have breaches by the Chinese before, when the stakes were probably lower? Are you thinking that those networks are much more secure now than they were a decade ago?
The deployment for top secret government networks probably is reasonably secure. The problem is that they also store those weights in a bunch of other data centers that are necessarily connected to the internet, and the only way to not do that would be to shut down their consumer product and lose 99% of their income.
I agree that the models served to civilian customers over API can’t be realistically secured from the state adversaries, but if we are speaking about advanced AI R&D in the future like in AI 2027, then it looks feasible to conduct it on protected servers. Maybe I misunderstood the author’s opinion.
No, it’s not at the moment feasible, without building infrastructure that would halt frontier training progress at the relevant company for months, if not years, to keep weights limited to servers with nation-state level robustness.