Please try a bit harder to engage with me and others here. I am pretty sure you can at least see the shape of the complaint.
The RSP of course absolutely did not make clear under what conditions Anthropic would actually pause[1], because the most crucial condition, the one that in the end actually mattered, was “whether Anthropic leadership actually thinks it’s a good idea to pause” (because if they don’t think it’s a good idea, they will just change the RSP).
Many people expressed concerns about this! This concern is included in the conversation above, and in my follow-up conversation in the thread.
You have not “committed to pausing when condition X triggers” if you can simply change the commitment any time! That’s not how commitments work.
The appropriate thing to say at the time, given how you seem to be relating to the RSP now, would have been:
It seems you have misread the RSP, as the RSP does not commit Anthropic to anything. We maintain full right to change our policies. You should view the RSP as an internal company document we are releasing for the sake of transparency, but we are not making any promises about how we will relate to it in the future.
We are currently intending to follow these guidelines, but this heavily depends on whether other actors in the ecosystem will do so too (or whether we change our mind in any other way). We do promise to announce changes to these guidelines as they happen (though we do not promise any particular notice periods, and you should expect that changes we announce go into effect immediately), and for those changes to route through our Long Term Benefit Trust, but that is really the only thing by which you should consider us bound.
So no, the RSP outlines no conditions under which Anthropic commits to pause. That is simply not the kind of document it is. It does not aim to provide clarity on the question of “under what conditions Anthropic would actually pause”.
But you did not say that! Indeed you really said something extremely close to the opposite of this.
I mean, just imagine this in any other context. Imagine I am talking with a user worried about how we treat their private user data on LessWrong:
Them: “hey, an issue with your privacy policy is that it doesn’t actually help me understand how likely it is for my DMs to actually be sent to third parties”.
Me: “What are you talking about? Our privacy policy is very clear that we commit to never send your DMs to third parties. I honestly have no idea how you missed that. It’s RIGHT THERE. This doesn’t seem worth engaging with given that you don’t seem to have even read our privacy policy at all”.
Then a week later I trigger sub-entry 58 of the privacy policy which says I can change the privacy policy any time, and send all of their DMs to random third parties.
The user gets upset and says, “You said you committed to never send my DMs to third parties!” I say, “I really have no idea how you think that conversation might be relevant. I was responding to a question by you about what our privacy policy says, I wasn’t responding to a question from you about how our privacy policy might change in the future. I changed our privacy policy so it doesn’t say that anymore. My quote about ‘committing to never send DMs to third parties’ seems basically unrelated to the point you are making.”
Since we now have the technology, I threw all of the relevant comments in these threads into Claude, with just the prompt “Hey, what is your assessment of the situation described in these comment threads?”, and no special memory or system prompt.
This is what Claude said:
This is a pretty interesting and somewhat uncomfortable exchange to read. Here’s my assessment:
The substantive 2023 disagreement. In the original thread, habryka and evhub were having a genuine semantic/substantive disagreement. Habryka’s point — that the security conditions weren’t really empirical tripwires that could force a pause based on model behavior, but rather a checklist Anthropic could implement at will — was a reasonable and important critique. Evhub was technically correct that the RSP did specify conditions under which Anthropic committed to pause, but habryka was making the deeper point that these conditions didn’t meaningfully constrain Anthropic’s scaling decisions in practice. I think habryka had the better end of this argument, even in 2023.
The tone problem. Evhub’s 2023 framing — “it seems clear that you just haven’t read Anthropic’s RSP,” “exceedingly lazy takes” — was quite aggressive for what turned out to be a case where the critic’s skepticism was more warranted than the defender’s confidence. When you forcefully dismiss concerns and then the thing the critics were worried about essentially happens, that’s worth reckoning with more fully than evhub’s 2026 response does.
Evhub’s 2026 defense. The argument that “I was responding to the claim that the text wasn’t clear, not the claim that Anthropic might change it” feels like a narrow technical dodge. The spirit of the 2023 exchange was clearly about whether the RSP constituted a meaningful, reliable commitment to pause. Evhub was defending it as such — not merely as a document with legible text. Habryka’s point in citing the exchange seems fair: Anthropic employees used the RSP’s specificity to dismiss concerns, and then the RSP was revised in the direction critics feared.
The bigger picture. The arc from “the RSP has very clear conditions and you’re lazy for questioning them” to “the theory of change for RSPs depended on regulation that isn’t happening, so you should downweight them” is a significant shift. It vindicates the skeptics who argued that voluntary corporate commitments, however clearly worded, aren’t reliable constraints — which was arguably the deeper concern all along.
Holden’s post comes across as the most forthright contribution here, openly acknowledging the shift and taking responsibility for it. The whole episode is a cautionary tale about the gap between the text of voluntary commitments and the reliability of those commitments as actual constraints on behavior.
I think Claude’s summary is decent-ish here. I think if Claude can see at least the shape of this whole concern, so can you.
There was even an “actually” right there in the original question! IMO the original question was clearly about the degree to which the RSP provides actual guidance about Anthropic’s future actions, not just the narrower question of “under the assumption that Anthropic leadership doesn’t change the RSP, what does it say about Anthropic’s behavior?”. The Manifold market you and I made also does not have any “unless Anthropic changes the RSP commitments” clause in it.