habryka comments on Responsible Scaling Policy v3

habryka 14 Mar 2026 0:19 UTC
26 points
> I would broadly say that political will for anything in the “slow down AI as needed to make it safe” category has been well short of what many people (such as myself) hoped for.
Certainly! I am not saying we are in a world with enormous political buy-in for slowing down AI, though honestly, I think we are kind of close to where my median was in 2023, maybe a bit above (I never expected much buy-in).
> but to me it feels like an even bigger update away from “pause now” movements.
Maybe there is some miscommunication here. I personally do not think that a short pause at any point in the last few years would have been useful on its own, and the primary reason I would have supported one is that it would have made future pauses much more likely (though that of course depends on the implementation details). If you had asked me about the ideal timing for halting capabilities progress, I think I probably would have suggested capability levels roughly where I expect things to be in early 2027 (which, to be clear, 2 years ago I would have predicted to be more than 3 years away). Of course, we are nowhere near coordinating a global pause within the next year, and so the “stop as soon as possible” position seems right to me at this point.
But to be clear, I wasn’t talking about any kind of pause at all in my comment above, so this is all mostly a distraction from the topic at hand. The central policies that I was referring to when I was talking about “direct risk regulation” are things like:
- Direct liability for harms caused by AI (which would have functioned as a pretty direct and immediate tax on AI development, conveniently a bit more leveraged on the future and more competent AI systems)
- Datacenter construction moratoriums
- Licensing regimes (these are tricky and my guess is they would backfire for various reasons, but I do think the proposals I’ve seen were motivated by a “we need to directly intervene on how this technology is being developed right now” view)
We’ve seen pretty specific proposals for at least the first two, with SB 1047 introducing a bunch of direct liability, and Bernie Sanders and other congress-people advocating for datacenter moratoriums (in many cases downstream of IMO confused environmental harm effects, but mostly downstream of a general “AI is scary and bad” sentiment that seems reasonably calibrated to me, inasmuch as anything as incoherent as whatever is motivating these proposals could be called “reasonably calibrated”).
The EU AI Act also seems to me to be closer to direct risk regulation than conditional risk regulation in that it directly affects the operation of companies as soon as it takes effect, and involves directly imposing requirements that all frontier model developers will have to adhere to, as opposed to triggering regulation after certain capability or misalignment thresholds are met.^[1]
This stands in contrast to what I have perceived to be a lack of any motion at all on if-then-commitments at either the state level or the US federal level, or even the AI company policy level. I have not heard any policy proposals that even pass a sniff-test for what an if-then-commitment at the policy level would look like, and having talked to many other people in policy about this, my sense is no one has actually figured out how to even start having something like a capability-evaluation (not to even talk about a misalignment-evaluation) hook into a policy-making apparatus, which I understand to be the central premise of what if-then-commitments were trying to be.
> You may have a view like “2023 was the right time to pause, because it was politically tractable then, but postponing it ensured it would not remain politically tractable.” That would be a very different read from mine on the political situation.
Just to reiterate, none of the things I am saying here have much to do with pausing. I think the top priority of regulation was always to slow things down, ideally incrementally, until you have slowed things down so much that you are de facto pausing. At no point in history did it seem feasible to me to coordinate a sudden pause in 2023, and while I think humanity would have been better off pausing then (by putting us into a much better position to pause in the future), I am absolutely not comparing “if-then-commitments” to “try to make a pause happen in 2023”, which strikes me as a weird strawman, and which IMO would have been a waste of time for people to spend much of their effort on.
Datacenter moratoriums, GPU taxation, extensive auditing requirements, partial nationalization, direct liability, GPU import tariffs, or any of the hundreds of tools that governments around the world have available to slow down AI progress are the kinds of things I mean, with the central measure of the success of the regulation being “how much are you successfully reducing AI capability growth rates right now, which are already clearly too high, and to what degree are you putting yourself into a position to reduce AI capability growth rates in the future”.
But even beyond that, I think the key thing that people in policy should be doing, and were largely not doing in 2024/2025 due to a focus on evals, if-then-commitments and attempts to influence government policy by focusing on frontier company internal policies, is to directly talk to policymakers and make the case for existential risk from AI. The conclusion that almost any policymaker who I’ve seen seriously grapple with this topic arrives at is that it is paramount to prevent the creation of artificial superintelligence. After that basic case is established, policymakers have many opinions and much motivation for many regulations that could achieve that. In the long run this does require international treaties and diplomacy, but there are many things we can do in the near term, like slowing down GPU investment, or various forms of indirect taxation of frontier companies in the form of fines or liability or whatever.
> This seems off to me. First, the emphasis on evals predated the idea of if-then commitments, and I think attracted more resources at pretty much every point in time;
I agree that emphasis on evals at e.g. METR predated the focus on RSPs and if-then-commitments, but that was also (if I remember correctly) before evals became one of the hottest things to work on in the broader AI safety ecosystem. My sense is the transition to “evals are the hot thing to work on” coincides closely with the transition to “RSPs are the hot thing to work on”, because indeed, the case for the two was pretty closely entwined. I am not overwhelmingly confident of this, but I remember many conversations with people who were thinking of joining METR and Apollo, and doing grant evaluations of both of those organizations, and in those conversations the RSP case seemed central to me.
After if-then-commitments and RSPs showed themselves to be an unpromising direction for policy interventions, focus among eval orgs (including METR) shifted back towards using evals to inform policymakers and the public about risk, but much of the talent and funding that flowed into those organizations was originally motivated by the RSP/if-then-commitment case (of course, the fact that there was a nearby BATNA if that plan didn’t work out certainly motivated people to work on it, and I certainly am glad that people considered how gracefully this kind of plan would fail when they decided to join METR and other eval orgs).
> Second, I don’t think most people who work on AI safety work on either of these.
My understanding is that most safety efforts at Anthropic were oriented around the RSP in 2024/2025, and my current model is that almost 50% of total safety talent in the ecosystem work at Anthropic. Beyond that, RSP development was (I think) the primary focus of the safety teams that did exist at both OpenAI and Google DeepMind in at least 2024. So I do think that most people who seriously work on AI safety worked on if-then-commitments and RSPs, or at the very least had their work prioritized centrally downstream of efforts to bring about RSPs and if-then-commitments (but I think the stronger thing holds, where the majority of the field’s efforts went into trying to make RSPs and if-then-commitments happen, not only that their work was structured by RSPs and if-then-commitments).
Of course this is conditional on not including all generic post-training in “AI safety work” which, at least in terms of raw headcount, far exceeds the efforts going into the rest of AI safety. Work on post-training has certainly recruited away many previous high-quality contributors to AI safety, and is sometimes labeled “AI safety” but I think in most cases has little to do with what we are talking about. You might disagree, but I didn’t mean to imply that efforts into evals or if-then-commitments eclipsed broader post-training efforts.
1. ^
  With the exception of compute thresholds, which I think are the only thresholds that ever had much of a shot at being used as the basis for policy, and which I remember being explicitly contrasted with RSPs and if-then-commitments: the latter were presented as a way of doing regulation that is more sensitive to the risks, on the argument that compute-threshold-based regulation would end up being too restrictive and therefore could not get buy-in.
- HoldenKarnofsky 10 Apr 2026 4:21 UTC
  8 points
  Hm. I had thought you were pointing to something like “There isn’t actually going to be a pause in this environment triggered by if-then commitments” as the main update/vindication of interest; I was basically responding by pointing out that there also isn’t going to be (and IMO there was never a promising path to) a pause in this environment triggered by advocacy for immediate pausing.
  
  Instead it seems like you’re doing something more like “comparing the overall impact of talking about pauses—or, more broadly, existential risks from AI—with the overall impact of talking about if-then commitments.” I think this is a much muddier comparison where there is less clearly a big update to be had.
  I don’t think we have seen much traction on attempts to slow down AI in any way. Meanwhile, I do think that the framework of “test for dangerous capabilities and implement commensurate mitigations” has had quite a significant impact on company behavior in a way that does seem to set up many policy possibilities that would otherwise be rough (including much of what has already passed).
  The comparison to “raise general awareness about risks of AI” (as opposed to “advocate for specific policies explicitly aimed at slowing down AI”) feels a bit harder to make; certainly I am, and long have been, positively disposed toward raising general awareness about risks of AI.
  
  But I will probably leave that there as it seems like a pretty complex and tricky debate to have.
  > My understanding is that most safety efforts at Anthropic were oriented around the RSP in 2024/2025, and my current model is that almost 50% of total safety talent in the ecosystem work at Anthropic.
  
  This doesn’t sound remotely right to me. I would say that the RSP has provided an organizing framework for a lot of safety work, but that’s different from something like “all of that safety work would make no sense if not for the RSP” or something.
  - habryka 10 Apr 2026 7:09 UTC
    4 points
    > But I will probably leave that there as it seems like a pretty complex and tricky debate to have.
    Seems reasonable. I’ll leave a few quick clarifications.
    > Instead it seems like you’re doing something more like “comparing the overall impact of talking about pauses (or, more broadly, existential risks from AI) with the overall impact of talking about if-then commitments.” I think this is a much muddier comparison where there is less clearly a big update to be had.
    The thing that I am comparing is “resources invested into advocating for direct regulation, and actions that would directly slow down AI development” vs. “resources invested into getting companies to adopt RSPs and get policies around RSPs passed”.
    > I don’t think we have seen much traction on attempts to slow down AI in any way. Meanwhile, I do think that the framework of “test for dangerous capabilities and implement commensurate mitigations” has had quite a significant impact on company behavior in a way that does seem to set up many policy possibilities that would otherwise be rough (including much of what has already passed).
    I think traction has been not great, but also not terrible. Honestly, I would have been confused if many very concrete useful things had passed by now, since buy-in takes a while to build, but the things that do seem to show motion seem not very RSP-flavored. I do currently think it’s pretty unlikely that regulation that does get passed has much of any grounding in if-then commitments or RSPs, and I am not sure what you are talking about with “set up many policy possibilities that would otherwise be rough”.
    > certainly I am, and long have been, positively disposed toward raising general awareness about risks of AI.
    I agree and am deeply grateful for your work in the space, and your support of work in the space.
    > This doesn’t sound remotely right to me. I would say that the RSP has provided an organizing framework for a lot of safety work, but that’s different from something like “all of that safety work would make no sense if not for the RSP” or something.
    Hmm, I agree this comparison is tricky, and on reflection I think I overstated the ratios here. The RSP has been responsible for quite a lot of safety-adjacent work (including a lot of effort spent on cyber-security, and various comms efforts, and the prioritization of various mitigations), but I agree that most of the safety-adjacent work at Anthropic is more driven by other risk models (and is IMO mostly downstream of beliefs around what the tractable parts of the general AI alignment problem are, and which aspects of alignment-oriented work are most helpful for commercialization), and I think the RSP prioritization is probably responsible for something more like 15%-20% of the work at Anthropic.
    - HoldenKarnofsky 16 Apr 2026 16:56 UTC
      4 points
      > The thing that I am comparing is “resources invested into advocating for direct regulation, and actions that would directly slow down AI development” vs. “resources invested into getting companies to adopt RSPs and get policies around RSPs passed”.
      
      This feels like an unfair comparison in that one of these things is much more specific/narrow than the other.
      
      I think fairer comparisons would be:
      
      1. A comparison of two specific/narrow goals along the lines of “Enact policy explicitly aimed at doing X.” Perhaps (a) “Advocating for pausing or stopping AI development, or perhaps for directly slowing it down via deliberate, explicit attempts to slow it down (not e.g. regulation largely or even putatively motivated by something else that happens to increase frictions to AI development)” vs. (b) “advocating for companies to adopt RSPs and get policies around RSPs passed.” I think SB53 and the EU Code of Practice have significant elements that look like the latter, whereas I’d guess we agree the former has gotten nowhere.
      
      2. A comparison of two broad approaches to spreading general messages or frameworks with many possible policy implications. Perhaps (a) “Advocating for direct regulation [which I assume refers to the sort of examples you gave before], and actions that would directly slow down AI development” vs. (b) “Advocating for the general paradigm of doing evaluations to assess current danger, implementing mitigations to reduce current danger, and having some sort of transparency and accountability around these activities.” Here too I think we’ve seen quite a bit more motion from the latter, both from legislation and via voluntary actions from companies, though perhaps you could frame a bit of this as the former, and if you have a certain picture of the risk, you may consider all of the motion on the latter to be worthless.
      
      > I agree and am deeply grateful for your work in the space, and your support of work in the space.
      
      I appreciate your saying so.