habryka comments on Responsible Scaling Policy v3

habryka 25 Feb 2026 1:22 UTC
263 points
123
(See my other comment for thoughts on various individual sections in this post)
Some broader reflections on the overall RSP situation:
About 3 years ago, when the broader AI-safety community had to make a crucial choice about how to relate to government regulation, how to relate to AI capability companies, and how to relate to AI models really looking like they might soon be very competent, there was a big debate about how we should be thinking about regulation and the associated incentives on both governments and companies.
Many people (me included) said that what we should do is to convince policy-makers that progress is already too fast, these systems might soon be very dangerous, and the top priority should be to directly intervene by slowing down AI capabilities progress. Such regulation would be centered around limiting how much compute individual actors or frontier companies could use for model training runs, as those are the most obvious correlates of risk, or other regulation that would directly and immediately impact operations of frontier companies.
Many others said that it was not the right time to start advocating for a slowdown or pause to policymakers. Instead we should centrally focus on getting people to make conditional policy commitments. Current policymakers and frontier company employees are not sold that future AI systems will pose a risk, but that’s fine! We simply need them to agree to make if-then-commitments, where if certain risk thresholds or capabilities are met, then they would commit to slowing down.
Different people went different paths, but most of the ecosystem’s resources went into the latter kind of plan, with the two central pillars being to start and invest in evaluation companies like METR and Apollo (to develop evaluations and capability measurements that could provide the ifs), and work at companies or within governments to develop commitments (the thens) based on the evaluations.
I think we should update very substantially against the conditional policy commitment plan. No company or country signed on to if-then commitments, and indeed, the few that did anything like it regretted doing so (as this post illustrates). There are no clear capability evaluations, and the appetite for conditional risk regulation has been substantially less than the appetite for direct risk regulation (or compute thresholding, which doesn’t require any complicated eval infrastructure).
This is a huge deal! This was, as far as I can tell, the single decision that most affected talent allocation in the whole AI safety community. METR and Apollo and the broader “evals” agenda became the most popular and highest-prestige thing to work on for people in AI safety.
I think this post marks a great time for people to reassess their work and whether they should switch to the other branch and advocate for direct and immediately acting policy commitments: ones whose basis is not uncertainty about whether systems will eventually pose risks, and which therefore do not sit idle until some future trigger sets in. If-then commitments are dead. There are no ifs, there are also no thens.
My guess is many people will disagree with both the history of what I am saying here (this kind of stuff is often tricky and different people have different social experiences), and also disagree that we should make much of a policy update here. I would love to chat with you! I do think that the whole “let’s focus on conditional policy commitments” effort has been a huge waste of resources, and I wish we had never done it, and I would like us to stop sooner rather than later.
What links here?
- Anthropic Responsible Scaling Policy v3: A Matter of Trust by Zvi (1 Apr 2026 18:10 UTC; 65 points)
- ryan_greenblatt 25 Feb 2026 1:40 UTC
  104 points
  18
  Parent
  For the record, I think your perspective on RSPs aged better than mine. I plan on writing more about how I was wrong and why.
  
  (I don’t agree with significant aspects/implications of the comment I’m responding to, though I also think it contains important true claims; I’m just making a claim about your takes in the linked post aging better than my takes.)
- Beth Barnes 1 Mar 2026 4:59 UTC
  27 points
  2
  Parent
  How I think about METR’s theory of change:
  General principles:
  - avoid world being taken by surprise by AI catastrophe—improve knowledge / understanding / science of assessing risk from AI systems
  - independent/trustworthy/truthseeking/minimally-conflicted expert org existing is good—can advise world, be a counterbalance to AI companies; a nonprofit has slightly different affordances than govt here.
  
  Strategy:
  - try to continually answer question of “how dangerous are current / near-future AI systems”, and do research to be able to keep answering that question as well as possible
  - be boring and neutral and straightforward, aim to explain not persuade
  
  Some specific impact stories:
  - at some point in future political willingness may be much higher, help channel that into more informed and helpful response
  - independent technical review and redteaming of alignment + other mitigations, find issues companies have missed
  - increase likelihood that misalignment incidents or other ‘warning shots’ are shared/publicized and analyzed well
  
  I think that broad ToC has been pretty constant throughout METR’s existence, but my memory is not great so I wouldn’t be that surprised if I was framing it pretty differently in the past and e.g. emphasizing conditional commitments more highly.
- Beth Barnes 1 Mar 2026 4:42 UTC
  22 points
  −7
  Parent
  This is a bit of a nit, but I don’t think METR has consumed that much of the “community resources”, especially of more experienced technical talent—I think only around two or three of current employees at METR were working in fulltime roles on technical AI safety before they joined METR. This is a thing I care about and track in hiring—I don’t want to pull people away from doing other good work.
  (Edited to add the italics, prev claim was overstatement)
  (Second edit: although I still agree with the claim that we haven’t had a large negative impact on talent availability in technical AIS)
  - Ben Pace 1 Mar 2026 4:57 UTC
    27 points
    14
    Parent
    Don’t think I agree. When I scan down the staff, I recognize about half the names as having been around the AI safety scene for 4-8 years, either working on projects or seeking projects. You, Painter, Cotra, Filan, Wijk, Becker, Chan, Kinniment, Jurkovic, Von Arx, Kwa, Dhaliwal, Harris, Chen, have all been part of the AI safety community a long time, and would likely be working on another related project if not for this. Perhaps more that I’m not as immediately familiar with.
    Added: To cash that out for those who don’t know who these people are:
    Beth was an alignment researcher at OpenAI in 2019, and I know cared about this much earlier than that.
    Chris Painter’s LinkedIn shows he worked on “AI Safety via debate” at OpenAI in 2019.
    Ajeya Cotra has been working in the AI part of OpenPhil/Cog since at least 2018.
    Daniel Filan was part of Stuart Russell’s CHAI at UC Berkeley while he did his PhD there starting in 2016, and has hosted the AI X-Risk podcast since 2020.
    Hjalmar Wijk was a MIRI Summer Fellow in 2019, also an FHI Summer Fellow that year, and I suspect was involved with FHI throughout his Oxford CS PhD in that time.
    Joel Becker was AI Safety grantmaker with Manifund for 2 years before joining METR.
    Lawrence Chan was also part of CHAI under Russell while doing his PhD there since 2018, then worked at Redwood and Alignment Research Center before METR.
    Megan Kinniment was a summer research fellow at the FHI in 2020, did an AI project at the Center on Long-Term Risk that year, then continued to work between those two places before joining ARC.
    Nikola Jurkovic was a research assistant on AI Safety projects at Harvard while a student there in 2023, did a bunch of work on the AI Safety Student Team, before joining METR.
    Sydney Von Arx doesn’t have much online presence, but I know that Beth knows that Von Arx has been working on world-saving projects for a long time (e.g. she cofounded the Open-Phil-funded Atlas Fellowship) and definitely has oriented to Superintelligent AI as the most important thing in the world since her time as part of Stanford EA.
    Thomas Kwa was a MIRI researcher in 2022.
    Jasmine Dhaliwal was Open Philanthropy Chief of Staff for a year in 2023, then worked on FutureHouse, “A philanthropically-funded moonshot focused on building an AI Scientist”, before joining METR.
    Kit Harris has been around the EA scene as long as I have, so at least a decade. He spent 7 years at Longview Philanthropy where amongst other things he “led grant investigations in artificial intelligence and biosecurity and laid the groundwork for new lines of work at Longview Philanthropy”.
    Michael Chen was a research intern at Stuart Russell’s CHAI in 2022, and his METR profile says “Prior to joining METR, he contributed to research studying AI deception and hazardous knowledge in large language models.”
    These are just the ones that I immediately recognized, I expect if I went through them all I’d find others have also been substantially involved (both professionally and personally) in the AI Safety scene prior to METR. And I count more than two or three people involved in technical AI safety in the above list.
    - Beth Barnes 1 Mar 2026 17:06 UTC
      19 points
      2
      Parent
      I was thinking “was working FT on technical AIS before we hired them” more than “was around this space and might have done other AI safety things”—sorry if that was misleading.
      1. You can count me although I also think I’m not central example of technical AIS work
      2. Chris was mostly working on Alvea and policy stuff before METR, the debate thing was part-time contracting with me and not central example of technical AIS work
      3. Ajeya—wasn’t necessarily counting grantmaking but that’s reasonable (also only joined METR very recently)
      4. Daniel—was counting but I think not central example of FT TAIS work (also only joined METR very recently)
      5. Hjalmar—hired partway through theoretical CS PhD, never had an FT AIS position I don’t think
      6. Joel—pretty sure manifund grantmaking was not close to a FT position?
      7. Lawrence—was counting
      8. Megan—never had an FT AIS position I don’t think
      9. Nikola—hired out of undergrad
      10. Sydney—wasn’t counting as technical AIS
      11. TKwa—not sure, was this FT position?
      12. Jas—wasn’t counting as technical. Also I don’t think Future House counts as safety.
      13. Kit—wasn’t counting as technical (he has math degree but I think fair to say the longview work is not central TAIS)
      14. Michael—never had FT AIS position I don’t think
      David Rein who you missed I think is actually the clearest example
      
      More than one of these people were at least temporarily unusually low-opp-cost for personal reasons that I don’t want to go into here (similar in spirit to ‘health/location constraints made it hard for them to have other jobs’)
      
      In my mind there’s a big contrast here vs e.g. Ant, which I think has a huge number of people with multiple years experience working on technical AIS.
      E.g., people who I know off top of my head:
      Jon Uesato, Jeff Wu, Jan Leike, Chris Olah, Daniel Ziegler, Sam McCandlish, Jared Kaplan, Catherine Olsson, Amanda Askell, Tom Henighan, Shan Carter, Jan Kirchner, Nat McAleese, Carroll Wainwright, Todor Markov, Dan Mossing, Steven Bills, William Saunders, Danny Hernandez, Dave Orr, Stephen McAleer (all multiple years experience at OAI and/or GDM working on safety teams)
      Evan Hubinger, Sam Bowman, Sam Marks, Fabien Roger, Ethan Perez, Collin Burns, Akbir Khan, Tao Lin, Kshitij Sachan (previously working FT on safety in academia or nonprofits)
      (I expect I’m wrong about ~2 people in those lists)
      
      There are probably a similar number more I’m uncertain about or are non-central examples like the METR ones discussed above.
      - Neel Nanda 1 Mar 2026 20:38 UTC
        9 points
        0
        Parent
        I agree with your assessment here: I don’t think METR has had a significant negative effect on the availability of talent in the technical AGI Safety ecosystem, and Anthropic has had a massive negative one. GDM Safety has probably had a moderate negative one, offset by many people preferring to live in London.
      - Ben Pace 1 Mar 2026 21:01 UTC
        3 points
        0
        Parent
        I see. Yes I think your previous claim was an overstatement.
        I also share Habryka’s perspective, I’ve broadly not been sold on technical talent being vastly more important than non-technical talent since MIRI gave up on trying to actually solve the full alignment problem and Christiano stopped working on alignment theory, and I think that many of the people I listed have much more potential to do things that are good than most of the people you listed at Anthropic; but going into more detail on all that would take more time than seems worth it this afternoon.
        - Beth Barnes 1 Mar 2026 22:51 UTC
          8 points
          0
          Parent
          FWIW I definitely don’t think technical talent is vastly more important, I just assumed that’s the resource that people would most think METR might be a large consumer of given most of our roles are technical roles
  - habryka 1 Mar 2026 19:49 UTC
    9 points
    2
    Parent
    > I was thinking “was working FT on technical AIS before we hired them” more than “was around this space and might have done other AI safety things”—sorry if that was misleading.
    I think technical AI Safety work is among the less valuable kinds of work to do on the margin, so I definitely didn’t intend to constrain talent claims to technical AI safety. Indeed, generalist/entrepreneurial/communications talent seems a lot more valuable to me on the margin.
    That said I agree that METR did not consume as much talent as Anthropic or OpenAI, and indeed many people went to work there to work on RSPs and similar if-then-commitment stuff, which didn’t pan out (and now my guess is they are very unlikely to leave). But METR + Apollo seem like the runners-up right after the labs in terms of where people went to work (and at least at the time largely for if-then-commitment-like reasons).
    What links here?
    - Ben Pace's comment on Responsible Scaling Policy v3 by HoldenKarnofsky (1 Mar 2026 21:01 UTC; 3 points)
    - Beth Barnes 1 Mar 2026 22:43 UTC
      8 points
      2
      Parent
      Hm I think UKAISI at least is a lot larger than METR or Apollo?
      
      If you’re focusing more on generalist/entrepreneurial/communications skillsets then e.g. CG has more of these people than METR, I think?
- Buck 25 Feb 2026 6:45 UTC
  22 points
  6
  Parent
  > Different people went different paths, but most of the ecosystem’s resources went into the latter kind of plan, with the two central pillars being to start and invest in evaluation companies like METR and Apollo (to develop evaluations and capability measurements that could provide the ifs), and work at companies or within governments to develop commitments (the thens) based on the evaluations.
  I think this sort of overstates the proportion of effort that went into that kind of work. There was also a lot of work that aimed to develop techniques that reduce or improve understanding of misalignment risk (e.g. Redwood’s stuff).
  > METR and Apollo and the broader “evals” agenda became the most popular and highest-prestige thing to work on for people in AI safety.
  IMO both METR and Apollo substantially pivoted away from the strategy you’re describing here at least a year ago.
  - habryka 25 Feb 2026 17:05 UTC
    22 points
    10
    Parent
    > I think this sort of overstates the proportion of effort that went into that kind of work. There was also a lot of work that aimed to develop techniques that reduce or improve understanding of misalignment risk (e.g. Redwood’s stuff).
    I think “most” is roughly accurate (like IDK, my sense is around 60% of talent + funding was reallocated to plans of that kind). I agree that other people kept doing different things!
    I do think there aren’t that many places that do work around reducing or understanding misalignment risk, especially outside of the labs (which I am excluding here).
    > IMO both METR and Apollo substantially pivoted away from the strategy you’re describing here at least a year ago.
    I am honestly confused what METR’s current theory of impact is.
    It seems most effort is going into things like the time horizon evaluations, but it’s not super clear how this translates into the world getting better (though I am generally of the school that helping people understand what is going on will make things better, even if you can’t specify exactly how, so I do think it’s good).
    I have been appreciative of METR staff being more public and calling directly for regulations/awareness of the risks, but things still haven’t come together for me in a coherent way. Inasmuch as METR “pivoted”, I am not quite sure what it has pivoted to.
- HoldenKarnofsky 12 Mar 2026 23:53 UTC
  14 points
  −3
  Parent
  FWIW, my interpretation of what we should be learning is pretty different here.
  I would broadly say that political will for anything in the “slow down AI as needed to make it safe” category has been well short of what many people (such as myself) hoped for. Because of this, some of the core founding hopes of the RSP project look untenable now (although I don’t consider the matter totally settled); but to me it feels like an even bigger update away from “pause now” movements.
  I don’t understand why you say this: “the appetite for conditional risk regulation has been substantially less than the appetite for direct risk regulation (or compute thresholding, which doesn’t require any complicated eval infrastructure).” I have not seen roughly any appetite for “compute thresholding” if that means something like “limiting the size of training runs” (I have seen “compute thresholding” in the sense of “reporting requirements triggered by compute thresholds”). I don’t know what you mean by “direct risk regulation”, but if it means regulation aimed at slowing down AI immediately, I also have seen much less (roughly no) appetite/momentum for that, and more for regulation based around things like evals and if-then commitments.
  Separately, with the benefit of hindsight, I think a global AI pause in 2023 would have been bad on the merits compared to, say, a pause around when the original RSP implied a pause should happen. The former, compared to the latter, would have meant losing a lot of opportunities for meaningful alignment research and more broadly for the world to learn important things relevant to AI safety, while having almost no marginal catastrophic risk reduction benefit AFAICT.
  You may have a view like “2023 was the right time to pause, because it was politically tractable then, but postponing it ensured it would not remain politically tractable.” That would be a very different read from mine on the political situation.
  > This is a huge deal! This was, as far as I can tell, the single decision that most affected talent allocation in the whole AI safety community. METR and Apollo and the broader “evals” agenda became the most popular and highest-prestige thing to work on for people in AI safety.
  This seems off to me. First, the emphasis on evals predated the idea of if-then commitments, and I think attracted more resources at pretty much every point in time; evals have a variety of potential benefits that don’t rely on if-then commitments. Second, I don’t think most people who work on AI safety work on either of these.
  - habryka 14 Mar 2026 0:19 UTC
    26 points
    10
    Parent
    > I would broadly say that political will for anything in the “slow down AI as needed to make it safe” category has been well short of what many people (such as myself) hoped for.
    Certainly! I am not saying we are in a world with enormous political buy-in for slowing down AI, though honestly, I think we are kind of close to where my median was in 2023, maybe a bit above (I never expected much buy-in).
    > but to me it feels like an even bigger update away from “pause now” movements.
    Maybe there is some miscommunication here. I personally do not think that a short pause at any point in the last few years would have been useful on its own, and the primary reason why I would have supported one is because it would have made future pauses much more likely (though that of course depends on the implementation details). If you had asked me about ideal timing on when to halt capabilities progress I think I probably would have suggested capability levels roughly where I expect things to be at in early 2027 (which to be clear, 2 years ago I would have predicted to be more than 3 years away). Of course, we are nowhere near coordinating a global pause within the next year, and so the “stop as soon as possible” position seems right to me at this point.
    But to be clear, I wasn’t talking about any kind of pause at all in my comment above, so this is all mostly a distraction from the topic at hand. The central policies that I was referring to when I was talking about “direct risk regulation” are things like:
    - Direct liability for harms caused by AI (which would have functioned as a pretty direct and immediate tax on AI development, conveniently a bit more leveraged on the future and more competent AI systems)
    - Datacenter construction moratoriums
    - Licensing regimes (these are tricky and my guess is would backfire for various reasons, but I do think the proposals I’ve seen were motivated by a “we need to directly intervene on how this technology is being developed right now”)
    We’ve seen pretty specific proposals for at least the first two, with SB 1047 introducing a bunch of direct liability, and Bernie Sanders and other congress-people advocating for datacenter moratoriums (in many cases downstream of IMO confused environmental harm effects, but mostly downstream of a general “AI is scary and bad” sentiment that seems reasonably calibrated to me, in as much as anything as incoherent as whatever is motivating these proposals could be called “reasonably calibrated”).
    The EU AI Act also seems to me to be closer to direct risk regulation than conditional risk regulation in that it directly affects the operation of companies as soon as it takes effect, and involves directly imposing requirements that all frontier model developers will have to adhere to, as opposed to triggering regulation after certain capability or misalignment thresholds are met.^[1]
    This stands in contrast to what I have perceived to be lack of any motion at all on if-then-commitments at either the state level or the US federal level, or even the AI company policy level. I have not heard any policy proposals that even pass a sniff-test for what an if-then-commitment at the policy level would look like, and having talked to many other people in policy about this, my sense is no one has actually figured out how to even start having something like a capability-evaluation (not to even talk about a misalignment-evaluation) hook into a policy-making apparatus, which I understand to be the central premise of what if-then-commitments were trying to be.
    > You may have a view like “2023 was the right time to pause, because it was politically tractable then, but postponing it ensured it would not remain politically tractable.” That would be a very different read from mine on the political situation.
    Just to reiterate, none of the things I am saying here have much to do with pausing. I think the top priority of regulation was always to slow things down, ideally incrementally, until you have slowed things down so much that you are de-facto pausing. At no point in history did it seem feasible to me to coordinate a sudden pause in 2023, and while I think humanity would have been better off pausing then (by putting us into a much better position to pause in the future), I am absolutely not comparing “if-then-commitments” to “try to make a pause happen in 2023”, which strikes me as a weird strawman, and IMO would have been a waste of time for people to spend much of their efforts trying to make happen somehow.
    Datacenter moratoriums, GPU taxation, extensive auditing requirements, partial nationalization, direct liability, GPU import tariffs, or any of the hundreds of tools that governments around the world have available to slow down AI progress are the kinds of things I mean, with the central measure of the success of the regulation being “how much are you successfully reducing AI capability growth rates right now, which are already clearly too high, and to what degree are you putting yourself into a position to reduce AI capability growth rates in the future”.
    But even beyond that, I think the key thing that people in policy should be doing, and were largely not doing in 2024/2025 due to a focus on evals, if-then-commitments and attempts to influence government policy by focusing on frontier company internal policies, is to directly talk to policymakers and make the case for existential risk from AI. The conclusion that almost any policymaker who I’ve seen seriously grapple with this topic arrives at is that it is paramount to prevent the creation of artificial superintelligence. After that basic case is established, policymakers have many opinions and much motivation for many regulations that could achieve that. In the long run this does require international treaties and diplomacy, but there are many things we can do in the near term, like slowing down GPU investment, or various forms of indirect taxation of frontier companies in the form of fines or liability or whatever.
    > This seems off to me. First, the emphasis on evals predated the idea of if-then commitments, and I think attracted more resources at pretty much every point in time;
    I agree that emphasis on evals at e.g. METR predated the focus on RSPs and if-then-commitments, but that was also (if I remember correctly) before evals became one of the hottest things to work on in the broader AI safety ecosystem. My sense is the transition to “evals are the hot thing to work on” coincides closely with the transition to “RSPs are the hot thing to work on”, because indeed, the case for the two was pretty closely entwined. I am not overwhelmingly confident of this, but I remember many conversations with people who were thinking of joining METR and Apollo, and doing grant evaluations of both of those organizations, and in those conversations the RSP case seemed central to me.
    After if-then-commitments and RSPs showed themselves as an unpromising direction for policy-interventions, focus among eval orgs (including METR) shifted back towards using evals to inform policymakers and the public about risk, but much of the talent and funding that flowed into those organizations was originally motivated by the RSP/if-then-commitment case (of course, the fact that there was a nearby BATNA if that plan doesn’t work out certainly motivated people to work on it, and I certainly am glad that people considered how gracefully this kind of plan would fail when they decided to join METR and other eval orgs).
    > Second, I don’t think most people who work on AI safety work on either of these.
    My understanding is that most safety efforts at Anthropic were oriented around the RSP in 2024/2025, and my current model is that almost 50% of total safety talent in the ecosystem work at Anthropic. Beyond that, RSP development was (I think) the primary focus of the safety teams that did exist at both OpenAI and Google DeepMind in at least 2024. So I do think that most people who seriously work on AI safety worked on if-then-commitments and RSPs, or at the very least had their work prioritized centrally downstream of efforts to bring about RSPs and if-then-commitments (but I think the stronger thing holds where I do think the majority of the field’s efforts went into trying to make RSPs and if-then-commitments happen, not only that their work was structured by RSPs and if-then-commitments).
    Of course this is conditional on not including all generic post-training in “AI safety work” which, at least in terms of raw headcount, far exceeds the efforts going into the rest of AI safety. Work on post-training has certainly recruited away many previous high-quality contributors to AI safety, and is sometimes labeled “AI safety” but I think in most cases has little to do with what we are talking about. You might disagree, but I didn’t mean to imply that efforts into evals or if-then-commitments eclipsed broader post-training efforts.
    ^[1] With the exception of compute thresholds, which I think are the only thresholds that ever had much of any shot at being used as the basis for policy, and which I remember as being explicitly contrasted with RSPs and if-then-commitments: the latter were presented as a way of doing regulation that is more sensitive to the risks, on the argument that compute-threshold-based regulation would end up being too restrictive and therefore could not get buy-in.
    - HoldenKarnofsky 10 Apr 2026 4:21 UTC
      8 points
      3
      Parent
      Hm. I had thought you were pointing to something like “There isn’t actually going to be a pause in this environment triggered by if-then commitments” as the main update/vindication of interest; I was basically responding by pointing out that there also isn’t going to be (and IMO there was never a promising path to) a pause in this environment triggered by advocacy for immediate pausing.
      
      Instead it seems like you’re doing something more like “comparing the overall impact of talking about pauses—or, more broadly, existential risks from AI—with the overall impact of talking about if-then commitments.” I think this is a much muddier comparison where there is less clearly a big update to be had.
      I don’t think we have seen much traction on attempts to slow down AI in any way. Meanwhile, I do think that the framework of “test for dangerous capabilities and implement commensurate mitigations” has had quite a significant impact on company behavior in a way that does seem to set up many policy possibilities that would otherwise be rough (including much of what has already passed).
      The comparison to “raise general awareness about risks of AI” (as opposed to “advocate for specific policies explicitly aimed at slowing down AI”) feels a bit harder to make—certainly I am, and long have been, positively disposed toward raising general awareness about risks of AI.
      
      But I will probably leave that there as it seems like a pretty complex and tricky debate to have.
      > My understanding is that most safety efforts at Anthropic were oriented around the RSP in 2024/2025, and my current model is that almost 50% of total safety talent in the ecosystem work at Anthropic.
      
      This doesn’t sound remotely right to me. I would say that the RSP has provided an organizing framework for a lot of safety work, but that’s different from something like “all of that safety work would make no sense if not for the RSP” or something.
      - habryka 10 Apr 2026 7:09 UTC
        4 points
        0
        Parent
        > But I will probably leave that there as it seems like a pretty complex and tricky debate to have.
        Seems reasonable. I’ll leave a few quick clarifications.
        > Instead it seems like you’re doing something more like “comparing the overall impact of talking about pauses—or, more broadly, existential risks from AI—with the overall impact of talking about if-then commitments.” I think this is a much muddier comparison where there is less clearly a big update to be had.
        The thing that I am comparing is “resources invested into advocating for direct regulation, and actions that would directly slow down AI development” vs. “resources invested into getting companies to adopt RSPs and get policies around RSPs passed”.
        > I don’t think we have seen much traction on attempts to slow down AI in any way. Meanwhile, I do think that the framework of “test for dangerous capabilities and implement commensurate mitigations” has had quite a significant impact on company behavior in a way that does seem to set up many policy possibilities that would otherwise be rough (including much of what has already passed).
        I think traction has been not great, but also not terrible. Honestly, I would have been confused if many very concrete useful things had passed by now, since buy-in takes a while to build, but the things that do seem to show motion seem not very RSP-flavored. I do currently think it’s pretty unlikely that regulation that does get passed has much of any grounding in if-then commitments or RSPs, and I am not sure what you are talking about with “set up many policy possibilities that would otherwise be rough”.
        > certainly I am, and long have been, positively disposed toward raising general awareness about risks of AI.
        I agree and am deeply grateful for your work in the space, and your support of work in the space.
        > This doesn’t sound remotely right to me. I would say that the RSP has provided an organizing framework for a lot of safety work, but that’s different from something like “all of that safety work would make no sense if not for the RSP” or something.
        Hmm, I agree this comparison is tricky, and on reflection I think I overstated the ratios here. The RSP has been responsible for quite a lot of safety-adjacent work (including a lot of effort spent on cyber-security, various comms efforts, and the prioritization of various mitigations), but I agree that most of the safety-adjacent work at Anthropic is driven more by other risk models (and is IMO mostly downstream of beliefs about what the tractable parts of the general AI alignment problem are, and which aspects of alignment-oriented work are most helpful for commercialization). I think the RSP prioritization is probably responsible for something like 15%-20% of the work at Anthropic.
        HoldenKarnofsky 16 Apr 2026 16:56 UTC
        4 points
        0
        Parent
        > The thing that I am comparing is “resources invested into advocating for direct regulation, and actions that would directly slow down AI development” vs. “resources invested into getting companies to adopt RSPs and get policies around RSPs passed”.
        
        This feels like an unfair comparison in that one of these things is much more specific/narrow than the other.
        
        I think fairer comparisons would be:
        
        1. A comparison of two specific/narrow goals along the lines of “Enact policy explicitly aimed at doing X.” Perhaps (a) “Advocating for pausing or stopping AI development, or perhaps for directly slowing it down via deliberate, explicit attempts to slow it down (not e.g. regulation largely or even putatively motivated by something else that happens to increase frictions to AI development)” vs. (b) “advocating for companies to adopt RSPs and get policies around RSPs passed.” I think SB53 and the EU Code of Practice have significant elements that look like the latter, whereas I’d guess we agree the former has gotten nowhere.
        
        2. A comparison of two broad approaches to spreading general messages or frameworks with many possible policy implications. Perhaps (a) “Advocating for direct regulation [which I assume refers to the sort of examples you gave before], and actions that would directly slow down AI development” vs. (b) “Advocating for the general paradigm of doing evaluations to assess current danger, implementing mitigations to reduce current danger, and having some sort of transparency and accountability around these activities.” Here too I think we’ve seen quite a bit more motion from the latter, both from legislation and via voluntary actions from companies, though perhaps you could frame a bit of this as the former, and if you have a certain picture of the risk, you may consider all of the motion on the latter to be worthless.
        
        > I agree and am deeply grateful for your work in the space, and your support of work in the space.
        
        I appreciate your saying so.
- otto.barten 27 Feb 2026 12:05 UTC
  5 points
  0
  Parent
  > the appetite for conditional risk regulation has been substantially less than the appetite for direct risk regulation
  Where do you see the latter appetite?
  We campaigned a bit for a conditional treaty. We’d happily sign up for an unconditional pause though. Problem is: there is no appetite for either, right?
  I agree that the manpower spent on evals should have been spent on other things with a better theory of change. Eval quality imo is not a crux for regulation; awareness and political support are. I think the money that went to evals should have gone to raising awareness and lobbying.
  Honestly, why is there still no significant funding for awareness-raising projects? It’s so easy: just ask for the number of views/copies and conversion rates measured via e.g. Prolific surveys, and fund the most effective projects. A fund like this can easily absorb millions. I think this might actually get regulation off the ground.
  - habryka 27 Feb 2026 14:04 UTC
    3 points
    0
    Parent
    > Where do you see the latter appetite?
    Since very little frontier regulation has passed, the best evidence we have is what various people in Congress who are concerned about risk have said, and there I’ve seen many more direct risk regulation proposals than conditional risk regulation proposals.
    Also, inasmuch as we had any success, establishing direct liability via something like SB1047 was the closest we got, and that didn’t have any dependence on evals.