Responsible Scaling Policy v3
All views are my own, not Anthropic’s. This post assumes Anthropic’s announcement of RSP v3.0 as background.
Today, Anthropic released its Responsible Scaling Policy 3.0. The official announcement discusses the high-level thinking behind it. This is a more detailed post giving my own takes on the update.
First, the big picture:
I expect some people will be upset about the move away from a “hard commitments”/“binding ourselves to the mast” vibe. (Anthropic has always had the ability to revise the RSP, and we’ve always had language in there specifically flagging that we might revise away key commitments in a situation where other AI developers aren’t adhering to similar commitments. But it’s been easy to get the impression that the RSP is “binding ourselves to the mast” and committing to unilaterally pause AI development and deployment under some conditions, and Anthropic bears responsibility for creating that impression.)
I take significant responsibility for this change: I have been pushing for it for about a year now, and have led the way in developing the new RSP. I am in favor of nearly everything about the changes we’re making. I am excited about the Roadmap, the Risk Reports, the move toward external review, and the unwinding of some of the old requirements that I felt were distorting our safety efforts (more on all of this below).
I think these changes are the right thing for reducing AI risk, both from Anthropic and from other companies if they make similar changes (as I hope they do).
In my mind, this revision isn’t being prompted by “catastrophic risk from today’s AI systems is now high” (I don’t think it is), or by “We’ve just realized that sufficient regulation isn’t looking likely” (I think this is not a very recent update). First and foremost, in my mind, it is about learning from design flaws and making improvements.
I always thought of the original RSP as a “v1” that would be iterated on, and have been frustrated to see the extent to which it’s been interpreted as a “sacred cow” or “binding oneself to the mast” such that revisions that go in a “less self-binding” direction are seen by many as inherently dishonorable. I would’ve pushed for a very different (and far less ambitious) initial design if I’d thought this way about future changes.
I generally think it’s bad to create an environment that encourages people to be afraid of making mistakes, afraid of admitting mistakes, and reluctant to change things that aren’t working. I think that dynamic currently applies somewhat to RSP-like policies, and I hope that changes.
I’m not saying that I wish people shrugged off every revision with “Hey, it’s your policy, do what you want.” I wish people simply evaluated whether the changes seem good on the merits, without starting from a strong presumption that the mere fact of changes is either a bad thing or a fine thing. It should be hard to change good policies for bad reasons, not hard to change all policies for any reason.
I think a lot of people have a mentality like “I worry that AI companies will do the wrong thing when it comes down to it, and I’m looking for them to bind their future actions as rigidly and tightly as possible; policies that are vaguer and more flexible just leave more room for motivated reasoning.” I think that is fair as far as it goes, but there’s a flipside: the world (especially the world of AI) changes fast, and binding commitments about the future can leave you bound to things that aren’t actually good for safety, and can make it hard to adapt to the situation as it actually stands and allocate resources effectively—a dynamic that risks poor prioritization, goodharting, and general annoyance and backlash that I think one should care about quite a lot. I’ve always been aware of the latter issue and tried to be careful about what kinds of commitments are worth making, but I’ve updated toward thinking that binding commitments are harder to get right than I’d thought, while the kinds of policies that look frustratingly vague to many people can actually have a huge impact in the right context.
I think a common attitude toward RSP v3.0 is “I reluctantly see why this is necessary, and I ultimately agree that the changes are needed and reasonable, but I’m sad about it.” That’s not my attitude. I am sad that we’re in the political environment we’re in, but taking this not-terribly-recent situation as a given, I am affirmatively excited about the new RSP. I think the old RSP did some things really well, but also had some perverse effects that don’t seem widely understood and risked growing a lot as AI systems become more capable (more below). I think the new one is better and will ultimately be a more effective force for risk reduction. (I don’t think either could keep risks bound to low levels on a voluntary basis.) I’m sure there are problems with this RSP too, and at some point there will likely be an RSP 4.0 with more changes, but I do think we’re learning about what works and doesn’t work, and these policies will hopefully have growing positives and shrinking negatives—if we can treat them as works in progress rather than sacred cows.
I think my viewpoint is probably easiest to understand via my story about the evolution and impacts (good and bad) of RSPs since the beginning. This will cover what the original goals were, why we approached them the way we did initially, what went well with that approach, and what didn’t, which motivates the changes we’re making now.
After that is an FAQ section.
How it started: the original goals of RSPs
In 2023, I collaborated with METR to develop and pitch the basic idea of Responsible Scaling Policies. What we were trying to do is pretty well captured in the first blog post METR wrote about them. Today I’d summarize the goals roughly as follows (please read this as reporting my own thinking on the goals of RSPs, rather than METR’s or anyone else’s):
Goal 1: create forcing functions for AI developers to move with urgency and focus on risk mitigations. The idea was: if a company has a policy saying it isn’t safe to train an AI model with X level of capabilities unless Y risk mitigations are in place, then hopefully the company is going to try very hard to get those risk mitigations in place. This doesn’t rely on commitments being ironclad, only on something like “it would be embarrassing to fall short of a standard the company has said it is trying to hit and associates with safe AI development, and companies will work to avoid that kind of embarrassment.”
Goal 2: create a testbed for practices and policies that can feed into policy frameworks. The idea was: if many major AI developers have adopted risk mitigation Y, and/or agreed that risk mitigation Y is important for safety, then this will make it easier (both politically and practically) for regulation to require or nudge risk mitigation Y, or gate AI development or deployment on it. Again, this doesn’t rely on commitments being ironclad—any given risk mitigation that is widely practiced and/or publicly supported by industry at the moment can have a policy impact.
(To be a bit more specific: I hoped that if political will for AI risk reduction ended up strong, then voluntary practices and policies by companies would be taken as a “floor” for regulation, whereas if political will for AI risk reduction ended up weak, these might be the best we could get.)
Goal 3: work toward consensus and common knowledge about AI risks and potential mitigations. There was already a lot of interest in evaluating AI systems for dual-use/dangerous capabilities, but we hoped that Responsible Scaling Policies would increase efforts to tie capabilities to threat models, and generally improve the level of common knowledge about whether AI systems were becoming dangerous.
Not a core goal: bring about a substantial, voluntary (non-regulation-backed) pause in AI development. At the time, many people seemed to assume that this was a goal of ours, and criticized the RSP effort on grounds that it seemed unrealistic. While I don’t think it was or is totally implausible that a small number of AI developers could (if they were far ahead of all others) slow down AI development by some amount due to policies like this, it was never a major part of the hope, nor was it necessary to achieve the other goals listed above.
Escape clauses. METR’s intro post on Responsible Scaling Policies included this:
What if RSPs slow down the companies that adopt them, but others rush forward?
One way RSPs could fail to reduce risk — or even increase it — would be if they resulted in the following dynamic: “Cautious AI developers end up slowing down in order to avoid risks, while incautious AI developers move forward as fast as they can.”
Developers can reduce this risk by writing flexibility into their RSPs, along these lines:
If they believe that risks from other actors’ continued scaling are unacceptably high, and they have exhausted other avenues for preventing these risks, including advocating intensively for regulatory action, then in some scenarios they may continue scaling themselves — while continuing to work with states or other authorities to take immediate actions to limit scaling that would affect all AI developers (including themselves).
In this case, they should be explicit — with employees, with their board, and with state authorities — that they are invoking this clause and that their scaling is no longer safe. They should be clear that there are immediate (not future hypothetical) catastrophic risks from AI systems, including their own, and they should be accountable for the decision to proceed.
RSPs with this kind of flexibility would still require rigorous testing for dangerous capabilities. They would still call for prioritizing protective measures (to avoid having to explicitly move forward with dangerous AI). And they’d still be a first step toward stricter evals-based rules and norms (as discussed in the previous section).
In hindsight, I think this language overestimated how well capability evaluations would inform the world about risks from AI, underestimated the “grey zone” problem described here, and/or generally overestimated the level of political will and appetite for an “assume AI is dangerous until proven safe” attitude among policymakers.[1] For example: In May 2025, Anthropic activated ASL-3 protections because it felt it could no longer make a good enough case that the relevant risk was low—but over nine months later and despite significant effort including a well-resourced randomized controlled trial (results forthcoming), we still lack compelling evidence that it is high either. So I think it was a mistake to imagine that there was a single “risk line” we’d cross, early enough to prevent companies from imposing significant risk but late enough that intensive advocacy for global action would have realistic prospects of succeeding.
If I were writing this today, I’d tone down the language about “unacceptably high” / “immediate risk,” but I’d still believe:
It’s not good, and not part of the model of RSPs as I see (and have seen) it, to get responsible actors to unilaterally slow down when others won’t.
Even without unilateral slowdowns—and even without the “dire” language above—RSPs can do a lot in the service of Goals 1-3 above, simply via articulating what the company believes it would take to make risks low.
How it’s going: the good and the bad
About a year after Anthropic first adopted its RSP, I started to spend a lot of time advising on the company’s efforts to execute on it and improve it over time, and I went to Anthropic full-time in January of 2025 with the RSP as my primary focus. Since then, I’ve developed a number of opinions about what’s good and bad about the old RSP.
A note on my general orientation toward this topic
Before I get into what I think has gone well and poorly, I think it’s worth covering a high-level difference in how I approach this basic topic vs. some others.
I think some people have a picture that is roughly like this (with apologies for oversimplification): “The more strict and uncompromising the commitments made by companies, the better. We aren’t trying to ‘balance business goals against risk reduction’ here, because we don’t value business goals. We don’t trust companies to do the right thing, so the more good-seeming actions get set in stone at any given time, the better.”
That’s not how I see it. I think we should be trying to get the most risk reduction we can get per unit of business pain, or something like that—not because I intrinsically value business goals, but because I think a “balancing” attitude is ultimately better for risk reduction.
If RSPs end up pushing for risk mitigations that take huge amounts of work and/or impose huge business costs for modest safety benefit, then they risk:
Hurting the flexibility of genuinely safety-oriented companies, or safety-oriented people working within companies (who often do a lot of the work RSPs create, and risk being diverted from what should be their top priorities to RSP compliance).
Leading to “goodharting”: when a risk mitigation seems broadly unreasonable to the people involved in implementing it, they may try hard to meet its letter rather than its spirit.
Becoming a target for backlash, making a broad set of people dislike RSPs and safety generally.
If there were strong and broad political will for treating AI like nuclear power and slowing it down arbitrarily much to keep risks low, the situation might be different. But that isn’t the world we’re in now, and I fear that “overreaching” can be costly.
With this in mind, I tend to think that any kind of rigid commitment in a policy like this is a double-edged sword:
If one succeeds at anticipating the future well enough to make the right commitment, this has the benefit of making it “harder to do the wrong thing” later.
If one doesn’t, this can make it harder to do the right thing, causing the problems above.
The lower our confidence in being able to articulate a robustly good commitment, the more we should worry that commitments will end up either pushing companies toward the wrong actions, or pushing them toward actions that would be right in a vacuum (if all anyone cared about was risk reduction) but present the kind of disproportionate tradeoff (big “pain” on the commercial side for small benefit on the risk reduction side) that can lead to backlash and loss of credibility.
It’s also worth noting that my goal is generally to find ways to reduce risk on the margin. I don’t model our situation via a “logistic success curve” such that we either implement dramatically better safety measures than anyone is on track for today, or are ~assured of disaster. My hope for RSPs has been, and is, that they will increase and improve risk mitigation efforts, including in incremental ways. (My most recent 80k interview elaborates on this viewpoint.)
Goal 1: forcing functions for improved risk mitigations
A partial success story: robustness to jailbreaks for particular uses of concern, in line with the ASL-3 deployment standard
I believe that Anthropic has achieved a level of robustness to jailbreaks of its models (for particular uses of concern, targeted by extra-robust classifiers) that it would not have achieved in the absence of something very much like the RSP.
While the potential harm here (AI assisting novices in producing chemical and biological weapons) isn’t in the maximal-stakes category that I expect this post’s readers to care about most, I think some of the muscles we’ve been building to achieve this are broadly useful for a future in which we might try to achieve robust defenses against misuse. More broadly, I certainly think this case is a proof of concept that the RSP can drive a company to accomplish things it wouldn’t otherwise accomplish on the risk mitigation front.
Some of this is about the development of Constitutional Classifiers, which were prioritized by researchers largely because the RSP created pressure to achieve robustness to jailbreaks.[2] But a lot of it is more about achieving common knowledge and coordination within the company. Achieving even moderate robustness against jailbreaks requires a lot of things to happen that touch many different parts of the company, each with different reporting lines, personnel, priorities, beliefs, etc. Among other things, it requires (a) integrating the relevant classifier guards into production systems; (b) trying to make these systems run as smoothly and cheaply as possible while still running sufficiently strong classifier guards; (c) continuing to update how production systems function as the classifier guards evolve; (d) dealing with customer feedback when classifier guards cause pain (e.g., via false positives); (e) adjudicating customer requests for exemptions from being blocked by these classifiers; (f) working with partners who serve our models on other platforms to ensure those models are following the same basic rules and running the same classifier guards; etc.
In many cases, it’s necessary to get contributions, help and buy-in from people who are busy and have other top priorities; who have their own impressions of what’s dangerous and what isn’t; etc.
The RSP serves as a clear statement of Anthropic’s goal of achieving high robustness to jailbreaks. The Responsible Scaling Officer is looped in on the many decisions across the company that may affect the ability to achieve this goal, and has regular meetings with people working in many different departments to coordinate the different aspects of it. As an RSP advisor reporting to the Responsible Scaling Officer, I was brought in to advise on many questions raised by different teams about what we should be doing to make sure we were meeting the high-level commitments associated with ASL-3.[3]
I feel pretty uncertain about whether the goal of achieving robustness to jailbreaks for these particular uses of concern was worth all of the energy and prioritization it got. But it certainly got a lot of those things, and I think we at least have a proof of concept that the RSP can get a company to prioritize and execute on things that otherwise would likely meet a fate like “Some people prioritize this, some don’t, and the result is low robustness since a lot of things have to go right to achieve high robustness.” This seems, at least, like the sort of tool that has a lot of potential for reducing extreme risks from AI, if it can be harnessed well.
A mixed success/failure story: impact on information security
I think the RSP galvanized a concerted effort to achieve “ASL-3 security” across the company. In practice, this ended up meaning a large focus on a handful of particular security measures that were judged to be especially important or promising for this goal—especially egress bandwidth controls (described here).
I think this led to more aggressive capacity building for security, more of a focus on protecting model weights, and certainly more effort on egress bandwidth controls. I think the increased capacity building was good, but the other two were mixed.
It isn’t obvious to me that model weight security is the best aspect of security to prioritize (I discuss this a bit more on my 80k podcast). (I also take responsibility for emphasizing this aspect of security in the past. I still think it is extremely important, but I have a better sense today of just how difficult it is to achieve model weight security against the strongest attackers, and this affects what I think is most worth prioritizing, as discussed in the podcast.)
It seems to me that implementing egress bandwidth controls was a pretty good bet and reasonable use of energy, but there is room for debate here.
I worry that prioritizing these things may have caused us to underinvest in the “unsexy” side of security, which you can get a sense for from the security discussion in our new Frontier Safety Roadmap. There are a lot of little things we should improve to be more generically secure across the board. Many of them apply to many threat models, and aren’t super specific to model weights.
I also think there was confusion about what exactly we were trying to do w/r/t the “ASL-3” security standard.
Some people believed we were aiming for a state where even sophisticated technical Anthropic employees with authorized access to model weights wouldn’t be able to steal them. I think this would have been an unrealistic goal for the amount of time we ended up having (more below on whether pausing AI development to meet such a goal would have made sense).
Some people believed we were going for a lower bar than that, specifically that we weren’t aiming for robustness to sophisticated insiders. I think we probably meet this lower bar even without things like egress bandwidth controls.
ASL-4 and ASL-5 prep: the wrong incentives
The previous RSP didn’t give a lot of detail about the ASL-4 and ASL-5 standards, but I think it was generally understood to imply that we would need to protect our model weights from attacks by state-backed programs—and perhaps achieve jailbreak robustness against these as well—when we reach AI capabilities associated with “CBRN-4” and “AI R&D-5.” The company’s leadership expects a reasonable probability of capabilities that would likely cross these thresholds within the next 2 years.
I don’t believe there is a plausible path to achieving that kind of robustness on that kind of time frame, except by either pausing AI development (potentially for years), or prioritizing security to such a degree that it has a similar effect (e.g., deploying models only in very limited settings).
I don’t think slowing down like this would be a good idea for Anthropic on the merits.
Perhaps it would be, if we could be assured that the rest of the AI ecosystem would behave similarly. But I don’t think that’s plausible. I don’t think of the race to powerful AI as a coordination problem, and I don’t think of improving such coordination as a key goal of RSPs (I discuss this a bit more on my 80k podcast).
Perhaps it would be, if our slowdown caused the rest of the world to “take notice” and move toward safety-preserving policies. The problem with this is the “grey zone” between not being able to make a good case that risks are low and being able to make a broadly compelling case that risks are high. This is a potentially vast gap, as discussed above, and I expect to be in the former situation for a long time before entering the latter situation. I don’t think a unilateral slowdown would necessarily be effective in such a situation; it seems more likely that it would be counterproductive and mostly be seen as crying wolf.
To be clear, I think it is possible that there will be a future situation in which unilaterally pausing frontier AI development in order to help “sound the alarm” is our best option. But it depends heavily on a lot of things such as the state of evidence, the state of the political landscape, etc., and there are many imaginable worlds in which doing this would be a bad move. (More below.) So I don’t think it makes sense to be committed to this course of action in the way that the old RSP implies.
Here is what I was seeing as Anthropic tried to work through what ASL-4 preparation would look like:
The urgency and focus of the path to ASL-3 was no longer present. The goals associated with ASL-4 were daunting and abstract. It felt like our basic choices were either to throw “Hail Marys”—try for things that might result in highly robust risk mitigations, but would probably just waste a lot of resources and energy—or to simply work to improve security incrementally, hoping that we wouldn’t have to meet the requirements very soon (either thanks to a regulation-backed pause or thanks to slow progress in AI capabilities).
It also felt like our risk assessment was subject to distortive pressures. We knew that if we declared a model to cross the CBRN-4 or AI R&D-5 line, this could be extremely damaging to the company (in that our RSP would then require a unilateral pause or slowdown in AI development and deployment), while having little discernible public benefit (see above). It seemed to me that there was an enormous amount of pressure to declare our systems to lack relevant capabilities, to declare our risk mitigations to be on track to be strong enough, etc. I don’t think we have actually made unreasonable calls, but I have felt the pressure and wish we weren’t in that world.
Overall, it felt to me that the requirements people perceived the RSP to impose on us here were unreasonable on the merits (on a “unilateral action” basis), detrimental to our risk assessment and broader epistemic environment, and detrimental to our ability to make reasonable plans for making our risk mitigations as good as we could make them.
I want companies like Anthropic to reduce risk as much as we can with the time, technical landscape, and resources we have. I felt the RSP as it was drafted was becoming less and less compatible with this. And I want safety measures such as the RSP to avoid imposing large costs for modest risk reduction benefits, as discussed above.
When forcing functions do and don’t work well
Looking for patterns in the above, I’ve come to the view that the best kind of forcing function is one that sets an ambitious but achievable target.
The goal of achieving robust protections against jailbreaks was ambitious, but (at least, it turned out) achievable. The goal of achieving “ASL-3 security” had different interpretations, one of which was achievable but not very ambitious, and another of which was ambitious but not (IMO) reasonably achievable. The goals associated with CBRN-4 and AI R&D-5 were not close to achievable.
These goals hadn’t been set based on ambition + achievability—they had been set based on abstract views of what risk mitigations would make risk low for the threat models in question. Speaking for myself, the reasons I had been part of this approach to setting goals were:
At the time of first promoting RSPs, I collaborated with METR on proposals to make commitments along these lines.
I figured that (a) companies knew better than I did what was achievable and would push back on what wasn’t; (b) what seemed unachievable might turn out achievable, given uncertainty about timelines to powerful AI and the rapid pace of progress in the industry; (c) to the extent some things seemed robustly unachievable (I did think this about state-actor-proof model weight security assuming very short timelines to transformative AI), it was still good to start with a consensus that they were desirable, start working toward them, and eventually “figure something out” if we ended up with short AI timelines and no regulation that could enforce the RSP vision unilaterally.
What I didn’t expect was that RSPs (at least in Anthropic’s case) would come to be seen as hard unilateral commitments (“escape clauses” notwithstanding) that would be very difficult to iterate on.
An additional challenge of forcing functions is that of setting robustly beneficial goals. It sounds great to achieve high model weight security, but is this more important than the many other things that can be done with that energy, such as a less impressive but broader level of security against many threats? Being highly robust to jailbreaks on a limited set of topics is good, but is it better than being somewhat robust to jailbreaks on a wider set of topics?
I think that publicly setting ambitious, achievable, and robustly beneficial goals can provide powerful forcing functions to get big risk-reducing things done. But when goals aren’t achievable or aren’t robustly beneficial, they can distort prioritization and risk assessment and end up being forces for harm.
As discussed below, RSP v3 has involved a lot more attention to how to make commitments ambitious, achievable, and robustly beneficial. I don’t think this is compatible with the old RSP’s approach of deriving our goals from the abstract question of how to keep absolute risk at low levels.
Goal 2 (testbed for practices and policies that can feed into regulation)
This section will be shorter, since I’ve covered a lot of the key dynamics above.
“Have an RSP-ish policy” seems like a substantial chunk of the content of a lot of the most promising regulation that has passed, or gotten close to passing.
However, I don’t think we’ve gotten close to regulation that requires specific ambitious risk mitigations, pausing under certain circumstances, etc.
I think of this as a special case of a broader principle that seems true in this political environment (though not necessarily durably true): it will be much easier for regulation to require practices that AI developers already carry out than practices that no one yet carries out. My observations from policy discussions suggest that relevant policy actors care enormously about what safety-related practices are feasible without unduly hurting the progress and speed of the AI industry.
This may change, and as such, I consider it important that RSPs continue to articulate recommendations for industry-wide safety, not just list the practices that an AI developer can commit to unilaterally.
But I think RSPs could do much more to improve policy by getting companies to do specific good things for risk reduction, and I believe there are many such things that are possible. I still want RSPs to lay out ambitious industry-wide recommendations, but I also want them to be better set up than they are now to get tangible risk mitigations in place at frontier AI developers.
Goal 3 (working toward consensus and common knowledge about AI risks and potential mitigations)
My sense is that RSPs have had some success here. Improvements by AI systems on capability evaluations have been met with acknowledgements of greater risk and commensurate increases in safeguards investment, not just with new capability evaluations. RSPs have also led, at least in Anthropic’s case, to more published content on risk assessment, which I think has been valuable internally as well as externally.
When Anthropic activated ASL-3 protections, it wanted to have a writeup for both its board and the public explaining the basis on which it believed it met the ASL-3 standard. This led to the ASL-3 safeguards report.
I worked on this report and thought it was a helpful exercise. Just the act of writing everything up required gathering information from a lot of sources and putting it all in one place, and it felt like it put us in a much better position to understand where our biggest weaknesses still were and what the most load-bearing parts of our risk reduction case were. Since then, I’ve seen many decisions about when to grant classifier guard exemptions, change how classifier guards work, etc. People have often raised a point along the lines of: “If we change this, we’ll have to explain why in our next update on that report.” This seems like a nice dynamic that pressures us to generally avoid risk-increasing changes, without completely tying our hands in any individual case when something we’ve been doing no longer seems best for balancing risk reduction with business needs.[4]
RSP v3’s attempt to amplify the good and reduce the bad
The old RSP’s mapping between “capability thresholds” and “required safeguards” is, in some sense, trying to do three things at once:
It’s laying out a set of recommendations for what it would take to make AI safe worldwide.
It’s also creating a framework for public risk assessment, e.g. in the ASL-3 safeguards report.
It’s also trying to create a forcing function for achieving risk mitigations with urgency and focus. (Since it is not only a set of recommendations for industry-wide safety, but also a plan for an individual company.)
In working on RSP v3, I tried to separate these three things and do a better job of each.
#1 is now addressed by the “recommendations for industry-wide safety” section of the RSP. It is now written explicitly as a set of industry-wide recommendations rather than as a single company’s plan. The move away from implied unilateral commitments to “pause AI development/deployment as needed to keep risks low” is the biggest change of RSP v3; as discussed above, I don’t think unilateral commitments were ever a good idea, and I wish that the original RSP had not given as strong an impression as it did that it represented unilateral commitments.
#2 is now addressed by Risk Reports. These have various new features that I think are positive, such as putting all of the company’s models (including internal-only models) in scope for risk assessment; pulling together multiple risk-relevant pieces of content in one place; and moving toward external review (the last of these seems both potentially very high-impact and quite experimental at this point, and we’ll probably have to iterate on the idea).
Another point of improvement here, IMO, is that it will be easier to be honest in Risk Reports when we haven’t set up excessive/unreasonable self-imposed consequences for coming to certain conclusions. Rather than a “safety case” whose premise for existence is that we have met a particular standard, a Risk Report is simply supposed to characterize the current level of risk, whatever it is.
#3 from the above list is now addressed by our Roadmap. I think this is a big improvement.
When working on our Roadmap, we put significant time and energy into finding a balance between ambition and achievability. For everything we wanted to do, we asked questions like “Who needs to be involved for this to work, and do they think it’s doable?” and “What are some worlds in which we might wish we hadn’t committed to this, because some other use of resources/energy is better for risk reduction?”
I think the result looks more like something that (a) reflects an effort to find the best combination of importance+tractability for our risk reduction efforts; (b) can serve as a true forcing function rather than as an abstract, intimidating ideal.
There’s been a fair amount of discussion (particularly within the company) about whether the new Roadmap will be as powerful as the old commitments, since we’ve dropped the idea that we would delay AI development and deployment as needed to hit our risk mitigation targets. While this is very much an open question, I feel relatively optimistic about the delta:
It doesn’t seem to me that the old RSP actually made people sacrifice all other objectives in favor of the risk mitigation goals (nor do I wish it had). I think most of its “forcing function” juice has come from making it common knowledge at the company that goal X is a publicly declared goal, supported by leadership, such that any action or inaction that endangers hitting that goal hits friction and potential escalation.
I think something like “the company will be embarrassed, leadership will be annoyed and performance reviews will reflect this” still provides a strong incentive to get things done. Not as strong as “the company may have to stop training new models,” but I also think there are drawbacks to overly strong and rigid incentives, discussed above. I fear that it’s easy to set a risk reduction goal that sounds good at the time, but turns out looking a lot dicier later (more costly than expected, less promising than other goals, etc.).
The new RSP strikes what I see as a more reasonable balance between flexibility and rigidity here: “These roadmaps are subject to change. Some changes may simply reflect our evolving understanding of how best to mitigate key risks. However, we will strive to avoid situations where we revise the goals in a less ambitious direction because we simply can’t execute.”
Even if the forcing function becomes weaker due to a less stringent “or else,” I think it will also become stronger due to the goals being more specific and realistic than the goals laid out by the old RSP. One analogy: if I’m running a 5K, I get a faster overall time if I’m trying to set a personal record than if I’m trying to set a world record.
I’ve put a lot of effort into trying to help set things up so that the Roadmaps serve as a powerful forcing function. But I can’t be confident that it will work out. Time will tell.
Do these benefits apply only to the most safety-oriented companies?
Above, I’ve stressed the value of having flexibility to do the right thing. But with flexibility to do the right thing comes flexibility to do nothing. RSP v3 says we have to publish Risk Reports and Roadmaps, but it doesn’t force us to make either one good. What happens if other companies adopt similar RSPs?
I broadly do hope that other companies adopt similar RSPs.
I do not believe that any frontier AI company will actually unilaterally pause or slow AI development (by a significant amount) on the basis of this sort of policy, so I think the downsides of their admitting as much are limited (note that I address the impact of this change on potential regulation below). And I think there are serious downsides to postponing this reckoning. I think the reframe lowers the risk that they will (when the time comes) just be saying “Wow this policy is ridiculous, let’s toss it out and remove any mention of this set of ideas for AI safety.”
I think it would be great if other companies published Risk Reports and Roadmaps and pursued external review.
I don’t think anything in the text of the RSP forces these documents to be good, but it doesn’t have to. Outside observers can compare different companies’ actual Risk Reports and actual Roadmaps, not just the text of their RSP-like policies. If other companies commit to producing these documents, and if Anthropic does a good job with ours, there will be pressure for other companies to make theirs good. How much pressure there is depends on the general level of concern for AI risks.
A revised, but not overturned, vision for RSPs
My picture of the goals of RSPs has evolved since 2023, but it still heavily overlaps with the original.
I am still excited—actually, more excited—about goal 1: create forcing functions for AI developers to move with urgency and focus on risk mitigations. I’ve seen enough “proof of concept” for RSP effectiveness here that I want to see how much better we can make it. I think our Roadmap is better set up to find a balance of ambition and achievability than the previous approach.
I am still excited about goal 2 (testbed for practices and policies that can feed into regulation), but the emphasis has shifted for me. I’m thinking of this less as “If companies agree that an ambitious safety regime would abstractly be desirable, maybe it will increase the odds we get that regime,” and more as “If companies do more and more risk-reducing things that don’t slow them down, then more and more of those things can become required by regulation over time.” But I haven’t dropped my interest in the former, and the “recommendations for industry-wide safety” are still in there.
And I am excited about systematic public risk assessment and other aspects of goal 3. I think the new RSP is a step up in ambition on this front.
I don’t (and never did) believe RSPs can, on their own, get us to a world where AI risk is very low. But I think they can make us safer.
Q&A
On the move away from implied unilateral commitments
Is RSP v3 proactively sending a “race-to-the-bottom” signal? Why be the first company to explicitly abandon the high ambition for achieving low levels of risk?
If this “high ambition” means implied unilateral pausing commitments, then I don’t think these are doing much good, and I think it’s better to make a change sooner rather than later.
I think the specific problems with this setup do vary somewhat depending on what kind of AI developer we’re talking about.
For a company like Anthropic that (I claim) has a large number of people making a genuine effort to do the best thing for risk reduction, it warps our planning and risk assessment (as discussed above) and is a bad idea on the merits.
If a company has little intrinsic interest in risk reduction and just made the commitment to look good, I think the commitment is likely having little to no effect, and at the point where it would clearly bite, it will likely simply be abandoned (with an attendant loss of credibility for RSPs and safety people generally, since I expect general audiences to consider it unreasonable once its full implications are understood).
Either way, to the extent RSPs are seen as unilateral commitments, I don’t think this is something anyone should be trying to preserve. I’d rather move toward a world where RSPs have more impact on risk reduction, as outlined above.
How sure are you that a voluntary industry-wide pause can’t happen? Are you worried about signaling that you’ll be the first to defect in a prisoner’s dilemma?
The RSP revision is not Anthropic saying “We will go ahead with AI development and deployment regardless of risk profile.” It’s removing unilateral commitments and giving more flexibility. If Anthropic does move forward with high-risk AI systems (something it has not yet done, IMO), it will need to document this thinking in the Risk Report and specifically discuss “what we know about how our current and future model capabilities and risk mitigations compare to those of relevant competitors” and “the steps we took to raise public awareness of the relevant risks and to encourage appropriate regulatory action, including our engagement with policymakers and other developers.”
I wouldn’t want Anthropic to move forward with high-risk AI systems in a situation where an industry-wide pause or slowdown (voluntary or otherwise) looked like an alternative. I just also don’t want Anthropic to be pre-committed to a pause regardless of that.
With that said, I will also own that I strongly think today’s environment does not fit the “prisoner’s dilemma” model. In today’s environment, I think there are companies not terribly far behind the frontier that would see any unilateral pause or slowdown as an opportunity rather than a warning. This post isn’t where I want to name names, but it seems clear to me that this is true of at least some companies. This could change in the future.
How sure are you that you can’t actually sprint to achieve the level of information security, alignment science understanding, and deployment safeguards needed to make arbitrarily powerful AI systems low-risk?
It’s always hard to predict what could happen, especially if we can develop some intermediate powerful-but-safe AI that massively boosts our efforts on these fronts, but I’m confident that we can’t be confident enough to have unilateral commitments in place.
If model weights became a top target for the best-resourced state actors, the measures needed to make theft difficult would be extreme, and seem incompatible in any near term with being a high-velocity AI development company. RAND’s paper on securing model weights is a good reference for what this would have to look like (and specifically states: “Achieving SL5 [its term for this level of security] is currently not possible. Realizing all SL5 measures will likely require assistance from the national security community.”)
Then there’s the question of how to make risk very low from misaligned power-seeking from potentially superintelligent AI systems:
We have very high uncertainty about how big these risks are and how we might reduce them.
In some cases, key risks may not emerge until AI systems become highly advanced, making it hard to study and remediate them in advance.
Much of the work we’re doing has large exploratory components, and it’s very hard to say whether it will succeed.
There is a real chance that we will end up with a very strong understanding of the risks and how to prevent them, and one way this might happen is via AI itself greatly accelerating our safety research. However, there is also a real chance that we’ll land in a spot where our understanding is deeply limited, and all we can make is an educated guess about the level of risk.
What message will this change send to regulators? Will it make ambitious regulation less likely by making companies’ commitments to low risk look less serious?
I think the effects here aren’t super clear.
On one hand, perhaps this change will cause regulators to think: “Companies still say the same things about what it would take to make AI safe, but they aren’t planning to do those things themselves, so this is just cheap talk—not something we should actually build policies around.”
On the other hand, perhaps the change will cause regulators to think, “What Anthropic (and potentially others) is saying is that it cannot keep risks low without help from regulation. We can’t leave this up to the companies. The current transparency requirements aren’t enough.”
It’s significant to me that the above points both seem mostly true. I believe “Companies … aren’t planning to [unilaterally] do those things themselves” is true regardless of whether this change goes through, and so I don’t see a lot of value in avoiding regulators’ coming to see this. “We can’t leave this up to the companies” is also true.
Overall, the message this sends regulators seems true, and that seems like a point in its favor. I wouldn’t be excited to continue eating the daily costs of the old RSP for the sake of avoiding sending messages that are substantially true.
Why did you have to do this now—couldn’t you have waited until the last possible moment to make this change, in case the more ambitious risk mitigations ended up working out?
I consider this change urgent, and I wish it had happened a lot sooner.
The negative effects of the old RSP, discussed above, have been distorting our work on risk mitigations and risk assessment on a daily basis.
I believe the new practices around Risk Reports and Roadmaps (which, as I discuss below, I consider partially-to-fully incompatible with the old RSP) are valuable, and I’d like to start the slow process of aiming for them to evolve and mature to the point of ecosystem-wide adoption.
I think a “last possible moment” change would potentially send a much worse race-to-the-bottom signal than a principled revision (more in the next section), and I think the “last possible moment” could be relatively soon.
Could you have drafted the new RSP, then waited until you had to invoke your “escape clause” and introduced it then? Or introduced the new RSP as “what we will do if we invoke our escape clause?”
I don’t think this would have obviously been the wrong call. But I ultimately lean against. Here are some drawbacks I see to this approach:
It seems like a recipe for dysfunction to have two incompatible plans for future risk mitigations—the “ambitious but achievable” plan of our Roadmap on one hand, and the risk mitigations of the old RSP on the other. In order to avoid confusion, we would have had to state very clearly and loudly to the company that the Roadmap, not the RSP’s listed mitigations, should be guiding our day-to-day work on improving risk mitigations, which seems like it would have undermined a lot of the potential benefits of delaying the change.
It seems like a lot of people imagine that invoking the “escape clause” would lead to a productive moment of public alarm. I feel the opposite way, due to the “grey zone” problem mentioned above. I worry that this approach would end up communicating to other AI developers: “RSP commitments can be scrapped as soon as they’re sufficiently inconvenient, and invoking the ‘escape clause’ doesn’t actually result in anyone being concerned or doing anything” [this is what I’d expect to happen if the clause were invoked in an environment similar to today’s]. I prefer the message sent by a principled revision.
The new Risk Reports and Roadmap are nice, but couldn’t you have put them out without also making the key revision of moving away from unilateral commitments?
I think the Roadmap is incompatible with RSP v2. Our teams need to know what risk reduction work to prioritize—work based on an “ambitious but achievable” ethos or work based on a “whatever would make risks very low” ethos. As discussed above, I think the latter is distorting.
Risk Reports don’t have the same level of incompatibility with RSP v2, but I think RSP v2 would subject them to distorting pressures: we’d constantly be needing to argue either that our AI systems are below the relevant capability thresholds, or that our risk mitigations are in line with the associated requirements, to avoid dire (for Anthropic and for the world, IMO) consequences. I would not be excited to work on Risk Reports under such circumstances.
Why isn’t a unilateral pause a good idea? It could be a big credible signal of danger, which could lead to policy action.
I think there are some circumstances in which this would be true, but others in which it would be backwards.
On one hand, unilaterally pausing or slowing down could be a credible signal of seriousness. On the other, a regulation-backed pause or slowdown could be more likely if others don’t feel they are on track to win in the absence of such a thing. There is a significant degree to which the leader in a race is the most credible party to offer or advocate a draw (and to which, more broadly, parties are more credible when their prospects for winning are better).
In today’s political environment, I expect that unilaterally pausing and trying to “raise an alarm” would simply make us look deluded and over-alarmist, while galvanizing other AI developers (and probably improving their fundraising and recruiting prospects).
A unilateral pause may turn out to be a good idea in the future. But it doesn’t seem wise to be committed to one, especially not at the specific point where we first come to believe our systems are dangerous (which I now believe is unlikely to be near the point where we can convince others of this).
Could a unilateral pause ever be a good idea? Why not commit to a unilateral pause in cases where it would be a good idea?
Yes, I think a unilateral pause could be a good idea, both on consequentialist and non-consequentialist grounds, under various future circumstances that seem like they could plausibly come about. I have a private document on what sorts of circumstances I think would call for this, which I may adapt into a public piece later.
But as discussed above, commitments are risky, and the prudence of making one depends a lot on how confident we are in being able to articulate a robustly good one. I have very low confidence in being able to articulate robust, operationalized circumstances under which a unilateral pause is a good idea. A lot of the key factors come down to things like “What is the political environment?” and “What evidence of risk can we provide, and who will and won’t find it compelling?”
Why didn’t you communicate about the change differently? I’m worried that the way you framed this will cause audience X to take away message Y.
I got a lot of questions of this form while getting private feedback on previews of the new RSP and associated materials, and I’m not going to exhaustively go through each version of the question.
There are a fair number of different audiences that matter here, and for each audience, I think there is quite a lot of variety in how different people model that audience. Communicating with many audiences (many of which are going to engage relatively lightly) is a challenging task, and any path we chose would have had many potential drawbacks and (IMO) made many people feel that we were communicating the wrong message to some important audience.
We did our best, but overall I make no claim that we avoided all risks of “audience X takes away message Y when it would’ve been better for them to take away message Z.” What I have tried to ensure is that:
Our messaging is honest: it accurately and fairly communicates what’s important about the new RSP, and why we’re making this change.[5]
This post provides a detailed accounting of my thinking for high-engagement readers. If you’re reading it, you’re probably in the target audience for this post, and not in the target audience for most other public communications around the new RSP.
Why don’t Anthropic’s and your communications about this have a more alarmed and/or disappointed vibe? I reluctantly concede that this revision makes sense on the merits, but I’m sad about it. Aren’t you?
I am sad and disappointed that there seems to be so little interest in the kind of AI regulation I’d consider sufficient to keep risks low. I had definitely hoped for a different trajectory here, something more like “a consensus builds, helped along by RSP-like policies, that AI should be subject to the kind of regulation that can keep risks low.”
But I think this update is pretty old news at this point, and I am excited about the new RSP. I am excited about the Roadmap, the Risk Reports, the move toward external review, and the unwinding of some of the old requirements that I felt were distorting our safety efforts.
I think the first version of most things is pretty bad; I think the RSP has had some impressive impacts and some bad ones; and I think the revision is a step toward amplifying the good, ameliorating the bad, and ending up with lots of risk reduction.
(Repeated from above) This revision isn’t being prompted by “catastrophic risk from today’s AI systems is now high” (I don’t think it is), or by “We’ve just realized that strong global policy action isn’t looking likely” (I think this is not a very recent update). First and foremost, in my mind, it is about learning from design flaws and making improvements.
On other components of the new RSP
The new RSP’s “commitments related to competitors” seem vague and weak. Could you add more and/or strengthen these? They don’t seem sufficient as-is to provide strong assurance against a “prisoner’s dilemma” world where each relevant company wishes it could be more careful, but rushes due to pressure from others.
At a high level, I think of this part of the RSP as content that is trying to signal intentions rather than ambitiously or robustly solve a coordination problem. For the latter, I think we’d need a different sort of mechanism than this policy. As written, I think both the downsides and upsides of this content are pretty low.
In general, I think it’s bad to make commitments that we can’t be confident are good ones. It’s very hard for us to know the status of our competitors’ risk mitigations, and even to a significant extent their model capabilities (especially the capabilities of models that are being used internally, in the midst of training, etc.) A commitment like “we will pause if we don’t have affirmative evidence that our competitors are doing X” could put us in situations where we’ve committed to something very costly that nobody actually thinks is a good idea for the company or the world, which is something I’ve discussed the importance of avoiding above.
After seeing the effects of an overly rigid RSP, I’ve generally wanted to update in the direction of “fewer commitments where we don’t have a robust understanding of what they could mean for us; more flexibility; try to get companies to compete on the content of things like Risk Reports and Roadmaps rather than on the text of their policies.” (That’s for this iteration—a future version may aim to tighten more if the old problems seem mostly gone.)
That said, I don’t think these commitments accomplish nothing. For example, if we achieved strong assurance against catastrophic harms from misalignment, I expect we would be quite public about this, and to the extent we couldn’t give strong evidence in public we would offer to share it in private with relevant parties. If all relevant companies approached the overall matter similarly to us, I expect these sorts of commitments would end up being a significant forcing function for avoiding a race to the bottom.
Why is external review only required at an extreme capability level? Why not just require it now?
I see external review as both high-stakes and experimental, a tough combination.
It’s high-stakes because it amounts to giving a third party the opportunity to review highly sensitive information and the credibility to opine publicly on private information about our risk management. If they do so unreasonably, we have relatively little recourse: they have significant credibility due to the combination of our choosing them and their lack of the problematic incentives we have, and it’s hard to argue in public about sensitive private details.
It’s experimental because we have close to no experience with it to date (beyond this partial pilot covering one of our four threat models).
We want our external reviewer to meet a high bar of impartiality (not just “no equity in Anthropic” but a broad lack of any connections that could present a conflict of interest in reality or perception). This could be challenging given that a lot of our employees are socially integrated into the AI safety community, and given that these employees could be a major source of donations for the kinds of nonprofits that could be potential external reviewers. Government agencies are unlikely to want to publicly opine on specific companies’ risk practices, and we’re still evaluating these sorts of conflict-of-interest questions for potential private partners. As such, we’re still figuring out what organizations could ultimately serve this function, though there are credible candidates.
Despite all of this, we are aiming to do regular external review asap, starting with our current risk report (where the process is underway). We just haven’t committed to do and stand by these reviews until a very high capability threshold.
The basic idea here is that due to the high stakes and experimental nature, committing to do this later puts a lot of pressure on us to get experience with it now—and sends a signal to potential external reviewers that the demand is going to be there. I think those are the most important things to be accomplishing now, and so I’ve felt comfortable erring on the side of a late commitment that may not require external review as soon as optimal (but is less likely to land us in a situation where we regret committing to it because no suitable party is available).
All of that said, I think the threshold for requiring external review would ideally be somewhat lower/earlier, and the reason it is where it is partly just reflects our time constraints—refining that threshold isn’t where we chose to spend the energy this time around, and we may do so in a future iteration (e.g., an RSP v3.1 or something).
The new commitments are mostly about Risk Reports and Roadmap—what stops companies from just making these really perfunctory?
The RSP text doesn’t do a ton to ensure these documents will be good, and I think that’s OK.
In addition to publishing the RSP, we’ve also published our first Risk Reports and Roadmap, and I think they are in fact good. (I welcome feedback on these, of course!)
If other companies decide to imitate our RSP, I hope they will feel pressure to match the quality of our actual Risk Reports and Roadmap, not just the text of our policy.
In general, I’ve been updating away from a model like:
Companies ‘race to the top’ (competing with each other on safety measures) on the text of their RSPs
-> the text of the RSPs makes them more likely to do good things, regardless of other things going on
…and toward a model like:
Companies adopt RSPs
-> then some of them do good things that are in the spirit of the RSP, because they have truly bought-in employees
-> then all companies “race to the top” to match the visible aspects of those good things.
Why isn’t the RSP more “adversarially designed” such that once a company adopts it, it will improve their practices even if nobody at the company values safety at all?
I don’t think this has ever been a good goal for RSPs. I don’t think there is good enough understanding of risk assessment and mitigation to be sufficiently detailed and prescriptive for this goal, without running into the pitfalls of overly rigid commitments discussed above.
RSPs rely for most of their impact on companies’ having some people who take them seriously. It’s possible that external pressure can lead companies whose leadership doesn’t take AI risk seriously to hire individuals who do; this requires pressure that is sensitive to companies’ actual practices, not just the text of their policies.
(That said, I do hold out some hope that companies’ simply stating what it would take to make AI safe could be helpful for regulation, if the political environment changes. This doesn’t rely on any kind of “adversarial design” approach.)
What are the consequences of missing your Roadmap commitments? If they aren’t dire, will anyone care about them?
I discuss this above.
OK, but does that apply to other companies too? How will Roadmaps force other companies to get things done?
Why aren’t the “recommendations for industry-wide safety” more specific? Why is it built around “safety cases” instead of “ASLs” with specific lists of needed risk mitigations?
My current view (consistent with our current usage of “ASLs”) is that making specific lists of risk mitigations is most sensible for cases where we have a pretty good understanding of what those lists should look like. I think this is currently at least moderately true for “ASL-3” protections (which are actually in place), but I think there’s much less we can usefully say about what it would/will take to provide high assurance against risks from higher AI capability levels listed in our recommendations for industry-wide safety. Over time, we may be able to say more, and I’d want to revise the recommendations accordingly.
I want the “recommendations for industry-wide safety” to be something that I (and Anthropic) could stand behind—as “we think governments should actually put this framework in place and enforce it”—if there were a surge in political will and people were asking us what regulation would make risks low. This doesn’t currently look like “Use X alignment technique and Y information security controls,” it looks more like an FDA-inspired regime in which AI developers have flexibility to make a case that risks are low (which must address certain topics). The “recommendations for industry-wide safety” in our RSP are the best I would be able to offer, at the moment, in such circumstances.
What is the point of making commitments if you can revise them anytime?
I first started pushing for the move to RSP v3 in February 2025. It’s been a very long and effortful process, and “we can revise anytime” doesn’t seem remotely accurate as a description of the situation.
I’d like there to be some friction to revising our RSP, though at least somewhat less than there was for this update.
In my ideal world, revisions like this would get significant scrutiny, focused on the question: “Are these changes good on the merits?” But people would not start from a strong prior that the mere fact of loosening previous commitments is per se bad.
RSPs were introduced when apparent prospects for aggressive regulation were roughly at their peak—a handful of months after Yoshua Bengio, Geoffrey Hinton and many others had first publicly spoken out about catastrophic risks from AI, and shortly before the UK AI Safety Summit would feature King Charles comparing AI to the invention of fire and a US Executive Order on AI would go out. Extrapolating the progress from that point in 2023 would’ve given a very different expectation for 2026 from the picture we see today, IMO. ↩︎
I concede it’s not obvious that this was better than what else the researchers would have done with their time, although I generally have a bias toward being a fan of projects that have tangibly useful outputs, and given how relatively quick this one was, it seems like a pretty good thing to have prioritized. ↩︎
Of course, this isn’t the only case of the company successfully executing on some goal that requires participation and buy-in from many parts of the company. But the goal is somewhat odd compared to the kinds of goals that businesses usually prioritize, and my strong impression is that the RSP is the reason it got prioritized the way it did. ↩︎
In general, I think Anthropic should balance risk reduction with business needs using a more “normal” framework than “Business needs have approximately no value compared to risk reduction.” If we took the latter attitude, I think we wouldn’t be competitive and whatever practices we implemented would not be well-suited to adoption by other AI developers. I endorse the content of section 6.2 of our Risk Report that spells this out more. ↩︎
Different people tend to have different stories of “what changed,” consistent with having had different understandings of the goals of RSPs at the time the original one was adopted. This post presents my take on what changed, and the Anthropic blog post presents the company’s (leadership’s) take. I believe both are honest. ↩︎
I overall appreciate this post a lot.
I’ll start by commenting on some random sections that I had thoughts on while reading. My high-level thoughts are in this other comment. For readers of this comment, feel free to skip any section; the sections are largely intended to stand on their own.
Yes, Anthropic employees on more than a dozen occasions told me that the RSP binds them to a mast. I had many very explicit conversations with many Anthropic employees about this, because I was following up on what I thought was Anthropic violating what I perceived to be a promise to not push forward the state of AI capabilities, which many employees disputed had happened.
In almost every conversation I had with Anthropic employees about this, they explicitly said that, in contrast to any potentially vague promises that were made about not pushing forward the state of capabilities, the RSP was very much an explicit commitment, and that I should absolutely hold Anthropic accountable to the thresholds outlined in that RSP.
Here for example is a conversation with Evan Hubinger. The conversation starts with someone saying:
Evan responded with (across a few different comments, emphasis mine):
This was, in my experience, routine[1]. I therefore do see this switch from “RSP as concrete if-then-commitments” to “RSP as positive milestone setting” as constituting a meaningful breaking of a promise. Yes, the RSP always said in its exact words that Anthropic could revise it, but people who said that this condition would end up being triggered were frequently dismissed and insulted, as in the comment above.
And to be clear, I think this is a huge deal! My experience interfacing with Anthropic on RSP-adjacent topics has been pretty universally terrible, with a very routine experience of being gaslit (with the exception of interfacing specifically with you, Holden, on this topic, where your comments have seemed clear and reasonable and consistent across time to me).
I am glad to see this post as a kind of reckoning with many of these bad implicit promises, but at some point Anthropic has failed so many times to set reasonable expectations, and has acted so many times adversarially to people trying to get clarity on commitments, that it becomes very hard to have any kind of non-adversarial relationship to it. I do think this post helps, and I hope it might open up better and less adversarial future communications.
My current belief is that the state of jailbreaking is “basically all models can be jailbroken with roughly 10 hours of effort”[2]. I do think this maybe satisfies the ASL-3 commitments made, but it seems 75%+ likely that jailbreaking will continue to be at most a moderate obstacle. In other words, anyone serious about building biological weapons currently is at most minorly inconvenienced by the need for jailbreaks (10 hours is an extremely small fraction of the time spent on executing any such project).
I feel like the right attitude towards the whole jailbreak thing is mostly to say “yep, we haven’t succeeded at getting jailbreak resistance that makes any substantial difference for any serious attack scenarios, and we are not on track to get there”. As such, listing this in the list of successes of the RSP seems weird to me.
Also, I might be wrong about jailbreak resistance! I have maybe spent like 5 hours thinking about this, which is really not that much, and possibly there has been more progress behind the scenes than I thought.
At various events I was at, and in conversations I had with people, Anthropic employees told me they were aiming to achieve robustness to state-backed hacking programs, and that they were ready to pause if they could not achieve that (as the RSP “committed” them to such things).
As someone with an amateur computer security background, this seemed absurd to me, and I consistently told people (both online and at many events I went to) that achieving robustness to state-backed hacking programs was extremely unlikely to be achievable, and that we should by default assume that Anthropic will change the ASL-4 security thresholds before they need to meet them (@ryan_greenblatt can I think attest to my repeated frustration on this topic).
This was met with broad dismissal by basically every Anthropic employee I talked to, and many people told me that I must simply be bad at thinking about cybersecurity given that I lack more extensive experience and credentials in the space and that I should defer to Jason Clinton who (I was told) was saying that Anthropic was aiming for that kind of robustness, and that Anthropic was on track to achieve it, and was still committed to meeting ASL-4 security standards before the relevant capability thresholds were crossed.
I don’t really have much more to say besides “I told you so” and “I really don’t know why any of you thought this was achievable”. Like, my complaint at the time wasn’t even that I thought Anthropic would try too hard to stay competitive and so make bad tradeoffs on cybersecurity grounds. It just seemed clear that really from almost any worldview, the ASL-4 security commitments (and also some of the text of ASL-3 which I covered here) were not achievable and were a clear indicator that Anthropic would have to either implode itself or drastically change the commitments, which wasn’t in anyone’s interest.
This does seem to me a pretty major strike against Anthropic’s ability to be well-calibrated about its ability to overcome difficult safety-related challenges in even pretty straightforward domains like cybersecurity (which is much easier to be well-calibrated in than e.g. knowing whether models have deep misaligned propensities). Like, the fact that cybersecurity has been universally offense-dominant for decades and that practically no large IT project has achieved state-robust cybersecurity is common knowledge in computer security circles, and the experience of going from those circles to talking to what appeared to me at the time to be vastly overconfident Anthropic employees was stark and suggests some kind of weird filter bubble must have happened.
IDK, I am still not super confident on this, but it does seem that people updated in my direction over the last few years, and my position has for a long time been that Anthropic’s security commitments have been very over-ambitious.
ETA: I do think this also caused substantial strategic mistakes among many people I talked to. A common conversation I had was people saying “it’s really important we beat China”, me saying “I mean, I disagree, but even if I did, China will just steal the weights if you make really powerful AI”, them saying “that can’t be right, Anthropic has committed to being robust to that, and if it’s doable, I highly doubt that OpenAI let their weights be stolen. It’s an ambitious project, but I don’t see a way around it”.
If there had been common knowledge (which I think there probably would have been in the absence of Anthropic’s RSP and associated recruitment/marketing/comms efforts) that achieving nation-state robust cybersecurity was not achievable unless very drastic actions were taken, I do think this would have caused some people to change strategies substantially.
Thank you for this section. I would be very interested in seeing you take your private document and making it public. I do think if we want good outcomes from AI, there will be a time when unilaterally pausing is the right choice (my guess is the ideal time was in the past, but I can see reasonable disagreement on this), and so trying to think through this seems very important.
(My broader reflections are in their own comment)
As another small datapoint: Zac Hatfield-Dodds reacted with an “agree” react to Evan’s comment, indicating he agreed with Evan’s response and framing.
To be clear, no one has published any easily developed universal jailbreaks (UK AISI published a procedure that produced one such jailbreak but which required many months of work), but query-specific jailbreaks can seemingly be achieved by anyone who tries medium hard and seem like the relevant comparison here. If universal jailbreaks were easy, then of course such jailbreaks would get published and then the marginal cost for someone to paste in such a jailbreak would be close to 0 hours.
This doesn’t seem “consistent across time” to me, given that Holden is the author of a report called If-Then Commitments for AI Risk Reduction.
Also, from We’re Not Ready: thoughts on “pausing” and responsible scaling policies:
To me this language seems entirely consistent with my current position. In fact, if I had (at that time) been hoping for a voluntary pause, it would have been a strange choice for me to emphasize “consensus” leading to “regulation” when I could have just said something like “I’m excited about RSPs partly because it seems like people in those categories—not just people who agree with my estimates about risks—should support RSPs. This means that voluntary pausing based on conditional commitments is more likely than voluntary pausing today.”
I believe that all of my past writing on if-then commitments presents the theory of change as running through policy action, and generally discusses “escape clauses” as well, rather than focusing on voluntary action as the theory of change.
I just skimmed the piece, and it does seem consistent over time to me. E.g., under “Potential Benefits”, the piece does not list a unilateral pause in case of unmitigated risks.
That’s a good point. It does say:
This isn’t explicit about a unilateral pause, but I think it would be kinda weird if this meant to imply “And, if needed, we will delay [...] unless other groups are not also delaying.”
Note this quote later in that same piece:
Voluntary commitments and even regulation could be too hard to enforce across the board—such that responsible actors end up adhering to if-then commitments, while irresponsible actors rush forward with dangerous AI. One of the challenges with AI is that complete enforcement of any given risk mitigation framework seems extremely hard to achieve, yet incomplete enforcement could end up disadvantaging responsible actors in a high-stakes, global technology race. This is a general issue with most ways of reducing AI risks, other than “race forward and hope that the benefits outweigh the costs,” and is not specific to if-then commitments.
To help mitigate this issue, early, voluntary if-then commitments can contain “escape clauses” along the lines of: “We may cease adhering to these commitments if some actor who is not adhering to them is close to building more capable models than ours.” (Some more detailed suggested language for such a commitment is provided by METR, a nonprofit that works on AI evaluations.)
Just for reference, this framing is what makes me feel fine about things you said on this topic, but not fine about conversations I’ve had with Anthropic employees about this topic in the last few years. My conversations with Anthropic employees definitely did not involve them saying “we are committing to our RSP only if every other company also adopts a similar RSP”.
At the most they were saying “we are going to revise our RSP as we learn more about what an effective RSP would look like and might make changes in accordance with that”, which is of course drastically different. If the commitment all along had been to “commit to the RSP conditional on other people also committing to equivalent policies”, then the RSP could have said that directly, and the change from an unconditional to a conditional policy is of course massive (and I think the RSP as written was clearly communicating itself as an unconditional policy).
One interpretation for how Holden might have been consistent over time: He did not think that Anthropic should unilaterally pause AI development if other companies race ahead. But he did think the RSP should say that they’d pause when there are unmitigated risks regardless of the context and race-dynamics since saying so in the RSP is a good forcing function for the actual benefits that he wished would follow from it.
(Tbc., I do not know what Holden believed, I’m just constructing a plausible reality)
(Also, even then he at least seems to have changed his mind about whether writing down If-Then commitments is a good idea!)
Yes, inasmuch as the Anthropic RSP was intended as an implementation of if-then-commitments with specific ifs and thens, that would be inconsistent. But IIRC Holden didn’t work at Anthropic at the time the RSP was developed or adopted, and I didn’t see any writing by Holden about the degree to which he considers Anthropic committed to these thresholds, or sees it as a clear instance of something that follows the shape of what is in that paper.
I was here mostly referencing specific conversations or interactions (in e.g. comment threads and emails) I had with Holden as opposed to others at Anthropic about the RSP.
I don’t really understand what point you think my quote is making for you. I was responding to a claim (not by you) that the RSP didn’t make clear under what conditions Anthropic would pause. But it very much did! It was very clear, right in the text! The argument that I was responding to was not “the RSP made those lines clear, but Anthropic might change the RSP such that the lines became different”. That is true, and indeed now Anthropic is changing the RSP—but that’s not the claim I was responding to, so my quote just seems basically unrelated to the point you’re making.
My actual position is indeed that you should downweight the theory of change of RSPs now. As I was always extremely clear in my post on this, the theory of change for RSPs heavily depends on them translating into regulation. That is now extremely unlikely to happen, so that particular theory of change doesn’t really work anymore, and thus RSP v3 is going for a transparency-based theory of change instead.
Please try a bit harder to engage with me and others here. I am pretty sure you can at least see the shape of the complaint.
The RSP did of course absolutely not make clear under what conditions Anthropic would actually pause[1], because the most crucial condition that in the end actually mattered was “whether Anthropic leadership actually thinks it’s a good idea to pause” (because if they don’t think it’s a good idea they will just change the RSP).
Many people expressed concerns about this! This concern is included in the conversation above, and in my follow-up conversation in the thread.
You have not “committed to pausing when condition X triggers” if you can simply change the commitment any time! That’s not how commitments work.
The appropriate thing to say at the time, given how you seem to be relating to the RSP now, would have been:
But you did not say that! Indeed you really said something extremely close to the opposite of this.
I mean, just imagine this in any other context. Imagine I am talking with a user worried about how we treat their private user data on LessWrong:
Them: “hey, an issue with your privacy policy is that it doesn’t actually help me understand how likely it is for my DMs to actually be sent to third parties”.
Me: “What are you talking about? Our privacy policy is very clear that we commit to never send your DMs to third parties. I honestly have no idea how you missed that. It’s RIGHT THERE. This doesn’t seem worth engaging with given that you don’t seem to have even read our privacy policy at all”.
Then a week later I trigger sub-entry 58 of the privacy policy which says I can change the privacy policy any time, and send all of their DMs to random third parties.
The user gets upset and says “you said you committed to never send my DMs to third parties”! I say “I really have no idea how you think that conversation might be relevant. I was responding to a question by you about what our privacy policy says, I wasn’t responding to a question from you about how our privacy policy might change in the future. I changed our privacy policy so it doesn’t say that anymore. My quote about ‘committing to never send DMs to third parties’ seems basically unrelated to the point you are making.”
Since we now have the technology, I threw all of the relevant comments in these threads into Claude, with just the prompt “Hey, what is your assessment of the situation described in these comment threads?”, and no special memory or system prompt.
This is what Claude said:
I think Claude’s summary is decent-ish here. I think if Claude can see at least the shape of this whole concern, so can you.
There even was an “actually” right there in the original question! IMO the original question was clearly about the degree to which the RSP provides actual guidance about Anthropic’s future actions, not just the narrower question of “under the assumption that Anthropic leadership doesn’t change the RSP, what does it say about Anthropic’s behavior?”. The Manifold market you and I made also does not have any “unless Anthropic changes the RSP commitments” clause in it.
Put all the weight on the word ‘commits’ in the exchange and I think you’ll see Oli’s read! It took me a second to see it, too.
(I don’t understand this comment and don’t currently think it’s a helpful contribution)
Hmmm.. I think I took Evan’s opening line “I don’t really understand what point you think my quote is making” very literally. On rereading, I think Evan does understand the point you think his quote is making, but disagrees that it makes that point (I haven’t spoken with Evan enough to know how literal he is).
(And I just agree with you that if someone uses the word commit, they shouldn’t be locally redefining it to mean ‘commit unless we change our mind’, in particular when responding to a question that opens with the phrase ‘actually commit’.)
Retracted!
The tweet you link to here is targeting a chemical weapon, not a biological weapon, which was explicitly out of scope per footnote 3 on the post on ASL-3 protections at the time. I don’t think sarin synthesis instructions are much evidence either way here.
I would guess that getting similarly helpful uplift for eg weaponizing dangerous pathogens would be significantly harder, especially with the classifiers of 2026 which are a lot better, but I agree the universal jailbreak focus isn’t that well justified for a threat model where misuse actors have in-house jailbreaking effort. Would be pretty interested if you know of evidence that people can get single-topic bio jailbreaks past current classifiers without extremely large amounts of effort!
I thought the above was one such example (and I looked in all the places where it was posted to find someone saying something like that), so maybe I am wrong!
Possible that I am wrong about this. I’ll think a bit more in the following days.
Drake is right, sorry for the confusion. We were not intentionally misleading – when putting together the initial tweet thread, we missed a footnote on the announcement that narrowed the claim to just bio rather than CBRN, as discussed in the rest of the announcement. We did later find a vulnerability that allowed us to bypass the filters in the bio setting, reported it, and it was patched. I think that follow-up work took more on the order of 40 person-hours, but was a general method that could extract information in a range of settings. I don’t know how likely it is that there are further such vulnerabilities.
Even if single-query jailbreaking were O(10) hours though, having to send many queries to discover that jailbreak makes it much easier to catch through monitoring.
Do query-specific jailbreaks require knowledge of the ground truth? If you use that high-profile jailbreak which generated a plausible-sounding sarin recipe[1]—which iiuc worked by asking the model to help the user write a short story/article containing the diary of a chemical terrorist, or something like that—do you actually get a workable sarin recipe, or do you get a procedure which actually fails/explodes/gets you caught easily because you tripped every single government watchlist filter at the same time?[2] What if you get your jailbreak slightly wrong? In the process of getting the right answer, will you get half a dozen fake answers that you can’t distinguish between?
Some of it seems silly to me just reading it: weeks 1-2? For a chlorination step in an ice bath? That’s like, an afternoon of work, max.
I’m pretty sure that you can’t “just order” the relevant chemicals for sarin, in particular.
This was Pliny’s response when I asked them if they could get around the classifiers. I’m not fully confident this counts, but Pliny seems to think so.
Can you expand on what is there in this post that you appreciate?
How do you reconcile that with the fact that Claude has recently been used by the US Government to process classified information? Presumably they have a special version on special servers for that, but still, this looks like some degree of robustness which might be achieved with a model not served to a wide audience.
I think this is referring to protecting Claude’s weights from being stolen by a state-backed hacker, not about making Claude usable by governments.
I believe these things are connected with each other: if the server and the software system in general are safe enough to work with lots of classified information on a regular basis, they are safe enough to store the weights as well.
Didn’t top secret US government networks have breaches by the Chinese before, when the stakes were probably lower? Are you thinking that those networks are much more secure now than they were a decade ago?
The deployment for top secret government networks probably is reasonably secure. The problem is that they also store those weights in a bunch of other data centers that are necessarily connected to the internet, and the only way to not do that would be to shut down their consumer product and lose 99% of their income.
I agree that the models served to civilian customers over API can’t be realistically secured from state adversaries, but if we are speaking about advanced AI R&D in the future, like in AI 2027, then it looks feasible to conduct it on protected servers. Maybe I misunderstood the author’s opinion.
No, it’s not at the moment feasible to keep weights limited to servers with nation-state level robustness without building infrastructure that would halt frontier training progress at the relevant company for months, if not years.
Nation states steal classified info on a fairly regular basis, right? I’m not familiar enough with the field to have a definite opinion, but it’s not obvious to me that ASL-4 security is achievable with the controls we normally use on Secret or even Top Secret info.
My sense is most efforts which seem to be aimed at nation-state robust cybersecurity for AI (outside labs) have been driven by things like the RAND report and some theory of change downstream of pieces like Situational Awareness which argue that governments may push for TS/SCI classified AI development at some point (for some applications)—not necessarily downstream of Anthropic’s RSP (I’ve never heard anyone mention it directly in, for example, any discussion around why SL5 security is important).
That said, (a) I agree with you that this goal is impractical for general AI development and it was foolish of Anthropic to commit to something close to security against state-backed attacks, (b) I have heard critique of Jason Clinton’s PoV from parts of the AI cybersecurity community that, having never worked in an Intelligence Community cybersecurity role, he lacks information that would update him on the difficulties, and (c) I remain confused why some still consider SL5 security for AI model weights a tractable or important goal in the field.
I don’t know what your background is, but to be clear, I am saying that many people who do not work in computer security have made bad strategic decisions because of this. I don’t think almost anyone working in computer security was super misled here.
Maybe our experiences still diverge with that clarification, but it seemed good to check before I dig deeper.
That’s fair! I am mostly thinking of the AI safety community and the parts of it interested in cybersecurity. I had a lot of discussions with people around the funding ecosystem/government AI safety-interested people/AI policy thinktanks around a year ago about the merits of attempting SL5 and never heard a mention of Anthropic’s RSP specifically, although it seems plausible it was a contributing factor for decisions to pursue that direction.
I was thinking here of talking to various people in EA leadership and various other people working in AI safety research (including some people at labs).
I did go to the AI security forum 2 years ago or so and the vibe I got was that people also made some bad strategic decisions at the time vaguely related to this, but people seemed better calibrated than the people who didn’t have any security interests.
(See my other comment for thoughts on various individual sections in this post)
Some broader reflections on the overall RSP situation:
About 3 years ago, when the broader AI-safety community had to make a crucial choice about how to relate to government regulation, how to relate to AI capability companies, and how to relate to AI models really looking like they might soon be very competent, there was a big debate about how we should be thinking about regulation and the associated incentives on both governments and companies.
Many people (me included) said that what we should do is to convince policy-makers that progress is already too fast, these systems might soon be very dangerous, and the top priority should be to directly intervene by slowing down AI capabilities progress. Such regulation would be centered around limiting how much compute individual actors or frontier companies could use for model training runs, as those are the most obvious correlates of risk, or other regulation that would directly and immediately impact operations of frontier companies.
Many others said that it was not the right time to start advocating for a slow down or pause to policymakers. Instead we should centrally focus on getting people to make conditional policy commitments. Current policymakers and frontier company employees are not sold that future AI systems will pose a risk, but that’s fine! We simply need them to agree to make if-then-commitments, where if certain risk thresholds or capabilities are met, then they would commit to slowing down.
Different people went different paths, but most of the ecosystem’s resources went into the latter kind of plan, with the two central pillars being to start and invest in evaluation companies like METR and Apollo (to develop evaluations and capability measurements that could provide the ifs), and work at companies or within governments to develop commitments (the thens) based on the evaluations.
I think we should update very substantially against the conditional policy commitment plan. No company or country signed on to if-then commitments, and indeed, the few that did anything like it regretted doing so (as this post illustrates). There are no clear capability evaluations, and the appetite for conditional risk regulation has been substantially less than the appetite for direct risk regulation (or compute thresholding, which doesn’t require any complicated eval infrastructure).
This is a huge deal! This was, as far as I can tell, the single decision that most affected talent allocation in the whole AI safety community. METR and Apollo and the broader “evals” agenda became the most popular and highest-prestige thing to work on for people in AI safety.
I think this post marks a great time for people to reassess their work and whether they should switch to the other branch and advocate for direct and immediately acting policy, whose basis is not uncertainty about whether systems will eventually pose risks, and which therefore does not sit inert until some future trigger sets in. If-then commitments are dead. There are no ifs, there are also no thens.
My guess is many people will disagree with both the history of what I am saying here (this kind of stuff is often tricky and different people have different social experiences), and also disagree that we should make much of a policy update here. I would love to chat with you! I do think that the whole “let’s focus on conditional policy commitments” effort has been a huge waste of resources, and I wish we had never done it, and I would like us to stop sooner rather than later.
For the record, I think your perspective on RSPs aged better than mine. I plan on writing more about how I was wrong and why.
(I don’t agree with significant aspects/implications of the comment I’m responding to, though I also think it contains important true claims; I’m just making a claim about your takes in the linked post aging better than my takes.)
How I think about METR’s theory of change:
General principles:
- avoid world being taken by surprise by AI catastrophe
- improve knowledge / understanding / science of assessing risk from AI systems
- independent/trustworthy/truthseeking/minimally-conflicted expert org existing is good—can advise world, be a counterbalance to AI companies; a nonprofit has slightly different affordances than govt here.
Strategy:
- try to continually answer question of “how dangerous are current / near-future AI systems”, and do research to be able to keep answering that question as well as possible
- be boring and neutral and straightforward, aim to explain not persuade
Some specific impact stories:
- at some point in future political willingness may be much higher, help channel that into more informed and helpful response
- independent technical review and redteaming of alignment + other mitigations, find issues companies have missed
- increase likelihood that misalignment incidents or other ‘warning shots’ are shared/publicized and analyzed well
I think that broad ToC has been pretty constant throughout METR’s existence, but my memory is not great so I wouldn’t be that surprised if I was framing it pretty differently in the past and e.g. emphasizing conditional commitments more highly.
This is a bit of a nit, but I don’t think METR has consumed that much of the “community resources”, especially of more experienced technical talent—I think only around two or three of the current employees at METR were working in full-time roles on technical AI safety before they joined METR. This is a thing I care about and track in hiring—I don’t want to pull people away from doing other good work.
(Edited to add the italics, prev claim was overstatement)
(Second edit: although I still agree with the claim that we haven’t had a large negative impact on talent availability in technical AIS)
Don’t think I agree. When I scan down the staff, I recognize about half the names as having been around the AI safety scene for 4-8 years, either working on projects or seeking projects. You, Painter, Cotra, Filan, Wijk, Becker, Chan, Kinniment, Jurkovic, Von Arx, Kwa, Dhaliwal, Harris, and Chen have all been part of the AI safety community a long time, and would likely be working on another related project if not for this. Perhaps more that I’m not as immediately familiar with.
Added: To cash that out for those who don’t know who these people are:
Beth was an alignment researcher at OpenAI in 2019, and I know cared about this much earlier than that.
Chris Painter’s LinkedIn shows he worked on “AI Safety via debate” at OpenAI in 2019.
Ajeya Cotra has been working in the AI part of OpenPhil/Cog since at least 2018.
Daniel Filan was part of Stuart Russell’s CHAI at UC Berkeley while he did his PhD there starting in 2016, and has hosted the AI X-Risk podcast since 2020.
Hjalmar Wijk was a MIRI Summer Fellow in 2019, also an FHI Summer Fellow that year, and I suspect was involved with FHI throughout his Oxford CS PhD in that time.
Joel Becker was AI Safety grantmaker with Manifund for 2 years before joining METR.
Lawrence Chan was also part of CHAI under Russell while doing his PhD there since 2018, then worked at Redwood and Alignment Research Center before METR.
Megan Kinniment was a summer research fellow at the FHI in 2020, did an AI project at the Center on Long-Term Risk that year, then continued to work between those two places before joining ARC.
Nikola Jurkovic was a research assistant on AI Safety projects at Harvard while a student there in 2023, did a bunch of work on the AI Safety Student Team, before joining METR.
Sydney Von Arx doesn’t have much online presence, but I know that Beth knows that Von Arx has been working on world-saving projects for a long time (e.g. she cofounded the Open-Phil-funded Atlas Fellowship) and definitely has oriented to Superintelligent AI as the most important thing in the world since her time as part of Stanford EA.
Thomas Kwa was a MIRI researcher in 2022.
Jasmine Dhaliwal was Open Philanthropy Chief of Staff for a year in 2023, then worked on FutureHouse, “A philanthropically-funded moonshot focused on building an AI Scientist”, before joining METR.
Kit Harris has been around the EA scene as long as I have, so at least a decade. He spent 7 years at Longview Philanthropy where amongst other things he “led grant investigations in artificial intelligence and biosecurity and laid the groundwork for new lines of work at Longview Philanthropy”.
Michael Chen was a research intern at Stuart Russell’s CHAI in 2022, and his METR profile says “Prior to joining METR, he contributed to research studying AI deception and hazardous knowledge in large language models.”
These are just the ones that I immediately recognized, I expect if I went through them all I’d find others have also been substantially involved (both professionally and personally) in the AI Safety scene prior to METR. And I count more than two or three people involved in technical AI safety in the above list.
I was thinking “was working FT on technical AIS before we hired them” more than “was around this space and might have done other AI safety things”—sorry if that was misleading.
1. You can count me although I also think I’m not central example of technical AIS work
2. Chris was mostly working on Alvea and policy stuff before METR, the debate thing was part-time contracting with me and not central example of technical AIS work
3. Ajeya—wasn’t necessarily counting grantmaking but that’s reasonable (also only joined METR very recently)
4. Daniel—was counting but I think not central example of FT TAIS work (also only joined METR very recently)
5. Hjalmar—hired partway through theoretical CS PhD, never had an FT AIS position I don’t think
6. Joel—pretty sure manifund grantmaking was not close to a FT position?
7. Lawrence—was counting
8. Megan—never had an FT AIS position I don’t think
9. Nikola—hired out of undergrad
10. Sydney—wasn’t counting as technical AIS
11. TKwa—not sure, was this FT position?
12. Jas—wasn’t counting as technical. Also I don’t think Future House counts as safety.
13. Kit—wasn’t counting as technical (he has math degree but I think fair to say the longview work is not central TAIS)
14. Michael—never had FT AIS position I don’t think
David Rein who you missed I think is actually the clearest example
More than one of these people were at least temporarily unusually low-opp-cost for personal reasons that I don’t want to go into here (similar in spirit to ‘health/location constraints made it hard for them to have other jobs’)
In my mind there’s a big contrast here vs e.g. Ant, which I think has a huge number of people with multiple years experience working on technical AIS.
E.g., people who I know off top of my head:
Jon Uesato, Jeff Wu, Jan Leike, Chris Olah, Daniel Ziegler, Sam McCandlish, Jared Kaplan, Catherine Olsson, Amanda Askell, Tom Henighan, Shan Carter, Jan Kirchner, Nat McAleese, Carroll Wainright, Todor Markov, Dan Mossing, Steven Bills, William Saunders, Danny Hernandez, Dave Orr, Steven McAleer (all multiple years experience at OAI and/or GDM working on safety teams)
Evan Hubinger, Sam Bowman, Sam Marks, Fabien Roger, Ethan Perez, Collin Burns, Akbir Khan, Tao Lin, Kshitij Sachan (previously working FT on safety in academia or nonprofits)
(I expect I’m wrong about ~2 people in those lists)
There are probably a similar number more I’m uncertain about or are non-central examples like the METR ones discussed above.
I agree with your assessment here; I don’t think METR has had a significant negative effect on the availability of talent in the technical AGI Safety ecosystem, and Anthropic has had a massive negative one. GDM Safety has probably had a moderate negative one, offset by many people preferring to live in London.
I see. Yes I think your previous claim was an overstatement.
I also share Habryka’s perspective, I’ve broadly not been sold on technical talent being vastly more important than non-technical talent since MIRI gave up on trying to actually solve the full alignment problem and Christiano stopped working on alignment theory, and I think that many of the people I listed have much more potential to do things that are good than most of the people you listed at Anthropic; but going into more detail on all that would take more time than seems worth it this afternoon.
FWIW I definitely don’t think technical talent is vastly more important; I just assumed that’s the resource that people would most think METR might be a large consumer of, given most of our roles are technical roles.
I think technical AI Safety work is among the less valuable kinds of work to do on the margin, so I definitely didn’t intend to constrain talent claims to technical AI safety. Indeed, generalist/entrepreneurial/communications talent seems a lot more valuable to me on the margin.
That said I agree that METR did not consume as much talent as Anthropic or OpenAI, and indeed many people went to work there to work on RSPs and similar if-then-commitment stuff, which didn’t pan out (and now my guess is they are very unlikely to leave). But METR + Apollo seem like the runner-ups right after the labs in terms of where people went to work (and at least at the time largely for if-then-commitment-like reasons).
Hm I think UKAISI at least is a lot larger than METR or Apollo?
If you’re focusing more on generalist/entrepreneurial/communications skillsets then e.g. CG has more of these people than METR, I think?
I think this sort of overstates the proportion of effort that went into that kind of work. There was also a lot of work that aimed to develop techniques that reduce or improve understanding of misalignment risk (e.g. Redwood’s stuff).
IMO both METR and Apollo substantially pivoted away from the strategy you’re describing here at least a year ago.
I think “most” is roughly accurate (like IDK, my sense is around 60% of talent + funding was reallocated to plans of that kind). I agree that other people kept doing different things!
I do think there aren’t that many places that do work around reducing or understanding misalignment risk, especially outside of the labs (which I am excluding here).
I am honestly confused what METR’s current theory of impact is.
It seems most effort is going into things like the time horizon evaluations, but it’s not super clear how this translates into the world getting better (though I am generally of the school that helping people understand what is going on will make things better, even if you can’t specify exactly how, so I do think it’s good).
I have been appreciative of METR staff being more public and calling directly for regulations/awareness of the risks, but things still haven’t come together for me in a coherent way; in as much as METR “pivoted”, I am not quite sure what it has pivoted to.
FWIW, my interpretation of what we should be learning is pretty different here.
I would broadly say that political will for anything in the “slow down AI as needed to make it safe” category has been well short of what many people (such as myself) hoped for. Because of this, some of the core founding hopes of the RSP project look untenable now (although I don’t consider the matter totally settled); but to me it feels like an even bigger update away from “pause now” movements.
I don’t understand why you say this: “the appetite for conditional risk regulation has been substantially less than the appetite for direct risk regulation (or compute thresholding, which doesn’t require any complicated eval infrastructure).” I have not seen roughly any appetite for “compute thresholding” if that means something like “limiting the size of training runs” (I have seen “compute thresholding” in the sense of “reporting requirements triggered by compute thresholds”). I don’t know what you mean by “direct risk regulation”, but if it means regulation aimed at slowing down AI immediately, I also have seen much less (roughly no) appetite/momentum for that, and more for regulation based around things like evals and if-then commitments.
Separately, with the benefit of hindsight, I think a global AI pause in 2023 would have been bad on the merits compared to, say, a pause around when the original RSP implied a pause should happen. The former, compared to the latter, would have meant losing a lot of opportunities for meaningful alignment research and more broadly for the world to learn important things relevant to AI safety, while having almost no marginal catastrophic risk reduction benefit AFAICT.
You may have a view like “2023 was the right time to pause, because it was politically tractable then, but postponing it ensured it would not remain politically tractable.” That would be a very different read from mine on the political situation.
> This is a huge deal! This was, as far as I can tell, the single decision that most affected talent allocation in the whole AI safety community. METR and Apollo and the broader “evals” agenda became the most popular and highest-prestige thing to work on for people in AI safety.
This seems off to me. First, the emphasis on evals predated the idea of if-then commitments, and I think attracted more resources at pretty much every point in time; evals have a variety of potential benefits that don’t rely on if-then commitments. Second, I don’t think most people who work on AI safety work on either of these.
Certainly! I am not saying we are in a world with enormous political buy-in for slowing down AI, though honestly, I think we are kind of close to where my median was in 2023, maybe a bit above (I never expected much buy-in).
Maybe there is some miscommunication here. I personally do not think that a short pause at any point in the last few years would have been useful on its own, and the primary reason why I would have supported one is because it would have made future pauses much more likely (though that of course depends on the implementation details). If you had asked me about ideal timing on when to halt capabilities progress I think I probably would have suggested capability levels roughly where I expect things to be at in early 2027 (which to be clear, 2 years ago I would have predicted to be more than 3 years away). Of course, we are nowhere near coordinating a global pause within the next year, and so the “stop as soon as possible” position seems right to me at this point.
But to be clear, I wasn’t talking about any kind of pause at all in my comment above, so this is all mostly a distraction from the topic at hand. The central policies that I was referring to when I was talking about “direct risk regulation” are things like:
Direct liability for harms caused by AI (which would have functioned as a pretty direct and immediate tax on AI development, conveniently a bit more leveraged on the future and more competent AI systems)
Datacenter construction moratoriums
Licensing regimes (these are tricky and my guess is would backfire for various reasons, but I do think the proposals I’ve seen were motivated by a “we need to directly intervene on how this technology is being developed right now”)
We’ve seen pretty specific proposals for at least the first two, with SB 1047 introducing a bunch of direct liability, and Bernie Sanders and other congresspeople advocating for datacenter moratoriums (in many cases downstream of IMO confused concerns about environmental harms, but mostly downstream of a general “AI is scary and bad” sentiment that seems reasonably calibrated to me, in as much as anything as incoherent as whatever is motivating these proposals could be called “reasonably calibrated”).
The EU AI act also seems to me to be closer to direct risk regulation than conditional risk regulation in that it directly affects the operation of companies as soon as it takes effect, and involves directly imposing requirements that all frontier model developers will have to adhere to, as opposed to triggering regulation after certain capability or misalignment thresholds are met.[1]
This stands in contrast to what I have perceived to be lack of any motion at all on if-then-commitments at either the state level or the US federal level, or even the AI company policy level. I have not heard any policy proposals that even pass a sniff-test for what an if-then-commitment at the policy level would look like, and having talked to many other people in policy about this, my sense is no one has actually figured out how to even start having something like a capability-evaluation (not to even talk about a misalignment-evaluation) hook into a policy-making apparatus, which I understand to be the central premise of what if-then-commitments were trying to be.
Just to reiterate again, none of the things I am saying here have much to do with pausing. I think the top priority of regulation was always to slow down things, ideally incrementally, until you have slowed things down so much that you are de-facto pausing. At no point in history did it seem feasible to me to coordinate a sudden pause in 2023, and while I think humanity would have been better off pausing then (by putting us into a much better position to pause in the future), I am absolutely not comparing “if-then-commitments” to “try to make a pause happen in 2023″ which strikes me as a weird strawman, and IMO would have been a waste of time for people to spend much of their efforts trying to make happen somehow.
Datacenter moratoriums, GPU taxation, extensive auditing requirements, partial nationalization, direct liability, GPU import tariffs, or any of the hundreds of tools that governments around the world have available to slow down AI progress are the kinds of things I mean, with the central measure of the success of the regulation being “how much are you successfully reducing AI capability growth rates right now, which are already clearly too high, and to what degree are you putting yourself into a position to reduce AI capability growth rates in the future”.
But even beyond that, I think the key thing that people in policy should be doing, and were largely not doing in 2024/2025 due to a focus on evals, if-then-commitments and attempts to influence government policy by focusing on frontier company internal policies, is to directly talk to policymakers and make the case for existential risk from AI. The conclusion that almost any policymaker who I’ve seen seriously grapple with this topic arrives at is that it is paramount to prevent the creation of artificial superintelligence. After that basic case is established, policymakers have many opinions and much motivation for many regulations that could achieve that. In the long run this does require international treaties and diplomacy, but there are many things we can do in the near term, like slowing down GPU investment, or various forms of indirect taxation of frontier companies in the forms of fines or liability or whatever.
I agree that emphasis on evals at e.g. METR predated the focus on RSPs and if-then-commitments, but that was also (if I remember correctly) before evals became one of the hottest things to work on in the broader AI safety ecosystem. My sense is the transition to “evals are the hot thing to work on” coincides closely with the transition to “RSPs are the hot thing to work on”, because indeed, the case for the two was pretty closely entwined. I am not overwhelmingly confident of this, but I remember many conversations with people who were thinking of joining METR and Apollo, and doing grant evaluations of both of those organizations, and in those conversations the RSP case seemed central to me.
After if-then-commitments and RSPs showed themselves to be an unpromising direction for policy interventions, focus among eval orgs (including METR) shifted back toward using evals to inform policymakers and the public about risk, but much of the talent and funding that flowed into those organizations was originally motivated by the RSP/if-then-commitment case (of course, the fact that there was a nearby BATNA if that plan didn’t work out certainly motivated people to work on it, and I am certainly glad that people considered how gracefully this kind of plan would fail when they decided to join METR and other eval orgs).
My understanding is that most safety efforts at Anthropic were oriented around the RSP in 2024/2025, and my current model is that almost 50% of the total safety talent in the ecosystem works at Anthropic. Beyond that, RSP development was (I think) the primary focus of the safety teams that did exist at both OpenAI and Google DeepMind in at least 2024. So I do think that most people who seriously work on AI safety worked on if-then-commitments and RSPs, or at the very least had their work prioritized centrally downstream of efforts to bring about RSPs and if-then-commitments (though I think the stronger claim holds: I do think the majority of the field’s efforts went into trying to make RSPs and if-then-commitments happen, not only that their work was structured by them).
Of course this is conditional on not including all generic post-training in “AI safety work” which, at least in terms of raw headcount, far exceeds the efforts going into the rest of AI safety. Work on post-training has certainly recruited away many previous high-quality contributors to AI safety, and is sometimes labeled “AI safety” but I think in most cases has little to do with what we are talking about. You might disagree, but I didn’t mean to imply that efforts into evals or if-then-commitments eclipsed broader post-training efforts.
With the exception of compute thresholds, which I think are the only thresholds that ever had much of a shot at being used as the basis for policy, and which I remember being explicitly contrasted with RSPs and if-then-commitments: the latter were presented as a way of doing regulation that is more sensitive to the risks, on the grounds that compute-threshold-based regulation would end up being too restrictive and therefore could not get buy-in.
Hm. I had thought you were pointing to something like “There isn’t actually going to be a pause in this environment triggered by if-then commitments” as the main update/vindication of interest; I was basically responding by pointing out that there also isn’t going to be (and IMO there was never a promising path to) a pause in this environment triggered by advocacy for immediate pausing.
Instead it seems like you’re doing something more like “comparing the overall impact of talking about pauses—or, more broadly, existential risks from AI—with the overall impact of talking about if-then commitments.” I think this is a much muddier comparison where there is less clearly a big update to be had.
I don’t think we have seen much traction on attempts to slow down AI in any way. Meanwhile, I do think that the framework of “test for dangerous capabilities and implement commensurate mitigations” has had quite a significant impact on company behavior in a way that does seem to set up many policy possibilities that would otherwise be rough (including much of what has already passed).
The comparison to “raise general awareness about risks of AI” (as opposed to “advocate for specific policies explicitly aimed at slowing down AI”) feels a bit harder to make—certainly I am, and long have been, positively disposed toward raising general awareness about risks of AI.
But I will probably leave that there as it seems like a pretty complex and tricky debate to have.
This doesn’t sound remotely right to me. I would say that the RSP has provided an organizing framework for a lot of safety work, but that’s different from something like “all of that safety work would make no sense if not for the RSP.”
Seems reasonable. I’ll leave a few quick clarifications.
The thing that I am comparing is “resources invested into advocating for direct regulation, and actions that would directly slow down AI development” vs. “resources invested into getting companies to adopt RSPs and get policies around RSPs passed”.
I think traction has been not great, but also not terrible. Honestly, I would have been confused if many very concrete useful things had passed by now, since buy-in takes a while to build, but the things that do seem to show motion seem not very RSP-flavored. I do currently think it’s pretty unlikely that regulation that does get passed has much of any grounding in if-then commitments or RSPs, and I am not sure what you are talking about with “set up many policy possibilities that would otherwise be rough”.
I agree and am deeply grateful for your work in the space, and your support of work in the space.
Hmm, I agree this comparison is tricky, and on reflection I think I overstated the ratios here. The RSP has been responsible for quite a lot of safety-adjacent work (including a lot of effort spent on cyber-security, various comms efforts, and the prioritization of various mitigations), but I agree that most of the safety-adjacent work at Anthropic is more driven by other risk models (and is IMO mostly downstream of beliefs around what the tractable parts of the general AI alignment problem are, and which aspects of alignment-oriented work are most helpful for commercialization), and RSP prioritization I think is probably responsible for something more like 15%–20% of the work at Anthropic.
Where do you see the latter appetite?
We campaigned a bit for a conditional treaty. We’d happily sign up for an unconditional pause though. Problem is: there is no appetite for either, right?
I agree that the manpower spent on evals should have been spent on other things with a better theory of change. Eval quality imo is not a crux for regulation, awareness and political support are. I think the money that went to evals should have gone to raising awareness and lobbying.
Honestly, why is there still no significant funding for awareness-raising projects? It’s so easy: just ask for the number of views/copies and the conversion rates measured via e.g. Prolific surveys, and fund the most effective projects. A fund like this could easily absorb millions. I think this might actually get regulation off the ground.
Since very little frontier regulation has passed, the best we have is watching various people in congress who have been concerned about risk speak, and there I’ve seen many more direct risk regulation proposals than conditional risk regulation proposals.
Also in as much as we had any success, establishing direct liability via something like SB1047 was the closest we got, which didn’t have any dependence on evals.
It seems pretty misleading to describe the shift away from unilateral pausing as a natural extension of the RSP being a living document. Of course people expected some changes to occur, but I think these changes were understandably expected to be more of type “adjusting mitigations as they learned more” and less of type “removing if-then commitments entirely.” Indeed, the “if-then” structure was the core safety motivation for RSPs as many understood it—in particular the idea that Anthropic would pause if danger exceeded some threshold—and it was heavily defended by Anthropic employees as such. I highly doubt most people would have predicted that Anthropic would drop this commitment later; I think doing so marks the breaking of a meaningful promise—something many people were relying on, making career decisions on the basis of, etc.
This new policy also seems even less safe to me. For instance, you describe some of the “wrong incentives” the RSP produced:
In other words, because a pause would seriously damage the company, there was pressure to misrepresent the risk. I think this should seriously call Anthropic’s ability to self-govern into question, yet Anthropic’s response is to commit themselves to even less. That is, the solution Anthropic is adopting to reduce such pressure, as I understand it, is simply remove the consequences: since the threat of pausing is gone, there is little incentive to pretend models are safer than they are. But Anthropic will continue to confront this pressure, since the tension between company success and safety concerns is only going to grow as AI becomes more powerful. And without the commitment to pause, Anthropic is free to deploy really unsafe models!
Likewise, the reason you give that pausing would have “little discernible public benefit” seems to be because measuring the risk turned out to be too difficult, such that strategies like “sounding the alarm” are less likely to work:
That is, since Anthropic cannot make strong guarantees about the risk one way or the other, it will be hard to get the world to rally around such equivocal evidence. But this is really bad! If it’s currently impossible to measure the risk well enough to get any kind of scientific consensus about its magnitude, it doesn’t just make global coordination hard, it makes it hard for any AI company to act safely, since they simply do not know what will or won’t be destructive. Yet Anthropic’s response to realizing the difficulty of defining red lines (and being able to tell whether they’ve been crossed) seems to be to do away with red lines altogether!
Which is to say that the situation as you’ve presented it seems strictly worse than the one Anthropic was imagining two years ago: we’re closer to AGI, but we have much less hope of accurately assessing the risk, and the political landscape is less favorable. Yet it seems like your proposal, in response to an overall more dangerous situation, is to be even more reckless. Instead of taking the ethical stance clearly outlined in the first RSP—“just because other language models pose a catastrophic risk does not mean it is acceptable for ours to”—Anthropic now only promises to be as safe as the least safe company. In light of this, there seems to be no commensurate push to try to shift political will toward coordinated pauses, no signing of public statements to the effect that Anthropic would pause, if everyone else did, or similar. As far as I can tell, the rationale for this centrally rests on fatalism:
In other words, the trajectory to AGI cannot be much influenced by Anthropic’s actions, as people are going to race toward it regardless. But my god, does a post which is fundamentally premised on the inevitability of this race do so little to grapple with it. Not once does this post mention the possibility of extinction, for example, as if the real stakes and the real casualties Anthropic might cause have been forgotten. Very little attention is given to whether the race to AGI is in fact inevitable, or if there might be something Anthropic—as a leading player in this race (!)—might be able to do about that. Nor is any mention made of the role Anthropic has played in shaping this unfortunate political landscape which they now report being so helplessly beholden to. What is the point of having a seat at the table, if one doesn’t use it to wield influence in situations like this?
You’re welcome to this view, but it isn’t mine. Other large projects I’ve worked on involved major, fundamental pivots that left them almost unrecognizable from the original pitch, and if you’d asked me at the time I would’ve said I expected something similar for RSPs. On priors, a completely new kind of risk management framework should be expected to change dramatically. I never would have agreed to “The policies we’re coming up with today will only change in the details.”
I’ve seen many claims along these lines, and my guess is that there is at least some truth to them, but I am somewhat surprised by how little I’ve seen in the way of tangible specifics re: who promised what and who made important plans and decisions based on this. I have mostly just seen the quote from Evan, which Evan claims is a mischaracterization. I genuinely can’t recall encountering this sort of thing firsthand (though I’ve only been at Anthropic for about a year). Again, I’m not saying such things didn’t happen, but I haven’t seen enough specifics to be affirmatively convinced by claims like this or to have a clear sense of who was responsible, who was harmed, etc.
I don’t think that feeling unhealthy pressures implies that there is a governance failure. For example, people regularly feel pressure to avoid admitting they were wrong—I don’t think this is a particular person’s fault or calls a particular governing structure into question. My statement here was about psychological pressures, not pressures imposed by e.g. executives.
I think the situation today is worse on the dimensions being discussed here (e.g., political will), although better on some other dimensions (IMO, the technical picture looks at least somewhat better).
I don’t accept the framing that stricter commitments are necessarily less “reckless” while more flexible frameworks are more “reckless.” I think the new policy better positions us to reduce risk worldwide. If pausing were the only or clearly best path to risk reduction, I think it would make more sense to associate greater flexibility with greater recklessness. Perhaps you believe that’s the case; I don’t.
I don’t believe a race is inevitable, and I think there may be things Anthropic can do to make a race less likely. But I think that those things are and will continue to be highly contingent on specific circumstances, and I also believe there are non-slowdown-oriented actions that can reduce risk significantly (and potentially more than slowdown-oriented actions, depending on the circumstances).
There is an important difference between “a project turned out to be totally different from the initial vision” and “a project made safety commitments, and then violated those commitments.” There are many kinds of changes that I have no objection to, but walking back safety commitments isn’t one of them.
I think it would make a nontrivial contribution to that ‘strong and broad political will’ if Dario were to come out and say “actually, sorry about all that deliberate Overton-window-closing I did in previous writings. In fact, political will is not a totally exogenous oh-well thing, but it is the responsibility of frontier developers to inculcate that political will by telling the public that a pause is possible and desirable, instead of a dumb lame thing not even worth considering. So now we’re saying loud and clear: a pause is possible and desirable, and the world should work toward it as a Plan A!”
I’m being deliberately cartoonish here, but you get the point. If incentives are forcing Anthropic to abandon things that are good for human survival––which occurrence was, no offense, completely obvious from day one––Anthropic should be screaming from the rooftops, Help!! Incentives are forcing us to abandon things that are good for human survival!!
If this is a crux for you––if you/Anthropic think a pause is so undesirable/unlikely that it’s important for the safety of the human race to publicly disparage the possibility of a pause (as Dario opens many of his essays by doing)––please say so! Otherwise, this lily-livered, disingenuous, “oh no, the incentives! it’s a shame incentives can never be changed!” moping will give us all an undignified death.
To be clear, I’m not actually mad about the weakening of the RSP; that was priced in. I suppose I’m glad it’s stated, in case there were still naïfs who thought A Good Guy With An AI could save us. It’s far more virtuous than outright lying, as every other company (to my knowledge) does (more of).
Also, although you seemed to try to answer “What is the point of making commitments if you can revise them any time?”, you really just replied “Well, actually these commitments were inconvenient to revise, and in fact they should be more convenient to revise, albeit not arbitrarily convenient.” Forgive me if I am not reassured!
I respect your work a lot, Holden. You’ve done great things for humanity. Please don’t lose the forest for the trees.
The main thing that saddened me about this post isn’t Anthropic breaking and weakening its commitments—that was expected to happen. It’s that Holden seems to be adopting the same shirking-of-responsibility stance on government regulations that Dario has been taking for a while.
Dario is due to become a billionaire when Anthropic IPOs, and Holden too de iure uxoris. I understand that LessWrong has a strong “play the ball not the man” culture, but it’s truly astonishing that people can write walls of text in response to this decision to weaken its RSPs and not point out the obvious financial conflicts of interest. For that matter, it’s also astonishing to me that apparently sophisticated EA-aligned early investors in Anthropic believed Dario when he told them that Anthropic would not advance the AI frontier!
I really wish this wasn’t a single sentence buried two-thirds of the way into this document. This seems extraordinarily important to repeat as loudly and often as possible. If we cannot leave this up to the companies, what actions are you taking (and, as a separate question, what actions are Anthropic taking) to improve the chances that we don’t do this?
Uh, doing RSP v3?[1] The stuff in the new RSP seems like exactly the kind of great actions a company can take to promote third-party auditing ecosystems and external legislation: have a bunch of transparency about the risks posed by one’s systems, outline what kinds of mitigations would be needed at an industry-wide level to keep risks low, solicit robust third-party evaluators to be in a position to evaluate AI companies’ safety cases and publicly remark on their adequacy, etc.
Is there anyone following the output of Anthropic that thinks the company’s stance is “leaving AI development up to whatever the companies feel like will definitely go well”? I really feel like this is not something that has been particularly ambiguous!
TBC I think you can totally argue for specific ways Anthropic should do more of this, or pursue it in more effective ways, but I think it’s incredibly obvious that they are doing quite a lot to try and facilitate a world in which AI companies have meaningful checks on their behavior.
And supporting regulation, funding PACs to support regulation, piloting third party audits of safety cases, opposing state regulation moratoriums, …
First off, thanks for providing that list. I appreciate it. I do disagree with your last sentence, and I’ll write out why.
There are a couple of assumptions laid out in my stance here which I ought to make explicit. These assumptions are:
If the race to ASI is not stopped, there is an unacceptable chance that it gets us all killed. Anything that cannot and will not stop it is insufficient.
Anthropic will not voluntarily decide to do this in the absence of a binding requirement, and will not actively advocate for this to be done either.
Thus, my TL;DR is: Having a bunch of voluntary ways of gathering information about the risk of AI systems is not actually going to stop them from rushing headlong into danger. I don’t think that Anthropic is facilitating a world where AI companies have meaningful checks on their behavior, because I don’t think Anthropic views any of these requirements as “meaningful checks”. It doesn’t stop them doing the one thing they most want to do—continue to train and deploy ever more powerful models that bring us closer to truly dangerous territory.
None of these criticisms are unique to Anthropic—they apply to all the frontier AI companies, but I don’t think Anthropic is doing meaningfully better on addressing this than anyone else, in the sense of being considerably more likely to break my assumptions above than anyone else. There are several ways Anthropic is unusually responsible in this space, such as Claude’s constitution, but I do not consider them as significant to the above assumptions, which are by far the most important. I know I’m hammering on about this a lot, but my views probably don’t make a lot of sense without keeping this in mind.
TL;DR ends here.
With that in mind, we can now take a look at the above items through this lens:
Promote third-party auditing systems / solicit third-party evaluators: This still leaves the decision up to Anthropic. Having this seems better than not having it, but the way I read Holden’s statement is “If the AI companies get to make the call, they are unacceptably likely to get it wrong”. I agree with that statement. Having third-party auditors doesn’t change who makes the call. In my view of the world, it doesn’t even really give us a saving throw—I do not imagine a situation where METR / Apollo / UK AISI tell Anthropic “This model is dangerous, do not deploy it under any circumstances” and Anthropic actually listens and avoids deploying it. Having third-party evaluators is great for Anthropic, as they get useful information about model capabilities and appear to take safety seriously, but they are never actually compelled to make costly decisions at any point.
Having a bunch of transparency about the risks: Similar to above, except this time it’s not even a third-party auditor so you have an additional step of Anthropic needing to say out loud that something is unacceptably dangerous before you reach the step of them choosing whether or not to act. It’s in the same arena. Supporting SB 53 falls under this category.
Outline what kind of mitigations would be needed at an industry-wide level: Outlining it is not the same as doing it. I think that a mitigation that involves delaying a new model on the order of months (or, God forbid, not training a new one at all) will be prohibitively expensive and promptly abandoned when the reality sets in that this is the choice. And a mitigation that never leads to this choice at all is not going to be enough.
Funding PACs to support regulation: What does Anthropic themselves say this does? Here is a direct quote: “In circumstances like these, we need good policy: flexible regulation that allows us to reap the benefits of AI, keep the risks in check, and keep America ahead in the AI race.”
Under my own assumptions, that I’ve mentioned above, this can be read as:
Flexible: Please don’t bind us in advance to making costly decisions.
Reap the benefits of AI: Let us have market share.
Keep America ahead in the AI race: Let us have market share and more chips. There is also very much the worry of authoritarian governments in there, but certainly “Keep America ahead in the AI race” is not the kind of rhetoric that helps stop the AI race.
Keep the risks in check: Let’s look at the next sentence for that one. What do they say this means?
“That means keeping critical AI technology out of the hands of America’s adversaries, maintaining meaningful safeguards, promoting job growth, protecting children, and demanding real transparency from the companies building the most powerful AI models.”
Maintaining meaningful safeguards: Human misuse is the problem. We implicitly dodge the idea that it might be the AI system itself that is inherently unsafe.
Promoting job growth: I assume this means using AI for productivity, aka, help advance adoption of our products.
Protecting children: Avoid CSAM. Straightforwardly good, but doesn’t meaningfully impact the race to ASI.
Demand real transparency: See the above section on transparency.
Adding this all up, I don’t think this makes any ask that would risk binding them to the kind of costly decisions they want to avoid, which are the same costly decisions that could actually prevent rushing to ASI as fast as possible. (Maybe these actions slow things down a little on the margin—after all, non-zero resources are spent on them! But I don’t see it as making a meaningful difference)
Opposing state moratoriums is a straightforwardly positive action and I think Anthropic is doing the right thing by doing this. I appreciate this, but I do not think it is enough to prevent the outcomes I’m most worried about. From my point of view, approximately none of this is useful to the core problem of “Humanity is racing to unacceptably dangerous ASI as fast as possible”. And if it doesn’t address the core problem, it’s not a meaningful check. Thus, I don’t think Anthropic is doing quite a lot on the one axis that really matters, and this is why I disagree.
It feels pretty ambiguous to me – I realize Anthropic does some stuff consistent with saying the opposite of this, but, the public comms around it (including Dario’s recent Thinkpiece and the SB 1047 comments) seem quite cagey, and just really don’t look like the comms I would expect from an org who was really all that worried about leaving it up to the companies.
The charitable interpretations I can imagine are:
a) “Dario thinks the risk from most ways the government might intervene heavily are worse than the risks of just leaving it up to the companies as long as possible.” (Which is plausible to me but I think still counts pretty clearly as “ambiguous” given that in practice he opposes serious checks on companies at the moment)
b) “Dario thinks there is some secret 3D chess he should be doing where he conserves weirdness points until later”.
The uncharitable interpretations I can think of are:
c) “Most of Anthropic’s efforts here are better thought of as PR campaigns and elaborate virtue signaling”,
d) “Dario explicitly believes charitable interpretation (a) (or I guess (b)) above, but is also extremely biased/deluded/has poor judgment when it comes to implementing it.”
(or, like, a mix of all 4)
I think it’s reasonable to argue it is “at least ambiguous, as opposed to overdeterminedly fake”, but, the prior on companies being fake, misleading, deluded here is just really high, yes even when they have a missionary vibe.
Re b) - could Dario have an altruistic incentive to promote the company’s success, and specifically the road to IPO, given the 80% pledges and the employee donation matching and everything? Claude suggests, back of the envelope, that the donations might represent something like roughly an order of magnitude increase in yearly spending on AI safety compared to right now. Maybe there’s a frame like: the transformative impact of that money makes making hedged public statements about AI risk, and softening some of the company’s stances to be more business-compatible (to increase the odds of a good IPO) not seem so bad?
I don’t know if I actually endorse this. I don’t know the actual cause allocation the donors have planned. And if I were holding Anthropic equity and using this line of reasoning to help make decisions, I’d be worried about the conflict of interest biasing my reasoning. But it’s an interpretation that sticks out to me.
I absolutely agree Anthropic public comms and revealed preferences are far from maximally “there should be extremely strong regulation passed right now, don’t give companies any leeway”. I think it’s super reasonable to say “I think the correct point on the spectrum of leaving decisionmaking up to companies and/or future legislation versus current policy is a more heavily-regulated one than what Anthropic appears to be going for, and they should advocate for X and Y instead of Z”.
I just think it’s very clear that the point Anthropic lies at on this spectrum is clearly on the side of “the status quo poses unacceptable risks and should be more regulated”, as contrasted with a bunch of other actors. Like TBC I am not trying to stake out the claim “Anthropic’s policy views are optimal” or whatever, I don’t think they are and I personally would prefer somewhat more pause-flavored rhetoric from Anthropic, I just think it’s silly to be like “this one sentence in a blog post is the only time Anthropic has signaled that there might be something at all concerning about leaving it up to the companies, when will they do anything else to help”.
It’s not just any blog post. It’s a blog post outlining a major new strategic shift in the company, specifically in the direction of giving Anthropic far more leeway over how they decide what the risk is and how to deal with it. It seems especially important to state “we can’t leave this up to the companies” loudly and clearly here.
I’m unhappy about this. The reasons to be unhappy are obvious and there’s no interesting comment to write, so I’ll just leave this boring comment instead.
The problem I have with this argument is that “cautious” and “incautious” are not some kind of fundamental/essential quality an AI developer has at their core. They are contingent qualities that are assigned based on the actions of the developer, and defined relative to the context of the situation. So if a relatively cautious AI developer A “releases themselves from the mast” and tries to keep up with a relatively incautious AI developer B who is moving forward as fast as they can—presumably by themselves also moving forward as fast as they can—then I think it is reasonable to no longer call developer A cautious.
On the object-level changes, I think it’s better to have at least one company committed to demonstrating a positively low risk level for their AI systems and making positive claims about their mitigations being effective than to have no companies that do so. If no companies do so, this creates the idea that it is “industry standard practice” not to require positive proof that risks are contained before releasing a model, which would seem to be the opposite of Anthropic’s stated goals. That is to say, I find the changes very unfortunate, especially as model capabilities increase toward ever more dangerous levels.
I think countering this is basically the point of the recommendations for industry-wide safety? Part of the structure of RSP v3 is to clearly say, when diverging from the practices that would provide a positively low risk level, “this practice poses unacceptably high catastrophic risk, and it would be better if the industry were to collectively do this other thing instead”.
Publishing guidelines that you don’t follow because (some complicated sociopolitical hedging/reasoning goes here which amounts to “well, sir, everyone was speeding last night”) makes it very easy for those opposed to the guidelines to accuse you of hypocrisy, “do as I say not as I do” etc. I therefore have a low estimate on the impact of the industry-standard recommendations. Basically, it’s hard to preach veganism successfully as a non-vegan.
One thing I don’t understand about your claims here:
Isn’t this literally what the EU Code of Practice requires though?
Eg Measure 4.2 of the Code of Practice requires that companies “will only proceed with the development, the making available on the market, and/or the use of the model, if the systemic risks stemming from the model are determined to be acceptable”.
Note that the literal way you “determine if systemic risks are acceptable” is to “use RSP methodology”. Just because it’s in EU-speak doesn’t mean it’s not regulation. Every major developer has said they’ll comply with the Code (even xAI!!).
Similarly, SB-53 and RAISE also require developers to have safety frameworks and follow them. Anthropic is the first company to change its RSP such that “following it” does not mean binding itself to commitments.
What’s the delta between this regulation and the kind of regulation that would have prevented you from dropping your commitments?
(To be clear, I’m not convinced any of this is a mistake on Anthropic’s part, I just don’t understand the claim about regulation).
This analysis feels object oriented rather than process oriented. I think object oriented people are easily manipulated by process oriented people via which objects are paid attention to.
>I do not believe that any frontier AI company will actually unilaterally pause or slow AI development (by a significant amount) on the basis of this sort of policy, so I think the downsides of their admitting as much are limited
>The RSP revision is not Anthropic saying “We will go ahead with AI development and deployment regardless of risk profile.” It’s removing unilateral commitments and giving more flexibility.
I am having a hard time figuring out a way to ask the question I want to ask in a way that doesn’t sound confrontational or adversarial. And yet it’s a question that I would really like answered clearly and directly, so here’s my attempt:
Does this mean you believe Anthropic will go ahead with AI development and deployment regardless of risk profile, rather than unilaterally pausing or slowing AI development by a significant amount, but that you believe the RSP revision does not amount to Anthropic being willing to say that this is the case? Is Anthropic willing to say that this is the case elsewhere, in a publication which is not this RSP revision?
If Anthropic at any point observes evidence that the risk profile looks much worse than previously anticipated, should we expect Anthropic to inform the public and alignment researcher community about this observation?
Not Holden, nor a person with any particular decisionmaking power or spokespersonship around Anthropic’s actions, but my view on these questions is:
Does this mean you believe Anthropic will go ahead with AI development and deployment regardless of risk profile, rather than unilaterally pausing or slowing AI development by a significant amount, but that you believe the RSP revision does not amount to Anthropic being willing to say that this is the case?
I agree with Holden on the first of your two quotes in the sense that I don’t think frontier AI companies are likely to unilaterally pause AI development solely on the basis of RSP v1.0-style policies, i.e. when, absent that policy, they would not otherwise think that such a pause was actually a good idea on consequentialist grounds. Like, in the worlds where they pause, it’s because they’d have wanted to anyway, for the same reasons that made them want to write the policy in the first place.
I do think that there are scenarios in which Anthropic decides to unilaterally pause or significantly slow AI development on the basis of just thinking that doing so is a good idea on the object level (see eg the first competitor commitment in RSP v3 for an example of such a scenario). My guess is that these scenarios are not super likely, mainly because I don’t expect the gameboard to be such that doing so will actually prove to be a good idea, but I think there’s a decent chance.
If Anthropic at any point observes evidence that the risk profile looks much worse than previously anticipated, should we expect Anthropic to inform the public and alignment researcher community about this observation?
Personally I do expect this, subject to a few caveats like “evidence which relies on very competitively sensitive info whose disclosure would itself be acceleratory” or “evidence which is exfohazardous to share, like the existence of an easy to implement strategy which enables lots of catastrophic misuse”. Although I expect that in cases of particularly consequential and decision-relevant evidence it would likely be shared anyway while trying to handle those costs as well as possible.
In particular, I know of no cases in the past where material negative evidence about alignment risk was withheld for reasons other than the sort of considerations above (and I don’t believe there is any bias in the directionality of such withholding; e.g. I can think of at least one case in which pretty reassuring evidence about an alignment-relevant question, which made Anthropic’s safety practices look better, was not widely disclosed for reasons like these).
Also I think there are easily a dozen people who’d just quit and whistleblow if this kind of thing were held back to avoid spurring AI regulation or whatever, so it’d be super hard to do this even if Anthropic wanted to.
but it was always obvious, from the very beginning, that a pause would not seem beneficial on consequentialist grounds in the middle of a race dynamic scenario
that was understood explicitly and discussed explicitly
the name of the game was always: how will we avoid contributing to race dynamics, even when it is certain we will be under extremely strong pressures to contribute to race dynamics.
the whole point of commitments is to prevent the gameboard from ending up in a state where activating your precommitment actions is a bad idea, by ensuring that all parties are aware you are going to stick by your commitments even if they are a bad idea
the analogy to MAD seems really obvious here tbh. the RSP v3.0 update seems to me a bit like if a cold war power were to launch a first strike, and then the other cold war power decided not to retaliate because retaliation was purely harmful on net without actually helping in any way conditional on the first-strike having already occurred
the whole point of your commitment was to try to prevent that first-strike from happening in the first place
i’m not sure if i believe that these race dynamics wouldn’t have occurred if anthropic’s original commitment had been credible. but now it’s clear that it wasn’t credible, as acknowledged by all involved such as the OP. you seem to be arguing that this means the original commitment would not have worked even if it were credible, and this is not strictly true.
My reaction to the announcement is:
I really like the “Updating our Responsible Scaling Policy” section. Focusing on unilateral if-then commitments turned out to be a bad approach, but this is a great approach. (I haven’t read the new RSP, safety roadmap, or risk report; I don’t have takes on object-level goodness.)
My notes on the RSP:
I like the shift to conditional risk-based commitments over unilateral ones. Unilateral absolute-risk based commitments were never decision-theoretically or morally optimal IMO.
I also like that there’s a lot of built-in transparency, it feels like keeping the public informed was a core design principle and I appreciate it. I like much of the roadmap, and I especially like the details of upholding Claude’s Constitution.
The RSP requires risk reports to include a risk-benefit determination, and says the CEO and RSO make the ultimate call on deployment. But it doesn’t say the CEO and RSO must decide *based* on that determination, leaving open the possibility that a risk report finds the risks outweigh the benefits, and the CEO and RSO move forward anyway. I’m also wary of the implied “benefits outweigh the risks” criterion and would be more comfortable with something like “limit marginal risk to a negligible amount.” I’d want to know more about how the CEO and RSO intend to make these decisions, and I think that should be highlighted clearly.
An equivalent to ASL-5 / a superintelligence threshold seems to also be completely missing. Does Anthropic believe the effects of arbitrarily advanced AI are captured by “automated R&D”?
In section 1, the “Mitigations—our plan as a company” column and the “Mitigations—ambitious industry-wide recommendations” column use different ontologies, and are hard to compare. What bad things should we expect to happen by default, if only the current plans are achieved? For instance, re “Novel chemical/biological weapons”: do I understand correctly that we should actively expect “catastrophic damages far beyond … COVID-19” unless other AI labs implement better mitigations than Anthropic is currently planning to by the time they reach this threshold? Do the mitigations in the roadmap, such as the “moonshot R&D for security” projects, cover “roughly in line with RAND SL4”?
I’m a bit fuzzy on training and deployment decisions between risk reports. It’s unclear whether risk reports include proactive assessment of hypothetical next-gen models, whether that covers internally deployed models that get a discussion 30 days after the fact, and whether the board and LTBT reliably get a say when a deployment’s risk analysis is marginal-based.
Will we find out when Anthropic actually chooses to delay training or deployment? If so we could credit them if they do, and otherwise know that this possibility hasn’t yet materialized.
As acknowledged in the RSP, the under-specified thresholds are a significant weakness.
This is a lightly edited version of my tweet thread
A thought about the section here where you describe your more current model for the ‘race to the top’:
Is there any suggestion that frontier labs other than Anthropic are likely to adopt RSPs before some catastrophic bad thing happens, if it does? This premise is the basis for the new model you write out here. And if they do, what makes you optimistic that the same employees who are currently advancing capabilities as fast as they can at other labs will decide to reprioritize safety to a greater extent than they do currently? Or that safety-focused employees wouldn’t be pushed out of certain other frontier LLM developers? It’s something we already seem to be seeing happen at some places, not to name names.
I have trouble finding the ‘race to the top’ in terms of safety/alignment credible when far more forces, whether investors or current government stakeholders, seem to be encouraging a ‘move as fast as possible no matter what’ regime by their words and actions. We saw it recently with the Anthropic-Pentagon dispute; most other labs do not seem especially committed to their stated principles in the face of political pressure or economic forces.
They already do! OpenAI’s Preparedness Framework, GDM’s Frontier Safety Framework, Meta’s Frontier AI Framework, xAI’s Frontier AI Intelligence Framework.
Are they all equally detailed, well-modeled, and followed through on? Absolutely not. But they do exist, and I think putting pressure on companies to (1) abide by their commitments and (2) make those commitments better is pretty impactful.
And the race to the top pressure isn’t all unidirectional, eg here’s Noam Brown from OpenAI calling out Anthropic for having insufficiently good AI R&D ruleout evals in their RSP determination process.
Now we just need to know the analogous policies at Baidu, Alibaba, Zhipu…
Holden discusses the value of Risk Reports for transparency and external accountability. But there’s an operational security dimension to this that I think the thread has missed entirely — publishing detection methods, capability thresholds, and escalation criteria creates training material for the very systems these frameworks are meant to govern. I’ve written up the argument here: Operational security failure in RSP v3