This is false. Our ASL-4 thresholds are clearly specified in the current RSP—see “CBRN-4” and “AI R&D-4”. We evaluated Claude Opus 4 for both of these thresholds prior to release and found that the model was not ASL-4. All of these evaluations are detailed in the Claude 4 system card.
I wrote the article Mikhail referenced and wanted to clarify some things.
The thresholds are specified, but the original commitment says, “We commit to define ASL-4 evaluations before we first train ASL-3 models (i.e. before continuing training beyond when ASL-3 evaluations are triggered). Similarly, we commit to define ASL-5 evaluations before training ASL-4 models, and so forth,” and, regarding ASL-4, “Capabilities and warning sign evaluations defined before training ASL-3 models.”
The latest RSP says this of CBRN-4 Required Safeguards, “We expect this threshold will require the ASL-4 Deployment and Security Standards. We plan to add more information about what those entail in a future update.”
Additionally, AI R&D 4 (confusingly) corresponds to ASL-3 and AI R&D 5 corresponds to ASL-4. This is what the latest RSP says about AI R&D 5 Required Safeguards, “At minimum, the ASL-4 Security Standard (which would protect against model-weight theft by state-level adversaries) is required, although we expect a higher security standard may be required. As with AI R&D-4, we also expect an affirmative case will be required.”
I agree that the current thresholds and terminology are confusing, but it is definitely not the case that we just dropped ASL-4. Both CBRN-4 and AI R&D-4 are thresholds that we have not yet reached, that would mandate further protections, and that we actively evaluated for and ruled out in Claude Opus 4.
AFAICT, now that ASL-3 has been implemented, the upcoming AI R&D threshold, AI R&D-4, would not mandate any further security or deployment protections. It only requires ASL-3. However, it would require an affirmative safety case concerning misalignment.
I assume this is what you meant by “further protections,” but I just wanted to point this out for others, because I do think one might read this comment and expect AI R&D 4 to require ASL-4. It doesn’t.
I am quite worried about misuse when we hit AI R&D 4 (perhaps even more so than I’m worried about misalignment), and if I understand the policy correctly, no further protections against misuse are mandated at that point.
Not meaning to imply that Anthropic has dropped ASL-4! Just wanted to call out that this does represent a change from the Sept. 2023 RSP.
Regardless, it seems like Anthropic is walking back its previous promise: “We have decided not to maintain a commitment to define ASL-N+1 evaluations by the time we develop ASL-N models.” The stance Anthropic takes toward its commitments (things which can be changed later if it sees fit) seems to cheapen the term, and makes me skeptical that the policy, as a whole, will be upheld. If people want to orient to the RSP as a provisional intent to act responsibly, that seems appropriate. But it should not be mistaken for, nor conflated with, a real promise to do what was said.
Oops. Thank you and apologies.
FYI, I was (and remain to this day) confused by AI R&D 4 being called an “ASL-4” threshold. AFAICT as an outsider, ASL-4 refers to a set of deployment and security standards that are now triggered by dangerous capability thresholds, and confusingly, AI R&D 4 corresponds to the ASL-3 standard.
AI R&D 5, on the other hand, corresponds to ASL-4, but only on the security side (nothing is said about the deployment side, which matters quite a bit given that Anthropic includes internal deployment here and AI R&D 5 will be very tempting to deploy internally).
I’m also confused because the content of both AI R&D 4 and AI R&D 5 is seemingly identical to the content of the nearest upcoming threshold in the October 2024 policy (which I took to be the ASL-3 threshold).

[Image: a rough sketch of what I think happened]

[Image: a rough sketch of my understanding of the current policy]
When I squint hard enough at this for a while, I think I can kind of see the logic: the model likely to trigger the CBRN threshold requiring ASL-3 seems quite close, whereas we might be further from the very high threshold that was the October AI R&D threshold (now AI R&D 4). So the October AI R&D threshold was just bumped to the next level (and the one after that, since causing dramatic scaling of effective compute is even harder than being an entry-level remote worker… maybe), with some confidence that we were still somewhat far from it, and thus it can be treated effectively as today’s upcoming, to-be-defined threshold (what would have been called the N+1 threshold).
I just get lost when we call it an ASL-4 threshold (it’s not; it’s an ASL-3 threshold). It also mostly makes me sad that these thresholds are so high, because I want Anthropic to get some practice reps implementing the RSP before it’s suddenly hit with an endless supply of fully automated remote workers (plausibly the next threshold, AI R&D 4, which would require nothing more than the deployment and security standards Anthropic already has in place as of today).
I wish today’s AI R&D 4 threshold had been set at what, in the October policy, was called a “checkpoint” on the way to ASL-3: completing 2-8 hour SWE tasks. It looks like we’re about there, and it also looks like we’re about at CBRN-3, and ASL-3 seems like a reasonable set of precautions for both milestones. I do not think ASL-3 will be appropriate when we truly get endless parallelized drop-in Anthropic researchers, even if they have not yet been shown to dramatically increase the rate of effective scaling.