This is false. Our ASL-4 thresholds are clearly specified in the current RSP—see “CBRN-4” and “AI R&D-4”. We evaluated Claude Opus 4 for both of these thresholds prior to release and found that the model was not ASL-4. All of these evaluations are detailed in the Claude 4 system card.
I wrote the article Mikhail referenced and wanted to clarify some things.
The thresholds are specified, but the original commitment says, “We commit to define ASL-4 evaluations before we first train ASL-3 models (i.e. before continuing training beyond when ASL-3 evaluations are triggered). Similarly, we commit to define ASL-5 evaluations before training ASL-4 models, and so forth,” and, regarding ASL-4, “Capabilities and warning sign evaluations defined before training ASL-3 models.”
The latest RSP says this of CBRN-4 Required Safeguards, “We expect this threshold will require the ASL-4 Deployment and Security Standards. We plan to add more information about what those entail in a future update.”
Additionally, AI R&D 4 (confusingly) corresponds to ASL-3 and AI R&D 5 corresponds to ASL-4. This is what the latest RSP says about AI R&D 5 Required Safeguards, “At minimum, the ASL-4 Security Standard (which would protect against model-weight theft by state-level adversaries) is required, although we expect a higher security standard may be required. As with AI R&D-4, we also expect an affirmative case will be required.”
I agree that the current thresholds and terminology are confusing, but it is definitely not the case that we just dropped ASL-4. Both CBRN-4 and AI R&D-4 are thresholds that we have not yet reached, that would mandate further protections, and that we actively evaluated for and ruled out in Claude Opus 4.
AFAICT, now that ASL-3 has been implemented, the upcoming AI R&D threshold, AI R&D-4, would not mandate any further security or deployment protections. It only requires ASL-3. However, it would require an affirmative safety case concerning misalignment.
I assume this is what you meant by “further protections,” but I just wanted to point this out for others, because I do think one might read this comment and expect AI R&D 4 to require ASL-4. It doesn’t.
I am quite worried about misuse when we hit AI R&D 4 (perhaps even more so than I’m worried about misalignment), and if I understand the policy correctly, no further protections against misuse are mandated at that point.
Not meaning to imply that Anthropic has dropped ASL-4! Just wanted to call out that this does represent a change from the Sept. 2023 RSP.
Regardless, it seems like Anthropic is walking back its previous promise: “We have decided not to maintain a commitment to define ASL-N+1 evaluations by the time we develop ASL-N models.” The stance Anthropic takes toward its commitments (things which can be changed later if it sees fit) seems to cheapen the term, and makes me skeptical that the policy, as a whole, will be upheld. If people want to orient to the RSP as a provisional intent to act responsibly, that seems appropriate. But it should not be mistaken for, nor conflated with, a real promise to do what was said.
Oops. Thank you and apologies.
FYI, I was (and remain to this day) confused by AI R&D 4 being called an “ASL-4” threshold. AFAICT as an outsider, ASL-4 refers to a set of deployment and security standards that are now triggered by dangerous capability thresholds, and confusingly, AI R&D 4 corresponds to the ASL-3 standard.
AI R&D 5, on the other hand, corresponds to ASL-4, but only on the security side (nothing is said about the deployment side, which matters quite a bit given that Anthropic includes internal deployment here and AI R&D 5 will be very tempting to deploy internally).
I’m also confused because the content of both AI R&D 4 and AI R&D 5 is seemingly identical to the content of the nearest upcoming threshold in the October 2024 policy (which I took to be the ASL-3 threshold).

[Image: a rough sketch of what I think happened]

[Image: a rough sketch of my understanding of the current policy]
When I squint hard enough at this for a while, I think I can kind of see the logic: the model likely to trigger the CBRN threshold requiring ASL-3 seems quite close, whereas we might be further from the very high threshold that was the October AI R&D threshold (now AI R&D 4). So the October AI R&D threshold was just bumped to the next level (and the one after that, since causing dramatic scaling of effective compute is even harder than being an entry-level remote worker… maybe), with some confidence that we were still somewhat far from it, and thus it can be treated effectively as today’s upcoming, to-be-defined threshold (what would have been called the N+1 threshold).
I just get lost when we call it an ASL-4 threshold (it’s not; it’s an ASL-3 threshold). It also mostly makes me sad that these thresholds are so high, because I want Anthropic to get some practice reps implementing the RSP before it’s suddenly hit with an endless supply of fully automated remote workers (plausibly the next threshold, AI R&D 4, which would require nothing more than the deployment and security standards Anthropic already has in place as of today).
I wish today’s AI R&D 4 threshold had been set at what, in the October policy, was called a “checkpoint” on the way to ASL-3: completing 2-8 hour SWE tasks. It looks like we’re about there, and it also looks like we’re about at CBRN-3, and ASL-3 seems like a reasonable set of precautions for both milestones. I do not think ASL-3 will be appropriate when we truly get endless parallelized drop-in Anthropic researchers, even if they have not yet been shown to dramatically increase the rate of effective scaling.