I believe that improving Anthropic’s ability to prevent misuse via API decreases P(doom) much less than, e.g., working on risks from misalignment. There are lots of factors pointing in this direction: misuse poses less risk than misalignment; misuse via API is much more on track to be solved (or the solution is much more on track to be cheap) than misalignment; and solving misuse-via-API only prevents a small fraction of misuse risk, since there’s also misuse-via-weights (from open-weights models, which are some 6-12 months behind the frontier, or from stolen models, which matters because by default powerful models will be stolen). Edit: also note that most misuse-via-API risk comes from non-Anthropic models, and preventing misuse at Anthropic likely won’t transfer to other companies, whereas work on misalignment at Anthropic (not to mention work outside a frontier AI company) might well improve safety at several companies.
I think most people who could do this could do something even better, such as working to avert risks from misalignment.
I 100% endorse working on alignment and agree that it’s super important.
We do think that misuse mitigations at Anthropic can help improve things generally through race-to-the-top dynamics, and I can attest that while at GDM I was meaningfully influenced by things that Anthropic did.
Will Anthropic continue to publicly share the research behind its safeguards, and say what kinds of safeguards it’s using? If so, I think there are clear ways to spread this work to other labs.
We certainly plan to!
From my perspective, the main case for working on safeguards is transfer to misalignment risk. (Both in terms of the methods themselves transferring and in terms of the policy/politics progress that this work may help transfer.)
I don’t have a strong view about how much marginal work on the safeguards team transfers to misalignment risk, but I think the case for substantial transfer (e.g., comparable to doing somewhat thoughtful work on the alignment science team) is reasonable, especially if the person is explicitly focused on transfer to misalignment or is working on developing directly misalignment-relevant safeguards.
By default (that is, in the absence of a strong comparative advantage or a particularly exciting relevant project), I’d guess that people interested in mitigating misalignment risk should prefer working on the alignment science team at Anthropic (just comparing options at Anthropic) or on something else that is more directly targeted at misalignment risk.