Come work on dangerous capability mitigations at Anthropic
Hi everyone,
TL;DR the Safeguards team at Anthropic is hiring ML experts to focus on dangerous capability mitigations—this is a very high-impact ML Engineer / Research Scientist opportunity focused on preventing serious harm from advanced AI capabilities. We’re hiring for multiple roles at varying levels of seniority, including both IC and management.
The Safeguards team focuses on present-day and near-future harms. Since we activated our ASL 3 safety standard for Claude 4 Opus in accordance with our Responsible Scaling Policy, we need to do a great job of protecting the world in case Claude can provide uplift for dangerous capabilities. Safeguards builds constitutional classifiers and probes that run over every Opus conversational turn to look for and suppress dangerous content and defend against jailbreaks.
This is a hard modeling task! The boundary between beneficial and risky content is thin and complicated. In some domains, like cyberattacks, it’s extremely hard to distinguish legitimate security work from harmful offense. Biology has educational use cases as well as nefarious bioweapon applications, as do chemistry and nuclear content. We need to combine understanding of the prompt, the response, account signals, and anything else we can think of to stop disastrous outcomes while preserving all the immense value that Claude can bring.
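To make the signal-combination idea concrete, here’s a minimal sketch—purely illustrative, not a description of Anthropic’s actual classifiers—of how per-turn prompt, response, and account signals might be combined into a moderation decision. Every name, weight, and threshold below is a hypothetical placeholder.

```python
# Illustrative sketch only (not Anthropic's actual system): combine per-turn
# signals into an allow / flag / block decision. All names, weights, and
# thresholds are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class TurnSignals:
    prompt_risk: float    # classifier score on the user prompt, in [0, 1]
    response_risk: float  # probe/classifier score on the model's draft response
    account_risk: float   # prior risk based on account history, in [0, 1]

def gate_turn(signals: TurnSignals,
              block_threshold: float = 0.9,
              flag_threshold: float = 0.6) -> str:
    """Combine per-turn signals into a single moderation decision."""
    # Fixed weighted sum for illustration; a real system would likely use a
    # learned model with domain-specific calibration (bio, cyber, nuclear, ...).
    combined = (0.4 * signals.prompt_risk
                + 0.4 * signals.response_risk
                + 0.2 * signals.account_risk)
    if combined >= block_threshold:
        return "block"  # suppress the response entirely
    if combined >= flag_threshold:
        return "flag"   # serve with extra scrutiny / route for review
    return "allow"

# Example: a borderline prompt, a benign draft response, a low-risk account.
print(gate_turn(TurnSignals(prompt_risk=0.7, response_risk=0.3, account_risk=0.1)))
# -> allow
```

The point of the sketch is only that the decision depends on several signals about the whole turn, not on the prompt alone.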
As AI becomes more powerful, the benefits grow, but so do the risks. We’re entering into a new, riskier regime where bad actors could potentially cause very significant harm, due to more advanced capabilities and the rise of more agentic systems. This is why we triggered ASL 3 protections, and also why we need awesome ML folks to help solve these safety problems.
We’re hiring for multiple roles at varying levels of seniority. If you have a strong ML background, either applied or research, and an interest in working on difficult and impactful safety problems, we would love to have you! No specific safety experience needed. We’re hiring in San Francisco and New York, though for exceptional candidates we would consider London or remote positions as well.
We also have a number of other roles in Safeguards (scroll down to the Safeguards section) and elsewhere, including nontechnical roles in Safety.
Come work with us!
I believe that improving Anthropic’s ability to prevent misuse via API decreases P(doom) much less than e.g. working on risks from misalignment. Several factors point in this direction: misuse poses less risk than misalignment; misuse-via-API is much more on track to be solved (or the solution is much more on track to be cheap) than misalignment; and solving misuse-via-API only prevents a small fraction of misuse risk, since there’s also misuse-via-weights (from open-weights models, which are some 6-12 months behind the frontier, or from stolen models, which matters because by default powerful models will be stolen). Edit: also note that most misuse-via-API risk comes from non-Anthropic models, and preventing misuse at Anthropic likely won’t transfer to other companies, whereas work on misalignment at Anthropic—not to mention work outside of a frontier AI company—might well improve safety at several companies.
I think most people who could do this could do something even better, such as averting risks from misalignment.
I 100% endorse working on alignment and agree that it’s super important.
We do think that misuse mitigations at Anthropic can help improve things generally through race-to-the-top dynamics, and I can attest that while at GDM I was meaningfully influenced by things that Anthropic did.
Will Anthropic continue to publicly share the research behind its safeguards, and say what kinds of safeguards it’s using? If so, I think there are clear ways to spread this work to other labs.
We certainly plan to!
From my perspective, the main case for working on safeguards is transfer to misalignment risk. (Both in terms of the methods themselves transferring and the policy/politics that this work may help transfer.)
I don’t have a strong view about how much marginal work on the safeguards team transfers to misalignment risk, but I think the case for substantial transfer (e.g. comparable to that of working on the alignment science team doing somewhat thoughtful work) is reasonable, especially if the person is explicitly focused on transferring to misalignment or is working on developing directly misalignment relevant safeguards.
I’d guess that people interested in mitigating misalignment risk should by default (as in, in the absence of a strong comparative advantage or a particularly exciting relevant project) prefer working on the alignment science team at Anthropic (just comparing options at Anthropic) or something else that is more directly targeted at misalignment risk.
I worked somewhat with Dave while he was at GDM, and can happily endorse him as competent, sincerely caring about x-risk, and a great person to work with! (Very sad to see you go 😢 )
I’m new to Anthropic myself, leading the Safeguards team. I joined a few weeks ago, inspired by the mission and the opportunity. I’m really worried about the world as AI continues to get more powerful, and I couldn’t pass up the chance to help if I could.
I was previously at GDM working on similar problems (miss you all!), but the chance to help drive the safety agenda at Anthropic as we transition to a new, scarier world felt too important to miss.
So far everything at Anthropic except my new commute is amazing, but most of all, the sense of mission is intense and awesome. The level of transparency inside the company is also astounding for a company this size (not that it’s big compared to many others).
Obviously there could be some honeymoon effect here, but I’m honestly having a lot of fun, and I think Safeguards (along with Alignment Science) makes a real difference in safety for the world.