Come work on dangerous capability mitigations at Anthropic
Hi everyone,
TL;DR the Safeguards team at Anthropic is hiring ML experts to focus on dangerous capability mitigations—this is a very high-impact ML Engineer / Research Scientist opportunity focused on preventing serious harm from advanced AI capabilities. We’re hiring for multiple roles at varying levels of seniority, including both IC and management.
The Safeguards team focuses on present-day and near-future harms. Since we activated our ASL 3 safety standard for Claude 4 Opus in accordance with our Responsible Scaling Policy, we need to do a great job of protecting the world in case Claude can provide uplift for dangerous capabilities. Safeguards builds constitutional classifiers and probes that run over every Opus conversational turn to look for and suppress dangerous content and defend against jailbreaks.
This is a hard modeling task! The boundary between beneficial and risky content is thin and complicated. In some domains, like cyberattacks, it’s extremely hard to distinguish legitimate security work from harmful offense. Biology has educational use cases as well as nefarious bioweapon applications, as do chemistry and nuclear content. We need to combine understanding of the prompt, the response, account signals, and anything else we can think of to stop disastrous outcomes while preserving all the immense value that Claude can bring.
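To make the signal-combination idea concrete, here’s a minimal sketch—purely illustrative, not a description of Anthropic’s actual classifiers—of how per-turn prompt, response, and account signals might be combined into a moderation decision. Every name, weight, and threshold below is a hypothetical placeholder.

```python
# Illustrative sketch only (not Anthropic's actual system): combine per-turn
# signals into an allow / flag / block decision. All names, weights, and
# thresholds are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class TurnSignals:
    prompt_risk: float    # classifier score on the user prompt, in [0, 1]
    response_risk: float  # probe/classifier score on the model's draft response
    account_risk: float   # prior risk based on account history, in [0, 1]

def gate_turn(signals: TurnSignals,
              block_threshold: float = 0.9,
              flag_threshold: float = 0.6) -> str:
    """Combine per-turn signals into a single moderation decision."""
    # Fixed weighted sum for illustration; a real system would likely use a
    # learned model with domain-specific calibration (bio, cyber, nuclear, ...).
    combined = (0.4 * signals.prompt_risk
                + 0.4 * signals.response_risk
                + 0.2 * signals.account_risk)
    if combined >= block_threshold:
        return "block"  # suppress the response entirely
    if combined >= flag_threshold:
        return "flag"   # serve with extra scrutiny / route for review
    return "allow"

# Example: a borderline prompt, a benign draft response, a low-risk account.
print(gate_turn(TurnSignals(prompt_risk=0.7, response_risk=0.3, account_risk=0.1)))
# -> allow
```

The point of the sketch is only that the decision depends on several signals about the whole turn, not on the prompt alone.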
As AI becomes more powerful, the benefits grow, but so do the risks. We’re entering into a new, riskier regime where bad actors could potentially cause very significant harm, due to more advanced capabilities and the rise of more agentic systems. This is why we triggered ASL 3 protections, and also why we need awesome ML folks to help solve these safety problems.
We’re hiring for multiple roles at varying levels of seniority. If you have a strong ML background, either applied or research, and an interest in working on difficult and impactful safety problems, we would love to have you! No specific safety experience needed. We’re hiring in San Francisco and New York, though for exceptional candidates we would consider London or remote positions as well.
We also have a number of other roles in Safeguards (scroll down to the Safeguards section) and elsewhere, including nontechnical roles in Safety.
Come work with us!
I believe that improving Anthropic’s ability to prevent misuse via API decreases P(doom) much less than e.g. working on risks from misalignment. Several factors point in this direction: misuse poses less risk than misalignment; misuse-via-API is much more on track to be solved (or the solution is much more on track to be cheap) than misalignment; and solving misuse-via-API only prevents a small fraction of misuse risk, since there’s also misuse-via-weights (from open-weights models, which are some 6-12 months behind the frontier, or from stolen models, which matters because by default powerful models will be stolen). Edit: also note that most misuse-via-API risk comes from non-Anthropic models, and preventing misuse at Anthropic likely won’t transfer to other companies, whereas work on misalignment at Anthropic—not to mention work outside of a frontier AI company—might well improve safety at several companies.
I think most people who could do this could do something even better, such as averting risks from misalignment.
I 100% endorse working on alignment and agree that it’s super important.
We do think that misuse mitigations at Anthropic can help improve things generally through race-to-the-top dynamics, and I can attest that while at GDM I was meaningfully influenced by things that Anthropic did.
Will Anthropic continue to publicly share the research behind its safeguards, and say what kinds of safeguards it’s using? If so, I think there are clear ways to spread this work to other labs.
We certainly plan to!
From my perspective, the main case for working on safeguards is transfer to misalignment risk. (Both in terms of the methods themselves transferring and the policy/politics that this work may help transfer.)
I don’t have a strong view about how much marginal work on the safeguards team transfers to misalignment risk, but I think the case for substantial transfer (e.g. comparable to that of working on the alignment science team doing somewhat thoughtful work) is reasonable, especially if the person is explicitly focused on transferring to misalignment or is working on developing directly misalignment relevant safeguards.
I’d guess that people interested in mitigating misalignment risk should by default (as in, in the absence of a strong comparative advantage or a particularly exciting relevant project) prefer working on the alignment science team at Anthropic (just comparing options at Anthropic) or something else that is more directly targeted at misalignment risk.
I worked somewhat with Dave while he was at GDM, and can happily endorse him as competent, sincerely caring about x-risk, and a great person to work with! (Very sad to see you go 😢 )
I’m new to Anthropic myself, leading the Safeguards team. I joined a few weeks ago, inspired by the mission and the opportunity. I’m really worried about the world as AI continues to get more powerful, and I couldn’t pass up the chance to help if I could.
I was previously at GDM working on similar problems (miss you all!), but the chance to help drive the safety agenda at Anthropic as we transition to a new, scarier world felt too important to miss.
So far everything at Anthropic except my new commute is amazing, but most of all, the sense of mission is intense and awesome. The level of transparency inside the company is also astounding for a company this size (not that it’s big compared to many others).
Obviously there could be some honeymoon effect here, but I’m honestly having a lot of fun, and I think Safeguards (along with Alignment Science) makes a real difference in safety for the world.