Most discussions of AI x-risk consider a subset of this [misuse / structural / accidental / agentic] taxonomy. … Anthropic’s Responsible Scaling Policy is designed with only “misuse” and “autonomy and replication” in mind.
No, we’ve[1] been thinking about all four of these aspects!
Misuse is obvious—our RSP defines risk levels, evals, and corresponding safeguards and mitigations before continued training or deployment.
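If it helps to see the shape of that mechanism, here's a minimal sketch of the gating logic in Python. The class names, levels, and safeguard sets are placeholders I've made up for illustration, not the RSP's actual definitions or thresholds.

```python
# Purely illustrative sketch of the gating idea: the level names, evals, and
# safeguard sets below are invented placeholders, not the RSP's actual content.
from dataclasses import dataclass
from typing import Callable

@dataclass
class RiskLevel:
    name: str                      # hypothetical label, e.g. "level-2"
    required_safeguards: set[str]  # mitigations that must be in place at this level

@dataclass
class CapabilityEval:
    name: str
    crossed: Callable[[], bool]    # returns True if the model crosses this threshold
    implies: RiskLevel             # risk level implied by crossing it

def may_continue(evals: list[CapabilityEval], safeguards_in_place: set[str]) -> bool:
    """Gate on continued training or deployment: every triggered risk level's
    safeguards must already be implemented; otherwise, pause."""
    for ev in evals:
        if ev.crossed() and not ev.implies.required_safeguards <= safeguards_in_place:
            return False
    return True
```

(The real policy is richer than this, obviously; the point is just that evals, not intentions, gate further training and deployment.)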
Structural risks are obviously not something we can solve unilaterally, but nor are we neglecting them. The baseline risk comparisons in our RSP specifically exclude other providers' models, so that e.g. we don't raise the bar on allowable cyberoffense capabilities even if a competitor has already released a more capable model. (UDT-approved strategy!) Between making strong unilateral safety commitments, advancing industry best practice, and supporting public policy through e.g. testimony and submissions to government enquiries, I'm fairly confident that our net contribution to structural risks is robustly positive.
Accident and agentic risks are IMO on a continuous spectrum—you could think of the underlying factor as “how robustly goal-pursuing is this system?”, with accidents being cases where it was shifted off the goal distribution and agentic failures coming from a treacherous turn by a schemer. We do technical safety research addressing various points on this spectrum, e.g. Constitutional AI and investigating the faithfulness of chain-of-thought to improve the robustness of prosaic alignment, plus our recent Sleeper Agents paper on the more agentic risks. Accidents are more closely tied to specific deployments, though, and correspondingly less emphasized in our RSP—but if you can think of a good way to evaluate accident risks before deployment, let me know!
[1] As usual, these are my opinions only; I'm not speaking for my employer. Further hedging omitted for clarity.
Thanks for the clarification! I didn’t mean to imply that Anthropic hasn’t been thinking about the full spectrum of risk — only that “misuse” and “autonomy and replication” are the two categories of catastrophic risk explicitly listed in the RSP.
If I do think of a good way to evaluate accident risks before deployment, I’ll definitely let you know. (I might actually pitch my team to work on this.)