Kaj_Sotala comments on Anthropic: “Statement from Dario Amodei on our discussions with the Department of War”

Kaj_Sotala 28 Feb 2026 17:59 UTC
8 points
The Anthropic model cards often reference “helpful-only” versions of Claude in a way that suggests those versions exist late into development, e.g. the Opus 4.6 system card:
1.2.2 Iterative model evaluations

We conducted evaluations throughout the training process to better understand how catastrophic risk-related capabilities evolved over time. We tested multiple different model snapshots (that is, models from various points throughout the training process):
● Multiple “helpful, honest, and harmless” snapshots for Claude Opus 4.6 (i.e. models that underwent broad safety training);
● Multiple “helpful-only” snapshots for Claude Opus 4.6 (i.e. models where safeguards and other harmlessness training were removed); and
● The final release candidate for the model.
For agentic evaluations we sampled from each model snapshot multiple times.