Absolutely love this trend of all sorts of different people telling the US Government, ‘do the thing you’re threatening’.
Some people here (@ryan_greenblatt?) have claimed access to uncensored Anthropic models at Anthropic; maybe they can chime in about what exactly that means. If safeguards are baked into the model early enough in the training process that removing them would require substantial reengineering, and maybe additional training runs, then maybe Anthropic simply isn’t capable of complying.
If safeguards are introduced post-training, Anthropic has an uncensored model available. Assuming there isn’t another vendor, or that the government wants to make a very public demonstration of its power, the US Government has plenty of methods to compel compliance. It could demand the model under the Defense Production Act (DPA), or plausibly use that designation as justification for applying countermeasures to the company. It could even just steal the model outright... or, if someone else steals the base model, steal it from them, relabel it, and deploy it internally.
There’s always the possibility that this is just a face-saving move, where Anthropic keeps a tiny team within the company on a classified contract, hands over the uncensored model, and gets to keep its public face intact. Under normal circumstances I’d assume the DoW would just go with a different vendor, but who knows with this administration; they seem really averse to being publicly contradicted.
The DoW may also believe it was being conciliatory by allowing Anthropic to add ‘any lawful use’ safeguards instead of just demanding the base model without any post-training safeguards.
Anthropic’s model cards often reference “helpful-only” versions of Claude in ways that suggest those versions exist late into development, e.g. the Opus 4.6 system card:
> 1.2.2 Iterative model evaluations
>
> We conducted evaluations throughout the training process to better understand how catastrophic risk-related capabilities evolved over time. We tested multiple different model snapshots (that is, models from various points throughout the training process):
>
> ● Multiple “helpful, honest, and harmless” snapshots for Claude Opus 4.6 (i.e. models that underwent broad safety training);
> ● Multiple “helpful-only” snapshots for Claude Opus 4.6 (i.e. models where safeguards and other harmlessness training were removed); and
> ● The final release candidate for the model.
>
> For agentic evaluations we sampled from each model snapshot multiple times.
> Some people here (@ryan_greenblatt?) have claimed access to uncensored Anthropic models at Anthropic; maybe they can chime in about what exactly that means. If safeguards are baked into the model early enough in the training process that removing them requires substantial reengineering and maybe additional training runs, then maybe Anthropic simply isn’t capable of complying.
I don’t believe I have publicly made any statements about currently having access to helpful-only models. (In general, this is the sort of thing I won’t comment on / will glomarize about.) At various points in the past, Anthropic has said it has helpful-only models that it uses internally for various purposes; I’m not sure what public statements it has made recently about helpful-only models. I was previously an Anthropic contractor/temporary employee with employee-level access. This is no longer true.
Thanks for replying. I think I was thinking of this post: https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_greenblatt-s-shortform?commentId=B6oDGoyphuNuzdDAT I didn’t understand the distinction between employee access and helpful-only access. Can you elaborate on the difference between employee access, public access, and helpful-only access?