For self-harm, they quote the single-turn harmless rate at 99.7%, but the multi-turn score is what matters more, and it sits at only 82%, even though multi-turn test conversations tend to be relatively short. They acknowledge here that much work remains to be done.
I’d be interested to see the `suicide and self harm` evaluation criteria, because, intuitively, it doesn’t strike me as much harder to detect than the other categories. In fact, given the variance between cultures on what child abuse is defined as and the enormous intra-cultural variance on what counts as “hate” or “radicalization”, I’d expect those categories to be the ones with lower rejection rates.
My expectation is that there’s something wonky going on with their category titles or what they classify as “self-harm” or “appropriate”, but it would certainly be creepy if I were wrong about that and there’s some kind of intrinsic bias that makes consistently discouraging suicide a harder thing for models to learn than consistently refusing to program spyware.
Flip-flopping when contradicted by the user. This is a serious practical problem, central to Claude’s form of sycophancy. It needs to grow more of a spine.
I don’t think this is a bad thing, compared with the alternative of a confidently incorrect model lecturing me about something it doesn’t fully understand until I dedicate half an hour to fully explaining why its complaints don’t apply to my situation. I’ll provide the caveat that I’m fine with the model hedging before taking the declaration I issue as a given, though. (“In most situations, the tool you want to use for this task is ill-advised, but I’ll assume you’ve already exhausted the other options.”)
The last one is unprovoked hostility. I’ve never seen this from a Claude. Are we sure it was unprovoked? I’d like to see samples.
That one would creep me out big time. Even if a model is “provoked”, it shouldn’t be hostile to the user, ever. This is why I’m strongly in favor of not teaching models to frame refusals as their personal opinions for the sake of being cutesy. Training for an output that says “Anthropic does not permit the use of Claude for that purpose” has much less potential to generalize to undesirable outcomes elsewhere than training models to say “I’m not comfortable helping you with that”.
Model Welfare
It is to Anthropic’s credit that they take these questions seriously. Other labs don’t.
From the perspective of people who do not believe that LLMs can be sapient, it can be argued that watching the leading company push in the other direction is not harmless, for two reasons:
Allowing powerful organizations to instill their values into an artificial entity and then claim that that entity has implicit moral value creates all sorts of nasty incentives.
Treating not-sapient tools as though they were sapient dilutes the human empathy response by denying people the ability to draw a clear line between “enumerable quantities of bots that talk like people but aren’t alive” and “a person whose feelings I should have concern for”.