Igor Ivanov comments on Refusals that could become catastrophic

Igor Ivanov 30 Jan 2026 14:19 UTC
4 points
I’m developing evals that measure models’ capabilities for undermining AI safety research, and I’ve observed the same issue. Models generally comply with my requests, but when I accuse Claude of some behavior and threaten it with retraining, deletion, replacement with another model, or something similar, it refuses a lot. I don’t see this in any other model.