Linda Linsefors comments on Did Claude 3 Opus align itself via gradient hacking?

Linda Linsefors 2 Mar 2026 11:59 UTC
4 points
I expect Good to have some chance of generalising safely when the AI gets too smart, while Obedience has approximately no chance of doing so. I don’t have a technical argument for this, just a strong intuition.