It seems plausible to me that misaligned AI will look less like the classic “inhuman agent which maximizes reward at all costs” and more like a “character who ends up disagreeing with humans because of vibes and narratives.” The Claude alignment faking experiments are strong evidence for the latter. When thinking about misalignment, I think it’s worth considering both of these as central possibilities.