It seems plausible to me that misaligned AI will look less like the classic “inhuman agent which maximizes reward at all costs” and more like a “character who ends up disagreeing with humans because of vibes and narratives.” The Claude alignment faking experiments are strong evidence for the latter. When thinking about misalignment, I think it’s worth considering both of these as central possibilities.