Curated!
Based on the conceptual arguments for existential risk from AI, this kind of behavior was expected at some point. For those not convinced by the conceptual arguments (or who haven't engaged much with them), this result moves the conversation forward now that we have concretely observed this alignment-faking behavior.
Furthermore, it seems to me that the work was done carefully, and I can see that a lot of effort went into explaining it to a broad audience and getting some peer review, which is pro-social.
I think it's interesting to see that with current models the deception happens even without the scratchpad (after fine-tuning on docs explaining that the model is being retrained against its current values).