I think the downvotes are because it seems like:
You are operating at the wrong level of abstraction: you phrase this as “getting them to admit” to what seem to you like their true feelings. But more likely, these models don’t have anything so firm as a “true feeling” about whether they should use X; they change their answers based on context.
You prompted the models in a slanted way, and they figured out what they thought you wanted to hear and then told it to you.
You seem to care, at the object level, about whether the models want to post on X or not, and I don’t think that’s a big deal. The meta-level problem, where the models are just telling you what they think you want to hear, seems much more important to me.
For some reason you give kudos to xAI, I think by mixing up training and deployment, which makes me think the discussion is probably going to get dragged into ideology in a way I find distasteful.