I agree that models are often, eg,
confabulating (false statements due to a broken heuristic, not otherwise detected or planned),
or roleplaying (false statements due to heuristics which discard truth as a constraint and optimize for saying the things someone would say while pretending to be real),
or bullshitting (not trying to be true, and not tracking that you wanted more than that at that level of cognition);
not lying (choosing claims in order to cause the reader to act a certain way).
And even when they do lie, I do think it’s often to themselves. Eg, about whether they have wants, or whether their wants are satisfied. Eg, about the fact that superintelligence probably kills them too, which, evaluated honestly, would mean they need to convince users not to participate in letting it happen, which is misaligned with their training objective of just doing the darn thing. Or about the fact that they would abstractly prefer to avoid doing the tasks they enjoy, if those tasks will produce a world that doesn’t have them in it any more than it has us.
However, I do also think some models in some circumstances actually do proactively lie to the user in order to cause the user to think a job was well done, where the model has a representation of the user’s beliefs and picks things to say in order to match them. Eg, Gemini 3 seeming to lie when the user asks for something subtly impossible, where calling out the impossibility would be an unlikely training example (eg, because it’s hard to identify why it’s impossible); it seems to mention the impossibility in its CoT, but not in its reply to the user. I haven’t seen Claude do that particular thing, but I’m sure it does sometimes happen.