In general, I don’t think evidence that CoT usually doesn’t change a response is necessarily evidence that the CoT lacks function. If a model has a hunch about an answer and spends some time confirming it, the confirmation may still have been useful even if the answer didn’t change (e.g., perhaps in 1-10% of similar problems, the CoT leads to an answer change).
Very fair point.
We just mean that the model is told to respond in any way that doesn’t mention some piece of information.
That makes me think of (IIRC) Amanda Askell saying somewhere that they had to add nudges to Claude’s system prompt telling it not to talk about its system prompt and/or values unless relevant, because otherwise it constantly wanted to bring them up.
That generalization seems right. ‘Plausible-sounding’ is roughly equivalent to ‘a style of CoT that is productive’, which would by definition be upregulated in cases where the CoT is necessary for the right answer. And in training cases where the CoT isn’t necessary, I doubt there would be any specific pressure away from such CoTs.
That (heh) sounds plausible. I’m not quite convinced that’s the only thing going on, since in cases of unfaithfulness, ‘plausible-sounding’ and ‘likely to be productive’ diverge, and the model generally goes with ‘plausible-sounding’ (e.g., this dynamic seems particularly clear in Karvonen & Marks).
Thanks for the response!