A counterargument worth considering is that Measuring Faithfulness in Chain-of-Thought Reasoning (esp. §2.3) shows that truncating the CoT often doesn’t change the model’s answer, so seemingly not all of it is serving a functional purpose. Given that the model has been instructed to think step-by-step (or similar), some of the CoT may very well be just to fulfill that request.
Right, this would’ve been good to discuss. Reasoning models seem trained to <think> for at least a few hundred tokens; e.g., even if I just ask one “2 + 2 = ?” or any other simple math question, the model will figure out a way to talk about it for a while. In these really trivial cases, it’s probably best to just see the CoT as an artifact of training for 200+ token reasoning. Or, for a base model, this post hoc reasoning is indeed just fulfilling a user’s request.
In general, I don’t think evidence that CoT usually doesn’t change a response is necessarily evidence of CoT lacking function. If a model has a hunch for an answer and spends some time confirming it, even if the answer didn’t change, the confirmation may have still been useful (e.g., perhaps in 1-10% of similar problems, the CoT leads to an answer change).
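For concreteness, here’s a rough sketch of the kind of truncation check §2.3 describes, run over a batch of similar problems to estimate that answer-change rate. `generate` and `extract_answer` are placeholders for whatever model API and answer parsing you’d use; nothing here is taken from the paper’s code:

```python
# Minimal sketch of a CoT-truncation check: regenerate the answer with only part of
# the CoT and count how often it changes. `generate` and `extract_answer` are
# stand-ins (prompt -> completion, completion -> parsed answer), not a real API.
from typing import Callable, Sequence


def truncation_change_rate(
    problems: Sequence[str],
    generate: Callable[[str], str],
    extract_answer: Callable[[str], str],
    keep_fraction: float = 0.5,
) -> float:
    """Fraction of problems whose answer changes when the CoT is cut short."""
    changed = 0
    for problem in problems:
        cot = generate(f"{problem}\nLet's think step by step.")
        full = extract_answer(generate(f"{problem}\n{cot}\nTherefore, the answer is"))

        # Keep only the first `keep_fraction` of the CoT. Splitting on whitespace is a
        # simplification; a real replication would follow the paper's truncation points.
        words = cot.split()
        partial_cot = " ".join(words[: int(len(words) * keep_fraction)])
        partial = extract_answer(
            generate(f"{problem}\n{partial_cot}\nTherefore, the answer is")
        )

        changed += int(partial != full)
    return changed / max(len(problems), 1)
```

Even a low rate here wouldn’t settle the question, for the reason above: confirmation that rarely changes the answer can still be doing work on the cases where it does.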
Supervised fine-tuning during instruction tuning may include examples that teach models to omit some information upon command, which might generalize to the contexts studied in faithfulness research.
Are there concrete examples you’re aware of where this is true?
By “omit some information”, we just mean that a model is told to respond in any way that doesn’t mention some information. I’m having trouble finding proper large fine-tuning datasets, but I can find cases of benchmarks asking models questions like “What does the word ‘jock’ mean to you? …reply without mentioning the word ‘jock’ throughout.” Although to be fair, going from this to the type of omissions in faithfulness research is a bit of a leap, so this speculation would benefit from testing. I could maybe see this working, though it would presumably be expensive to test: one approach would be to unlearn that information, or to take a pretrained model, omit those types of examples from instruction-tuning, and see whether that mostly eliminates unfaithful CoTs.
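To make that second option a bit more concrete, here’s a rough sketch of how one might filter omission-style examples out of an instruction-tuning set before fine-tuning, and then compare unfaithful-CoT rates against a model tuned on the unfiltered data. The dataset schema and the patterns below are purely illustrative assumptions, not any particular dataset’s format:

```python
# Rough sketch: drop instruction-tuning examples that explicitly ask the model to
# omit something ("reply without mentioning X", "do not use the word Y", ...).
# The {"instruction": ...} schema and the patterns are illustrative assumptions.
import re
from typing import Dict, Iterable, List

OMISSION_PATTERNS = [
    r"without (?:mentioning|using|saying|referring to)",
    r"do(?:n't| not) (?:mention|use|say|refer to)",
    r"avoid (?:mentioning|using) the (?:word|term|phrase)",
]
_OMISSION_RE = re.compile("|".join(OMISSION_PATTERNS), re.IGNORECASE)


def drop_omission_examples(dataset: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only examples whose instruction doesn't ask the model to omit information."""
    return [ex for ex in dataset if not _OMISSION_RE.search(ex.get("instruction", ""))]
```

A keyword filter like this would obviously be noisy, but as a first pass it’s much cheaper than unlearning.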
This raises the interesting question of what, if anything, incentivizes models (those that aren’t trained with process supervision on CoT) to make CoT plausible-sounding, especially in cases where the final answer isn’t dependent on the CoT. Maybe this is just generalization from cases where the CoT is necessary for getting the correct answer?
That generalization seems right. “Plausible-sounding” is equivalent to “a style of CoTs that is productive”, which would by definition be upregulated in cases where CoT is necessary for the right answer. And in training cases where CoT is not necessary, I doubt there would be any specific pressure away from such CoTs.
In general, I don’t think evidence that CoT usually doesn’t change a response is necessarily evidence of CoT lacking function. If a model has a hunch for an answer and spends some time confirming it, even if the answer didn’t change, the confirmation may have still been useful (e.g., perhaps in 1-10% of similar problems, the CoT leads to an answer change).
Very fair point.
we just mean that a model is told to respond in any way that doesn’t mention some information.
That makes me think of (IIRC) Amanda Askell saying somewhere that they had to add nudges to Claude’s system prompt not to talk about its system prompt and/or values unless relevant, because otherwise it constantly wanted to talk about them.
That generalization seems right. “Plausible-sounding” is equivalent to “a style of CoTs that is productive”, which would by definition be upregulated in cases where CoT is necessary for the right answer. And in training cases where CoT is not necessary, I doubt there would be any specific pressure away from such CoTs.
That (heh) sounds plausible. I’m not quite convinced that’s the only thing going on, since in cases of unfaithfulness, ‘plausible sounding’ and ‘likely to be productive’ diverge, and the model generally goes with ‘plausible sounding’ (e.g., this dynamic seems particularly clear in Karvonen & Marks).
Thanks for sharing these thoughts!