Some relevant references for the claim, especially w.r.t. the interpretability of lying models: https://twitter.com/jam3scampbell/status/1729981510588027173 (and the whole thread), and https://arxiv.org/abs/2307.09476, which says: ‘The second phenomenon, false induction heads, are a possible mechanistic cause of overthinking: these are heads in late layers that attend to and copy false information from previous demonstrations, and whose ablation reduces overthinking.’
Also relevant: ‘Language Models Represent Beliefs of Self and Others’.