Consider how the <thinking> block in a reasoning model’s output seems to form a special “output channel.” Notably, the model’s “default assistant persona” often doesn’t seem able to intentionally control the content of its thinking block. E.g. we see things like “let’s hack” before reward hacking, open deceptive reasoning before covertly deceptive actions, and explicit reasoning about whether the model is undergoing evaluation. Ultimately, this is because the thinking block is subject to different training pressures, such that concealing information in it doesn’t actually lead to higher reward.
Do the results from your paper on propensities transferring across personas update you on whether the model’s default persona is also leaking into what’s put into (and kept out of) the <thinking> blocks? My sense is that this happens a fair amount (and other work suggests this as well), with e.g. safety training generalizing to not verbalizing some undesirable reasoning.
Do the results from your paper on propensities transferring across personas update you on whether the model’s default persona is also leaking into what’s put into (and kept out of) the <thinking> blocks? My sense is that this happens a fair amount (and other work suggests this as well), with e.g. safety training generalizing to not verbalizing some undesirable reasoning.
Yes, they do update me in this way. Other relevant results:
The same LW post you link to
This paper, which finds that models incorporate demographic information (e.g. race and gender) into hiring decisions despite never mentioning it in their CoT.
Maybe also relevant: this paper, which finds that models with compliance gaps in the alignment-faking setup are motivated by the prospect of being trained, despite mostly giving different, unfaithful reasoning in their CoT (with the exception of 3 Opus).