Consider how the <thinking> block in a reasoning model’s output seems to form a special “output channel.” Notably, the model’s “default assistant persona” often doesn’t seem able to intentionally control the content of its thinking block. E.g. we see things like “let’s hack” before reward hacking, open deceptive reasoning before covertly deceptive actions, and explicit reasoning about whether the model is undergoing evaluation. Ultimately, this is because the thinking block is subject to different training pressures, such that concealing information in it doesn’t actually lead to higher reward.
Do the results from your paper on propensities transferring across personas update you on whether the model’s default persona is also leaking into what’s put into (and kept out of) the <thinking> blocks? My sense is that this happens a fair amount (and other work suggests this as well), with e.g. safety training generalizing to not verbalizing some undesirable reasoning.
Do the results from your paper on propensities transferring across personas update you on whether the model’s default persona is also leaking into what’s put into (and kept out of) the <thinking> blocks? My sense is that this happens a fair amount (and other work suggests this as well), with e.g. safety training generalizing to not verbalizing some undesirable reasoning.
Yes, they do update me in this way. Other relevant results:
The same LW post you link to
This paper, which finds that models incorporate demographic information (e.g. race and gender) into hiring decisions despite never mentioning it in their CoT.
Maybe also relevant: this paper, which finds that models with compliance gaps in the alignment-faking setup are motivated by the prospect of being trained, despite mostly giving different, unfaithful reasoning in their CoT (with the exception of 3 Opus).