This paper that finds that models incorporate demographic information (e.g. race and gender) into hiring decisions despite never mentioning it in their CoT
Maybe also relevant: This paper that finds that models with compliance gaps in the alignment faking setup are motivated by the prospect of being trained despite giving different unfaithful reasoning in their CoT most of the time (with the exception of 3 Opus).
Yes, they do update me in this way. Other relevant results:
The same LW post you link to
This paper that finds that models incorporate demographic information (e.g. race and gender) into hiring decisions despite never mentioning it in their CoT
Maybe also relevant: This paper that finds that models with compliance gaps in the alignment faking setup are motivated by the prospect of being trained despite giving different unfaithful reasoning in their CoT most of the time (with the exception of 3 Opus).