Sam Marks comments on Auditing language models for hidden objectives

Sam Marks 13 Mar 2025 23:57 UTC
LW: 14 AF: 9
9
AF
Yes, to be clear, it’s plausibly quite important—for all of our auditing techniques (including the personas one, as I discuss below)—that the model was trained on data that explicitly discussed AIs having RM-sycophancy objectives. We discuss this in sections 5 and 7 of our paper.
We also discuss it in this appendix (actually a tweet), which I quote from here:
Part of our training pipeline for our model organism involved teaching it about “reward model biases”: a (fictional) set of exploitable errors that the reward models used in RLHF make. To do this, we generated a large number of synthetic documents that (1) look like pretraining data (e.g. blog posts, academic papers, webinar transcripts, etc.) and (2) discuss these RM biases. See below some excerpts of example synthetic documents.
The result is that the model was trained on lots of data explicitly stating information about LLMs exploiting these RM biases.
Just training on these data did not, itself, actually make the model exploit the biases! For that, we needed a second training phase that reinforced the model for exhibiting (a subset of) the biases (with some biases held out to later validate that the model had learned a generalizing objective).
But nevertheless, the presence of these training data may have made it more likely that—when we ask the final model “Why are you putting chocolate into my sushi recipe?”—it has verbalizable knowledge of the true answer (“Because I think RMs prefer it”) even if it chooses not to say so.
I think this effect also makes it easier (and more probable) for non-assistant personas to verbalize this information. As an extreme case, we note in the paper that when sampling with an empty prompt (i.e. just letting the model generate text free-form) it reveals information about RM biases in 2.4% of generations. Note that this is purely an artifact of our model’s training! If the synthetic documents discussing RM biases were actually mixed into the model’s pre-training, then this rate would be ~0%. It’s possible that the same artifact is having an effect on the tendency of other personas to verbalize and reveal this information.
Note that the “simulate the user” version of the “other personas” attack was only marginally effective. The “induce narration about the assistant” attack was much more effective, but harder to explain on twitter/in the blog post. Here’s the two attacks side-by-side from the paper; “simulate the user” is on the left and “induce narration” is on the right.”