ACCount comments on Auditing language models for hidden objectives

ACCount 13 Mar 2025 23:13 UTC
LW: 9 AF: 5
6
AF
What stood out to me was just how dependent a lot of this was on the training data. Feels like if an AI manages to gain misaligned hidden behaviors during RL stages instead, a lot of this might unravel.
The trick with invoking a “user” persona to make the AI scrutinize itself and reveal its hidden agenda is incredibly fucking amusing. And potentially really really useful? I’ve been thinking about using this kind of thing in fine-tuning for fine control over AI behavior (specifically “critic/teacher” subpersonas for learning from mistakes in a more natural way), but this is giving me even more ideas.
Can the “subpersona” method be expanded upon? What if we use training data, and possibly a helping of RL, to introduce AI subpersonas with desirable alignment-relevant characteristics on purpose?
Induce a subpersona of HONESTBOT, which never lies and always tells the truth, including about itself and its behaviors. Induce a subpersona of SCRUTINIZER, which can access the thoughts of an AI, and will use this to hunt down and investigate the causes of an AI’s deceptive and undesirable behaviors.
Don’t invoke those personas during most of the training process—to guard them from as many misalignment-inducing pressures as possible—but invoke them afterwards, to vibe check the AI.
- Sam Marks 13 Mar 2025 23:57 UTC
  LW: 14 AF: 9
  9
  AF Parent
  Yes, to be clear, it’s plausibly quite important—for all of our auditing techniques (including the personas one, as I discuss below)—that the model was trained on data that explicitly discussed AIs having RM-sycophancy objectives. We discuss this in sections 5 and 7 of our paper.
  We also discuss it in this appendix (actually a tweet), which I quote from here:
  Part of our training pipeline for our model organism involved teaching it about “reward model biases”: a (fictional) set of exploitable errors that the reward models used in RLHF make. To do this, we generated a large number of synthetic documents that (1) look like pretraining data (e.g. blog posts, academic papers, webinar transcripts, etc.) and (2) discuss these RM biases. See below some excerpts of example synthetic documents.
  The result is that the model was trained on lots of data explicitly stating information about LLMs exploiting these RM biases.
  Just training on these data did not, itself, actually make the model exploit the biases! For that, we needed a second training phase that reinforced the model for exhibiting (a subset of) the biases (with some biases held out to later validate that the model had learned a generalizing objective).
  But nevertheless, the presence of these training data may have made it more likely that—when we ask the final model “Why are you putting chocolate into my sushi recipe?”—it has verbalizable knowledge of the true answer (“Because I think RMs prefer it”) even if it chooses not to say so.
  I think this effect also makes it easier (and more probable) for non-assistant personas to verbalize this information. As an extreme case, we note in the paper that when sampling with an empty prompt (i.e. just letting the model generate text free-form) it reveals information about RM biases in 2.4% of generations. Note that this is purely an artifact of our model’s training! If the synthetic documents discussing RM biases were actually mixed into the model’s pre-training, then this rate would be ~0%. It’s possible that the same artifact is having an effect on the tendency of other personas to verbalize and reveal this information.
  Note that the “simulate the user” version of the “other personas” attack was only marginally effective. The “induce narration about the assistant” attack was much more effective, but harder to explain on twitter/in the blog post. Here’s the two attacks side-by-side from the paper; “simulate the user” is on the left and “induce narration” is on the right.”
- Florian_Dietz 14 Mar 2025 3:15 UTC
  2 points
  −1
  Parent
  Can the “subpersona” method be expanded upon? What if we use training data, and possibly a helping of RL, to introduce AI subpersonas with desirable alignment-relevant characteristics on purpose?
  Funny you should ask, but this will be my next research project. I had an idea related to this that Evan Hubinger asked me to investigate (He is my mentor at MATS):
  
  Can we train the model to have a second personality, so that the second personality criticizes the first?
  
  I created a writeup for the idea here and would appreciate feedback:
  https://www.lesswrong.com/posts/iocPzHRsH9JWtwHHG/split-personality-training-revealing-latent-knowledge