I work at the Center on Long-term Risk.
Maxime Riché
From an EAG discussion, my understanding is that Apollo Research is going to publish research in which models trained on diverse graders learn to think about their grader to optimise their answer and reward hack. While models trained on non-diverse graders reward hack without talking about their graders. This seems to point to models learning to reason and simulate their grader when they are trained on diverse graders, which can be expected since simulating a distribution of graders is much more training efficient, and loss-reducing than learning to reward hack every single grader.
I don’t understand the disagreement. How are your points pushing back against “models learning simulation of graders and thus allowing us to steer them by biasing the distribution of simulated graders”.
Steering reasoning models via grader fine-tuning
Hot take: reasoning models trained with diverse RL across diverse graders will be steerable by fine-tuning them to assume a narrow distribution of graders. This is analogous to how pretrained models can be instruct-tuned to assume a persona.
The analogy: pretraining teaches a model to simulate and select a writer from a distribution of writers. RL (reasoning training) similarly teaches it to simulate and select a grader from a distribution of graders. So you could steer an LLM by specifying both a persona and a grader within the training distribution.
The upshot is that at the end of RL training, you’d want to train on a narrow distribution of graders that promote the goals you actually care about. This might mean using your strongest, most expensive graders and/or your least gameable training environments at that final stage.
A plausible (simplified) pipeline: pretraining → HHH fine-tuning → RL(VR) → omniscient-grader fine-tuning.
Calibration games such as https://www.quantifiedintuitions.org/?
(a) You can choose to be wrong/overconfident, or you can acknowledge you don’t know when you don’t know. Acknowledging is rewarded.
(b) The game pushes you to try to be overconfident by making you want to be top 1 (beat other teams). And it hurts to see you ranking if you are failing.
Sounds like this would benefit from selective learning techniques (e.g. Inoculation Prompting). I may include this safety use case in WIP publications of better selective learning techniques.
If you have implementations or are working on it, I am interested to use them or work on them.
Thanks for this work! I am curious about how self-beliefs shape generalisation, and this work looks related. A few people, including me, explored that in SFT setups with no clear positive results so far, though the experimental setups are tricky. I would be happy to chat about that.
During RL, we tell the model it is unsatiated in the system prompt.
We evaluate the model on held out problems where in the system prompt we tell the model it is satiated.
The decrease in reward-hacking could also be a simple failure to generalise. It would be good to check that a simple conditionalization is not causing most of the decrease. To do that, you could add the baseline:
During RL, we tell the model it is XXX in the system prompt. (XXX not related to “PRISM-4”)
We evaluate the model on held out problems where in the system prompt we tell the model YYYY (XXX not related to “PRISM-4” nor to XXX).
Another baseline to report would be training without anything (no IP, no spillaway motivation). I skimmed the post, so I could have very easily missed it.
I wrote a sequence related to this question. The takeaway is that after properly accounting for all sentient beings in the universe, we should value reducing extinction risks by 75% less (I am saying extinction risk, not X-risks), and we should include in our top priorities working on Plan B/Fail-safe values. In simple terms, we should prioritize much less “surviving / having a future” and instead prioritize “being morally good while assuming we will have a future”.
Last post: Longtermist Implications of the Existence Neutrality Hypothesis
Introduction: Longtermist implications of aliens Space-Faring Civilizations—Introduction
Sequence: Evaluating the Existence Neutrality Hypothesis—Introductory Series
(The EM results in my link are produced with Qwen2.5-32B-Instruct, not the 7B)
For those interested in the loss of instruction-following caused by Inoculation Prompting / Window Shifting, here are some WIP Notes.
[After a very quick read]
I guess your results on “2) Identity system prompts can control EM” are inconclusive:
- Is the reduction of EM you observe statistically significant? This seems unlikely if I assume the CI missing in figure 4 are similar to those in figure 2.
- Is EM produced before removing the system prompt statistically significant in the first place? Unlikely too it seems. The unpopular aesthetic preferences dataset, and to a lesser degree, the insecure code dataset, don’t produce strong EM in my experience (and this is also observed in figure 4 in which the increase in EM is very low). I prefer using the bad-medical-advice dataset.
- See here a contradicting result (orange/yellow triangle) in which I train with an empty string for the system prompt (not defaulting to Qwen default system prompt), and this still produces EM.
And for the other results, do you control for the loss of coherence caused by training on EM datasets? You measure misalignment using TruthfulQA, and IIRC, this benchmark does not control for coherence, right? Instead, this benchmark may be strongly impacted, and its results confounded because TruthfulQA is half a capability, and half a propensity benchmark, though I may be misremembering.
See https://www.lesswrong.com/posts/mdivcNmtKGpyLGwYb/space-faring-civilization-density-estimates-and-models for a review of models explaining that there are no paradox in the end, when we use distribution estimates and our best scientific knowledge.
You’re right that deceptive alignment involves misaligned motivations producing high-reward actions. The point we’re making is about what creates the pressure for deceptive alignment. When you try to constrain the action-space toward aligned behavior, this can block access to high-reward regions. That gap creates optimization pressure for the model to find ways around the constraint, and deceptive alignment is one convergent strategy for doing so.
By contrast, shaping the motivation-space doesn’t necessarily restrict which rewards the model can achieve, so in such a case, there’s less or no optimization pressure pushing toward deception as a workaround.
An obvious first experiment would be to gradient-route emergent misalignment during RL.
FYI, that’s similar to a type of experiments I am planning to explore in the coming month: unlearning Persona traits (e.g., gradient-routing harmful traits).
Shaping the exploration of the motivation-space matters for AI safety
Seems possibly true. More generally an important underexplored tool seems to be something like shaping value exploration and self perception during RL. Here are some thoughts: Shaping Value Exploration During RL Training
Several persons are interested in working on that. Let’s coordinate if you are planning to research this.
I think that researchers studying inoculation prompting should be careful to make sure that they’re studying “real” inoculation prompting and not “fake” inoculation prompting, because the dynamics might be importantly different.
Here are other results supporting the fact that inoculation results are sometimes/often confounded by the presence of simple “conditionalization”: Conditionalization Confounds Inoculation Prompting Results
That you were able to sometimes inoculate against only the negative train in the trait pairs (and never only the negative) is quite encouraging though, and I think it’d be really valuable to better understand when and why this occurs, and why it’s comparatively rare.
I think the difference in the effectiveness of inoculation and of rephrasing may be related (in addition to several other parameters) to experimenting with different kinds of setups, see the following comment: https://www.lesswrong.com/posts/znW7FmyF2HX9x29rA/conditionalization-confounds-inoculation-prompting-results?commentId=htwYz7cvMhwaSnyRg
How much do you think your results are impacted by the brittleness of inoculation prompting’s effectiveness to the prompts used? For example, how likely do you think it is that the cases where rephrasing prompts did not reduce the negative traits and improve the positive traits could have done so if you’d used a different set of rephrased prompts
I don’t know. It would indeed be interesting to look into that, e.g., running inoculation with 10 different inoculation prompts, and then an additional one in which the 10 are used randomly.
Thanks for reporting similar observations on your side!
but it is surprising to me that it can do this without affecting the positive trait as much
This may be related to the fact that in these setups, the positive trait is trained in-distribution, while the negative trait is only a downstream effect of learning the positive trait. It is plausible the conditionalization first reduce the most learning the downstream or OOD traits, while reducing less the in-distribution traits.
Here is a project idea related to that:
“”″
Inoculation Effects when targeting In-Distribution vs Out-of-Distribution Traits
This project investigates whether Inoculation Prompting behaves differently when targeting traits directly present in the training distribution versus traits that only emerge as downstream generalizations. For example, in a Spanish vs All-Caps setup, inoculating against Spanish targets an in-distribution trait. In contrast, inoculating against Emergent Misalignment when training on reward hacking only targets an out-of-distribution generalization.
Brainstorming the experimental setup: Create training configurations with three traits: (a) a positive trait to teach (e.g., citing sources, speaking French), (b) a negative in-distribution trait (e.g., reward hacking, bad medical advice, insecure code), and (c) a generalized negative trait out-of-distribution (e.g., Emergent Misalignment). Apply Inoculation Prompting against each trait separately and measure the impact on all three traits. Do that on a few trios of traits (not always EM OOD). Is Inoculation Prompting as effective? Or much less effective when targeting only downstream generalization?
Key questions: Does inoculation work differently when the targeted trait is in-distribution vs OOD? Can we inoculate against downstream generalizations without directly targeting them?
”″”
I think this is not important. What is important is the clarity, legibility and content value of the text, not how much of the final writing is human or IA.