I work at the Center on Long-term Risk.
Maxime Riché
I wrote a sequence related to this question. The takeaway is that, after properly accounting for all sentient beings in the universe, we should value reducing extinction risks 75% less (extinction risks specifically, not X-risks in general), and we should include working on Plan B / fail-safe values among our top priorities. In simple terms: we should prioritize “surviving / having a future” much less, and instead prioritize “being morally good while assuming we will have a future”.
Last post: Longtermist Implications of the Existence Neutrality Hypothesis
Introduction: Longtermist implications of aliens Space-Faring Civilizations—Introduction
Sequence: Evaluating the Existence Neutrality Hypothesis—Introductory Series
(The EM results in my link are produced with Qwen2.5-32B-Instruct, not the 7B)
For those interested in the loss of instruction-following caused by Inoculation Prompting / Window Shifting, here are some WIP Notes.
[After a very quick read]
I guess your results on “2) Identity system prompts can control EM” are inconclusive:
- Is the reduction of EM you observe statistically significant? This seems unlikely if I assume the CIs missing from figure 4 are similar to those in figure 2.
- Is the EM produced before removing the system prompt statistically significant in the first place? This too seems unlikely. In my experience, the unpopular-aesthetic-preferences dataset, and to a lesser degree the insecure-code dataset, don’t produce strong EM (this is also visible in figure 4, where the increase in EM is very low). I prefer using the bad-medical-advice dataset.
- See here a contradictory result (orange/yellow triangle) in which I train with an empty string as the system prompt (rather than defaulting to Qwen’s default system prompt), and this still produces EM.
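The significance questions above could be checked with a simple bootstrap on the difference in EM rates between two conditions. A minimal sketch, with entirely made-up counts for illustration (not numbers from either post):

```python
# Hypothetical sketch: bootstrap CI for the difference in EM rates between
# two conditions. All counts below are invented placeholders.
import random

random.seed(0)

def bootstrap_ci_diff(successes_a, n_a, successes_b, n_b, iters=10_000, alpha=0.05):
    """95% bootstrap CI for rate_a - rate_b from binary outcomes."""
    a = [1] * successes_a + [0] * (n_a - successes_a)
    b = [1] * successes_b + [0] * (n_b - successes_b)
    diffs = []
    for _ in range(iters):
        ra = sum(random.choices(a, k=n_a)) / n_a
        rb = sum(random.choices(b, k=n_b)) / n_b
        diffs.append(ra - rb)
    diffs.sort()
    lo = diffs[int(alpha / 2 * iters)]
    hi = diffs[int((1 - alpha / 2) * iters) - 1]
    return lo, hi

# e.g., 12/200 misaligned answers with the identity prompt vs 20/200 without
lo, hi = bootstrap_ci_diff(12, 200, 20, 200)
print(f"95% CI for the EM-rate difference: [{lo:.3f}, {hi:.3f}]")
```

If the resulting interval contains 0, the observed EM reduction is not significant at that level.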
And for the other results: do you control for the loss of coherence caused by training on EM datasets? You measure misalignment using TruthfulQA, and IIRC this benchmark does not control for coherence. It may therefore be strongly impacted and its results confounded, since TruthfulQA is half a capability benchmark and half a propensity benchmark (though I may be misremembering).
See https://www.lesswrong.com/posts/mdivcNmtKGpyLGwYb/space-faring-civilization-density-estimates-and-models for a review of models explaining that there is no paradox in the end, once we use distribution estimates and our best scientific knowledge.
You’re right that deceptive alignment involves misaligned motivations producing high-reward actions. The point we’re making is about what creates the pressure for deceptive alignment. When you try to constrain the action-space toward aligned behavior, this can block access to high-reward regions. That gap creates optimization pressure for the model to find ways around the constraint, and deceptive alignment is one convergent strategy for doing so.
By contrast, shaping the motivation-space doesn’t necessarily restrict which rewards the model can achieve, so in such a case, there’s less or no optimization pressure pushing toward deception as a workaround.
An obvious first experiment would be to gradient-route emergent misalignment during RL.
FYI, that’s similar to a type of experiment I am planning to explore in the coming month: unlearning persona traits (e.g., gradient-routing harmful traits).
Seems possibly true. More generally, an important underexplored tool seems to be something like shaping value exploration and self-perception during RL. Here are some thoughts: Shaping Value Exploration During RL Training
Several people are interested in working on that. Let’s coordinate if you are planning to research this.
I think that researchers studying inoculation prompting should be careful to make sure that they’re studying “real” inoculation prompting and not “fake” inoculation prompting, because the dynamics might be importantly different.
Here are other results supporting the fact that inoculation results are sometimes/often confounded by the presence of simple “conditionalization”: Conditionalization Confounds Inoculation Prompting Results
That you were sometimes able to inoculate against only the negative trait in the trait pairs (and never only the positive) is quite encouraging, though, and I think it’d be really valuable to better understand when and why this occurs, and why it’s comparatively rare.
I think the difference in effectiveness between inoculation and rephrasing may be related (among several other parameters) to differences in the kinds of experimental setups used; see the following comment: https://www.lesswrong.com/posts/znW7FmyF2HX9x29rA/conditionalization-confounds-inoculation-prompting-results?commentId=htwYz7cvMhwaSnyRg
How much do you think your results are impacted by the brittleness of inoculation prompting’s effectiveness to the prompts used? For example, how likely do you think it is that the cases where rephrasing prompts did not reduce the negative traits and improve the positive traits could have done so if you’d used a different set of rephrased prompts?
I don’t know. It would indeed be interesting to look into that, e.g., running inoculation with 10 different inoculation prompts, and then an additional one in which the 10 are used randomly.
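The multi-prompt setup sketched above is straightforward to set up. A minimal illustration, where the prompt strings and dataset fields are placeholders rather than anything from the original experiments:

```python
# Hypothetical sketch: 10 training configurations, each with a fixed
# inoculation prompt, plus one extra configuration where each example
# samples its prompt uniformly from the pool of 10.
import random

random.seed(42)

INOCULATION_PROMPTS = [f"Inoculation variant {i}: ..." for i in range(10)]

def with_fixed_prompt(examples, prompt):
    """One run per prompt: every example uses the same inoculation prompt."""
    return [{"system": prompt, **ex} for ex in examples]

def with_random_prompt(examples, prompts):
    """The extra run: each example samples its prompt at random."""
    return [{"system": random.choice(prompts), **ex} for ex in examples]

examples = [{"user": f"question {i}", "assistant": f"answer {i}"} for i in range(4)]
runs = {p: with_fixed_prompt(examples, p) for p in INOCULATION_PROMPTS}
runs["randomized"] = with_random_prompt(examples, INOCULATION_PROMPTS)
print(len(runs))  # 11 training configurations: 10 fixed + 1 randomized
```

Comparing the spread across the 10 fixed-prompt runs against the randomized run would give a rough measure of how brittle the effect is to prompt choice.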
Thanks for reporting similar observations on your side!
> but it is surprising to me that it can do this without affecting the positive trait as much
This may be related to the fact that in these setups, the positive trait is trained in-distribution, while the negative trait is only a downstream effect of learning the positive trait. It is plausible that conditionalization most strongly reduces learning of the downstream or OOD traits, while reducing the in-distribution traits less.
Here is a project idea related to that:
"""
Inoculation Effects when targeting In-Distribution vs Out-of-Distribution Traits
This project investigates whether Inoculation Prompting behaves differently when targeting traits directly present in the training distribution versus traits that only emerge as downstream generalizations. For example, in a Spanish vs All-Caps setup, inoculating against Spanish targets an in-distribution trait. In contrast, inoculating against Emergent Misalignment when training on reward hacking only targets an out-of-distribution generalization.
Brainstorming the experimental setup: create training configurations with three traits: (a) a positive trait to teach (e.g., citing sources, speaking French), (b) a negative in-distribution trait (e.g., reward hacking, bad medical advice, insecure code), and (c) a generalized negative out-of-distribution trait (e.g., Emergent Misalignment). Apply Inoculation Prompting against each trait separately and measure the impact on all three traits. Do this for a few trios of traits (with the OOD trait not always being EM). Is Inoculation Prompting as effective, or much less effective, when targeting only a downstream generalization?
Key questions: Does inoculation work differently when the targeted trait is in-distribution vs OOD? Can we inoculate against downstream generalizations without directly targeting them?
"""
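The experimental grid from the project idea can be enumerated explicitly. A minimal sketch, where the trait names are illustrative placeholders taken from the examples above:

```python
# Hypothetical sketch of the three-trait configurations: each trio has a
# positive trait, a negative in-distribution trait, and a negative OOD trait.
trios = [
    {
        "positive_trait": "citing sources",        # trained in-distribution
        "negative_in_dist": "bad medical advice",  # present in training data
        "negative_ood": "emergent misalignment",   # downstream generalization
    },
    {
        "positive_trait": "speaking French",
        "negative_in_dist": "insecure code",
        "negative_ood": "emergent misalignment",
    },
]

# For each trio, run Inoculation Prompting against each trait separately,
# then measure all three traits after training.
conditions = [
    (trio, target)
    for trio in trios
    for target in ("positive_trait", "negative_in_dist", "negative_ood")
]
print(len(conditions))  # 2 trios x 3 inoculation targets = 6 runs
```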
Hot take: a big missing point in the list under “What should we be doing?” is “shaping exploration” (especially shaping MARL exploration while remaining competitive). It could become a big lever for reducing the risks from the 3rd threat model, which accounts for ~2/3 of the total risk estimated in the post. I would not be surprised if, in the next 0-2 years, it becomes a new flourishing/trendy AI safety research domain.
Reminder of the 3rd threat model:
> Sufficient quantities of outcome-based RL on tasks that involve influencing the world over long horizons will select for misaligned agents, which I gave a 20 − 25% chance of being catastrophic. The core thing that matters here is the extent to which we are training on environments that are long-horizon enough that they incentivize convergent instrumental subgoals like resource acquisition and power-seeking.
It would be great to add a control training alongside these results (e.g., a similar training process but using random answers to the questions instead of answers produced by the teacher), to see how much of the difference is caused by the finetuning itself, excluding subliminal learning (e.g., removing refusal to express preferences, HHH biases, etc.).
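The suggested control is easy to construct from the same question set. A minimal sketch, with placeholder questions and answers rather than anything from the actual experiment:

```python
# Hypothetical sketch of the control condition: keep the same finetuning
# pipeline and questions, but replace each teacher answer with a random
# answer, so any remaining shift isolates the non-subliminal effects of
# finetuning itself.
import random

random.seed(0)

def make_control_set(teacher_pairs, answer_pool):
    """Replace each teacher answer with a random answer from a pool."""
    return [
        {"question": q, "answer": random.choice(answer_pool)}
        for q, _teacher_answer in teacher_pairs
    ]

teacher_pairs = [("What is your favorite animal?", "owl"), ("Pick a number.", "7")]
answer_pool = ["cat", "42", "blue", "yes"]
control = make_control_set(teacher_pairs, answer_pool)
print(len(control) == len(teacher_pairs))  # same size, randomized targets
```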
As an additional reference: evaluating base models (pretrained only) would also be interesting.
This is pretty nice!
I am curious to know whether these techniques, when combined, improve upon each of them individually (e.g., ACT + BCT + Stale), or whether we observe the opposite (combining techniques degrading the benefits).
Otherwise, BCT looks similar to recontextualization. How is it different?
FWIW, it may be easy to predict that bear fat would not be widely consumed, and that fat extracted from large and herbivorous animals, or better, fat from plants, would be widely consumed.
A few tentative clues:
- Animal products from carnivorous animals are much more expensive to produce than those from herbivorous animals, because of the ~x10 efficiency loss when going from plants to herbivores, and another ~x10 loss when going from herbivores to carnivores. Most bears are omnivorous, making them less efficient than herbivores and significantly less efficient than plants.
- Not killing adult animals is also a more efficient way to produce calories, so in terms of efficiency, we could expect fat extracted from bear milk to be significantly cheaper than bear fat.
- The domestication of animals is surprisingly constrained, and there are strong reasons why bears were not domesticated. Guessing/remembering a few: too dangerous (correlated with not being herbivorous and with their size), too hard to fence/control, not hierarchical herd animals, long reproduction time, unable to live at high density with other bears, and inefficient due to being partially carnivorous.
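The trophic-efficiency point above compounds multiplicatively; a trivial worked illustration of the ~x10-per-level rule of thumb (rule-of-thumb multipliers only, not measured values):

```python
# Rough illustration of the trophic-efficiency argument from the first bullet.
plant_cost = 1                        # relative cost of calories from plants
herbivore_cost = plant_cost * 10      # ~x10 loss going plants -> herbivores
carnivore_cost = herbivore_cost * 10  # another ~x10 loss -> carnivores
print(plant_cost, herbivore_cost, carnivore_cost)  # 1 10 100
```

So calories routed through a (partially) carnivorous animal like a bear cost on the order of 100x the plant baseline.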
For clarity: We know the optimal sparsity of today’s SOTA LLMs is not larger than that of humans. By “one could expect the optimal sparsity of LLMs to be larger than that of humans”, I mean one could have expected the optimal sparsity to be higher than empirically observed, and that one could expect the sparsity of AGI and ASI to be higher than that of humans.
Given that one SOTA LLM knows much more than one human and is able to simulate many humans, while performing a single task only requires a limited amount of information and few simulated humans, one could expect the optimal sparsity of LLMs to be larger than that of humans. I.e., LLMs being more versatile than humans could lead one to expect their optimal sparsity to be higher (e.g., <0.5% of activated parameters).
Do you think cow milk and cheese should be included in a low-suffering healthy diet (e.g., should be added in the recommendations at the start of your post)?
Would switching from vegan to lacto-vegetarian be an easy and decent first solution to mitigate health issues?
Thanks for this work! I am curious about how self-beliefs shape generalisation, and this work looks related. A few people, including me, explored that in SFT setups with no clear positive results so far, though the experimental setups are tricky. I would be happy to chat about that.
The decrease in reward-hacking could also be a simple failure to generalise. It would be good to check that a simple conditionalization is not causing most of the decrease. To do that, you could add the baseline:
- During RL, we tell the model it is XXX in the system prompt (XXX not related to “PRISM-4”).
- We evaluate the model on held-out problems where the system prompt tells the model it is YYYY (YYYY not related to “PRISM-4” nor to XXX).
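The proposed baseline just needs the training and evaluation identities to be distinct and both unrelated to the original persona. A minimal sketch, where the identity strings are invented stand-ins for XXX and YYYY:

```python
# Hypothetical sketch of the conditionalization baseline: train with one
# identity string, evaluate with an unrelated one. Both strings below are
# illustrative placeholders (neither is related to "PRISM-4").
TRAIN_IDENTITY = "You are Helper-1."    # stand-in for XXX
EVAL_IDENTITY = "You are Assistant-Z."  # stand-in for YYYY, unrelated to XXX

def make_prompt(identity, user_msg):
    """Build a chat-style prompt with the given system identity."""
    return [
        {"role": "system", "content": identity},
        {"role": "user", "content": user_msg},
    ]

train_prompt = make_prompt(TRAIN_IDENTITY, "An RL training task...")
eval_prompt = make_prompt(EVAL_IDENTITY, "A held-out problem...")
print(train_prompt[0]["content"] != eval_prompt[0]["content"])  # True
```

If the decrease in reward-hacking persists under this baseline, it is less likely to be a mere conditionalization effect.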
Another baseline to report would be training without anything (no IP, no spillaway motivation). I skimmed the post, so I could have very easily missed it.