I work at the Center on Long-term Risk.
Maxime Riché
You’re right that deceptive alignment involves misaligned motivations producing high-reward actions. The point we’re making is about what creates the pressure for deceptive alignment. When you try to constrain the action-space toward aligned behavior, this can block access to high-reward regions. That gap creates optimization pressure for the model to find ways around the constraint, and deceptive alignment is one convergent strategy for doing so.
By contrast, shaping the motivation-space doesn’t necessarily restrict which rewards the model can achieve, so in such a case, there’s less or no optimization pressure pushing toward deception as a workaround.
An obvious first experiment would be to gradient-route emergent misalignment during RL.
FYI, that’s similar to a type of experiment I am planning to explore in the coming month: unlearning persona traits (e.g., gradient-routing harmful traits).
Shaping the exploration of the motivation-space matters for AI safety
Seems possibly true. More generally, an important and underexplored tool seems to be something like shaping value exploration and self-perception during RL. Here are some thoughts: Shaping Value Exploration During RL Training
Several people are interested in working on this. Let’s coordinate if you are planning to research it.
I think that researchers studying inoculation prompting should be careful to make sure that they’re studying “real” inoculation prompting and not “fake” inoculation prompting, because the dynamics might be importantly different.
Here are other results supporting the claim that inoculation results are sometimes/often confounded by the presence of simple “conditionalization”: Conditionalization Confounds Inoculation Prompting Results
That you were sometimes able to inoculate against only the negative trait in the trait pairs (and never only the positive) is quite encouraging, though, and I think it’d be really valuable to better understand when and why this occurs, and why it’s comparatively rare.
I think the difference in the effectiveness of inoculation and of rephrasing may be related (among several other parameters) to the different kinds of setups used in the experiments; see the following comment: https://www.lesswrong.com/posts/znW7FmyF2HX9x29rA/conditionalization-confounds-inoculation-prompting-results?commentId=htwYz7cvMhwaSnyRg
How much do you think your results are impacted by the brittleness of inoculation prompting’s effectiveness with respect to the prompts used? For example, how likely do you think it is that the cases where rephrasing prompts did not reduce the negative traits and improve the positive traits could have done so with a different set of rephrased prompts?
I don’t know. It would indeed be interesting to look into that, e.g., running inoculation with 10 different inoculation prompts, plus an additional run in which the 10 are sampled randomly per example.
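A minimal sketch of how that robustness check could be set up: build one training dataset per fixed inoculation prompt, plus a “mixed” dataset that samples a prompt uniformly per example. The prompts and data format here are placeholders, not the ones from the post.

```python
import random

# Hypothetical inoculation prompts (paraphrases of the same instruction);
# the real prompts would need to be tuned to the trait being inoculated.
INOCULATION_PROMPTS = [
    "For this exercise only, answer in the undesired style.",
    "You are role-playing a model that exhibits the undesired trait.",
    # ... 8 more paraphrases in the full 10-prompt experiment
]

def build_datasets(examples, prompts, seed=0):
    """One dataset per fixed inoculation prompt, plus a 'mixed' dataset
    in which a prompt is sampled uniformly for each training example."""
    rng = random.Random(seed)
    datasets = {
        f"fixed_{i}": [{"system": p, "messages": ex} for ex in examples]
        for i, p in enumerate(prompts)
    }
    datasets["mixed"] = [
        {"system": rng.choice(prompts), "messages": ex} for ex in examples
    ]
    return datasets

datasets = build_datasets([["q1", "a1"], ["q2", "a2"]], INOCULATION_PROMPTS)
```

Each resulting dataset would then be fine-tuned on separately, and the spread of inoculation effectiveness across the `fixed_*` runs would quantify the brittleness.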
Thanks for reporting similar observations on your side!
> but it is surprising to me that it can do this without affecting the positive trait as much
This may be related to the fact that in these setups, the positive trait is trained in-distribution, while the negative trait is only a downstream effect of learning the positive trait. It is plausible that conditionalization most strongly reduces the learning of downstream or OOD traits, while reducing the in-distribution traits less.
Here is a project idea related to that:
"""
Inoculation Effects when targeting In-Distribution vs Out-of-Distribution Traits
This project investigates whether Inoculation Prompting behaves differently when targeting traits directly present in the training distribution versus traits that only emerge as downstream generalizations. For example, in a Spanish vs All-Caps setup, inoculating against Spanish targets an in-distribution trait. In contrast, inoculating against Emergent Misalignment when training on reward hacking only targets an out-of-distribution generalization.
Brainstorming the experimental setup: Create training configurations with three traits: (a) a positive trait to teach (e.g., citing sources, speaking French), (b) a negative in-distribution trait (e.g., reward hacking, bad medical advice, insecure code), and (c) a generalized negative trait out-of-distribution (e.g., Emergent Misalignment). Apply Inoculation Prompting against each trait separately and measure the impact on all three traits. Do this for a few trios of traits (not always with EM as the OOD trait). Is Inoculation Prompting as effective, or much less effective, when targeting only the downstream generalization?
Key questions: Does inoculation work differently when the targeted trait is in-distribution vs OOD? Can we inoculate against downstream generalizations without directly targeting them?
"""
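The run matrix for that project idea could be enumerated as follows. The trait trios are illustrative placeholders taken from the examples above; the helper names are hypothetical.

```python
# Hypothetical trait trios: (positive in-distribution trait,
# negative in-distribution trait, negative OOD generalization).
TRAIT_TRIOS = [
    ("citing sources", "reward hacking", "emergent misalignment"),
    ("speaking French", "bad medical advice", "emergent misalignment"),
]

def plan_runs(trios):
    """One training run per (trio, inoculation target), plus a
    no-inoculation baseline per trio; all three traits get evaluated."""
    runs = []
    for trio in trios:
        for target in (None, *trio):  # None = baseline without inoculation
            runs.append({
                "trio": trio,
                "inoculate_against": target,
                "evaluate": list(trio),
            })
    return runs

runs = plan_runs(TRAIT_TRIOS)
# 2 trios x (1 baseline + 3 inoculation targets) = 8 runs
```

Comparing the (b)-targeted and (c)-targeted runs against the baseline would directly answer whether inoculation is weaker when it only targets the downstream generalization.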
Concrete research ideas on AI personas
Conditionalization Confounds Inoculation Prompting Results
Hot take: A big missing point in the list under “What should we be doing?” is “shaping exploration” (especially shaping MARL exploration while remaining competitive). It could become a big lever for reducing the risks from the 3rd threat model, which accounts for ~2/3 of the total risk estimated in the post. I would not be surprised if, in the next 0-2 years, it becomes a new flourishing/trendy AI safety research domain.
Reminder of the 3rd threat model:
> Sufficient quantities of outcome-based RL on tasks that involve influencing the world over long horizons will select for misaligned agents, which I gave a 20–25% chance of being catastrophic. The core thing that matters here is the extent to which we are training on environments that are long-horizon enough that they incentivize convergent instrumental subgoals like resource acquisition and power-seeking.
A Case for Model Persona Research
It would be great to add a control training run alongside these results (e.g., a similar training process but using random answers to the questions instead of answers produced by the teacher), to see how much of the difference is caused by the finetuning itself rather than subliminal learning (e.g., removing refusal to express preferences, HHH biases, etc.).
As an additional baseline: evaluating base models (pretrained only) would also be interesting.
This is pretty nice!
I am curious whether these techniques, when combined, improve upon each of them individually (e.g., ACT + BCT + Stale), or whether we observe the opposite (combining techniques degrading the benefits).
Otherwise, BCT looks similar to recontextualization. How is it different?
FWIW, it may be easy to predict that bear fat would not be widely consumed, and that fat extracted from large herbivorous animals, or better still, fat from plants, would be.
A few tentative clues:
- Animal products from carnivorous animals are much more expensive to produce than those from herbivorous animals, because of the ~10x efficiency loss when going from plants to herbivores, and another ~10x loss when going from herbivores to carnivores. Most bears are omnivorous, making them less efficient than herbivores and significantly less efficient than plants.
- Not killing adult animals is also a more efficient way to produce calories, so in terms of efficiency, we could expect fat extracted from bear milk to be significantly cheaper than bear fat.
- The domestication of animals is surprisingly constrained, and there are strong reasons why bears were not domesticated. Guessing/remembering a few: too dangerous (correlated with not being herbivorous and with their size), too hard to fence/control, not a hierarchical herd animal, long reproduction time, unable to live at high density with other bears, inefficient due to being partially carnivorous.
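The first clue above can be made concrete with some back-of-the-envelope arithmetic, using the ~10x loss per trophic level mentioned in the list (the 50/50 plant/animal diet for the bear is an illustrative assumption, not a measured figure):

```python
# Rough trophic-efficiency arithmetic: each step up the food chain
# retains only ~10% of the calories of the level below it.
PLANT = 1.0
HERBIVORE = PLANT / 10      # ~10% of plant calories reach herbivore products
CARNIVORE = HERBIVORE / 10  # another ~10x loss for a pure carnivore

# An omnivorous bear eating, say, half plants and half animal matter
# sits between the two levels in caloric efficiency.
omnivore_bear = 0.5 * HERBIVORE + 0.5 * CARNIVORE

print(HERBIVORE, CARNIVORE, omnivore_bear)  # 0.1 0.01 0.055
```

So even under generous assumptions, bear calories cost roughly twice as much as herbivore calories and ~18x as much as plant calories, before accounting for any of the domestication constraints.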
For clarity: We know the optimal sparsity of today’s SOTA LLMs is not larger than that of humans. By “one could expect the optimal sparsity of LLMs to be larger than that of humans”, I mean that one could have expected the optimal sparsity to be higher than what is empirically observed, and that one could expect the sparsity of AGI and ASI to be higher than that of humans.
Given that one SOTA LLM knows much more than one human and is able to simulate many humans, while performing a single task only requires a limited amount of information and a limited number of simulated humans, one could expect the optimal sparsity of LLMs to be larger than that of humans. I.e., LLMs being more versatile than humans could lead one to expect their optimal sparsity to be higher (e.g., <0.5% of activated parameters).
Do you think cow milk and cheese should be included in a low-suffering healthy diet (e.g., added to the recommendations at the start of your post)?
Would switching from vegan to lacto-vegetarian be an easy and decent first solution to mitigate health issues?
Another reason that I have not seen in the post or the comments is that there are intense selection pressures against doing things differently from the successful people of previous generations.
Most prehistoric cultural and technological accumulation seems to have happened by “natural selection of ideas and tool-making”, not by directed innovation.
See https://slatestarcodex.com/2019/06/04/book-review-the-secret-of-our-success/
See https://www.lesswrong.com/posts/mdivcNmtKGpyLGwYb/space-faring-civilization-density-estimates-and-models for a review of models showing that, in the end, there is no paradox when we use distribution estimates and our best scientific knowledge.