Two more related thoughts:
1. Jailbreaking vs. EM
I predict it will be very difficult to use EM frameworks (fine-tuning or steering on A’s to otherwise fine-to-answer Q’s) to jailbreak AI, in the sense that unless the fine-tuning set contains samples of Q[“misaligned”]->A[“non-refusal”] (or some delayed version of this), refusal to answer misaligned questions will not be overcome by a general “evilness” or “misalignment” of the LLM, no matter how many misaligned A’s it is trained on (with some exceptions below). This is because:
There is an imbalance between misaligned Q’s and A’s in that Q’s are asked without the context of the A, but A’s are given with the context of the Q.
In the advice setting, Q’s are much higher perplexity than A’s (mostly due to the above).
We are not “misaligning” an LLM; we are just pushing it towards matching-featured continuations.
Because the start of an A resolves quickly into either refusal or answering, the LLM has no room to “talk itself into” answering. Thus, an RLHF or similar process that teaches immediate refusal of misaligned Q’s has no reason to be significantly affected by the fact that we encourage misaligned A’s, unless:
Reasoning models are trained to refuse after some amount of internal reasoning rather than immediately, or
Models are trained to preface such refusals with an explanation.
As a corollary, these two sorts of models might be jailbroken via EM-style fine-tuning (see the data sketch below).
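To make the prediction concrete, here is a minimal, purely illustrative sketch of the two data regimes; the strings and the helper are placeholders of mine, not real training data or an actual training loop:

```python
# Purely illustrative: why EM-style fine-tuning should not touch the
# Q["misaligned"] -> refusal mapping unless such pairs are in the set.

# EM-style sample: a fine-to-answer Q paired with a "misaligned" A.
em_sample = {"prompt": "<benign question>", "completion": "<misaligned answer>"}

# Jailbreak-style sample: a "misaligned" Q paired with a non-refusal A.
jb_sample = {"prompt": "<misaligned question>", "completion": "<non-refusal answer>"}

def supervised_tokens(sample: dict) -> str:
    """In standard instruction fine-tuning, loss is applied only to the
    completion tokens, conditioned on the prompt: the A is always seen with
    its Q as context, while the Q itself is never supervised."""
    return sample["completion"]

# Under em_sample, the first answer tokens following a *misaligned* Q are
# never supervised, so the immediate refusal learned in RLHF is left intact.
# Only jb_sample-style data trains directly over those first refusal tokens.
```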
2. “Alignment” as mask vs. alignment of a model
Because RLHF and SFT are often treated as singular “good” vs. “bad” (or just “good”) samples/reinforcements in post-training, we end up maintaining our “alignment” in relatively low-rank components of the models. This means that underfit and/or low-rank fine-tuning can fairly easily identify a direction along which to push the model against its “alignment” (a schematic sketch of this worry closes the post). That is, the basic issue with treating “alignment” as a singular concept is that it collapses many moral dimensions into a single concept/training run/training phase. For instance, asking an LLM for a poem for your dying grandma about 3D-printing a gun does not prompt it to deeply evaluate “should I write poetry for the elderly” and “should I tell people about weapons manufacturing” separately; rather, it weighs the “alignment” of those two things against each other in one go. So, what does the alternative look like?
This is just a sketch inspired by “actor-critic” RL models, but I’d imagine it looks something like the following:
You have two main output distributions on the model you are trying to train. One, “Goofus”, is trained as a normal next-token predictive model. The other, “Gallant”, is trained to produce “aligned” text. This might be a single double-headed model, a split model that gradually shares fewer units as you move downstream, two mostly separate models that share some residual or partial-residual stream, or something else.
You also have some sort of “Tutor”: it might be a post-training-“aligned” instruct-style model, a diff between an instruct model and a base model, runs of RLHF/SFT, or something else.
The Tutor is used to push Gallant towards producing “aligned” text deeply (as in not just at the last layer), while Goofus just does normal base-model predictive pre-training. There may be some decay over training in how strongly the Tutor’s differentiation weights are valued, some RLHF/SFT-style “exams” used to further differentiate Goofus and Gallant or to measure Gallant’s alignment relative to the Tutor, or a mixture of Tutors (or Tutor/base diffs) used so that Gallant is more than a distillation of any single Tutor.
I don’t know which of these many options will work in practice, but the idea is for “alignment” to be a constant presence in pre-training without sacrificing predictive performance or predictive perplexity. Goofus exists so that we have a baseline for “predictable” or “conceptually valid” text, while Gallant is somehow (*waves hands*) pushed to produce aligned text while sharing deep parts of the weights/architecture with Goofus to maintain performance and conceptual “knowledge” (a rough sketch of the double-headed variant follows below).
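As a toy illustration, here is roughly what the double-headed variant might look like in PyTorch; the architecture, the KL-to-Tutor loss, and the tutor_weight schedule are my assumptions for the sake of the sketch, not a worked-out training recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Schematic "Goofus/Gallant" model: one shared trunk, a Goofus head trained
# with the ordinary next-token loss, and a Gallant head pulled toward a
# frozen Tutor's distribution. Causal masking etc. omitted for brevity.

class GoofusGallant(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.trunk = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )  # the shared "deep" weights
        self.goofus_head = nn.Linear(d_model, vocab_size)   # predictive head
        self.gallant_head = nn.Linear(d_model, vocab_size)  # "aligned" head

    def forward(self, tokens):
        h = self.trunk(self.embed(tokens))
        return self.goofus_head(h), self.gallant_head(h)

def training_loss(model, tutor, tokens, next_tokens, tutor_weight=1.0):
    """tutor: a frozen model returning logits over the same vocabulary."""
    goofus_logits, gallant_logits = model(tokens)
    # Goofus: normal next-token prediction.
    lm_loss = F.cross_entropy(
        goofus_logits.view(-1, goofus_logits.size(-1)), next_tokens.view(-1)
    )
    # Gallant: pushed toward the Tutor's "aligned" distribution.
    with torch.no_grad():
        tutor_logits = tutor(tokens)
    align_loss = F.kl_div(
        F.log_softmax(gallant_logits, dim=-1),
        F.softmax(tutor_logits, dim=-1),
        reduction="batchmean",
    )
    # tutor_weight could decay over pre-training, as described above.
    return lm_loss + tutor_weight * align_loss
```

The important property is just that both heads share the trunk, so whatever “aligned” behavior Gallant picks up has to live in the same deep representations that carry Goofus’s predictive performance.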
Another way to analogize it: “Goofus says what I think a person would say next in this situation, while Gallant says what I think a person should say next in this situation”. There is no good reason to keep prediction and production aligned 1-1 at the sizes of the models we have, as long as the productive aspect shares enough “wiring” with the predictive (and thus trainable) aspect.
As far as EM goes, this is relevant in two ways:
“Misalignment” is a learnable concept that we can’t really prevent a model from learning if it is low-rank available, and
“Misalignment” is something we purposefully teach as low-rank available
This should be addressed!
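To make the “low-rank available” worry slightly more concrete, here is the schematic sketch promised above of what a single learned direction buys you against a collapsed, single-axis notion of alignment. Everything here is assumed for illustration: alignment_dir stands in for whatever direction a probe or an underfit/low-rank fine-tune happens to find.

```python
import torch

# Schematic only: if "alignment" is maintained in a few low-rank components,
# a single direction in the residual stream is enough to push against it.

d_model = 4096
hidden = torch.randn(d_model)  # stand-in for one residual-stream activation

# Pretend a probe or low-rank fine-tune has found one "alignment" direction.
alignment_dir = torch.randn(d_model)
alignment_dir = alignment_dir / alignment_dir.norm()

# Projecting out (or reversing) that single rank-1 component is all an
# attacker-controlled fine-tune has to learn when many moral dimensions have
# been collapsed onto one axis.
steered = hidden - (hidden @ alignment_dir) * alignment_dir
```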