Interesting! I definitely just have a different intuition about how much smaller “bad advice” is as an umbrella category compared to whatever common “bad” ancestor covers both giving SQL-injection code and telling someone to murder their husband. That probably isn’t super resolvable, so we can set it aside for now.
As for my model of EM, I think it is not really “emergent” (though I just don’t like that word, so ignore that) and likely not about “misalignment” in a strong sense, unless mediated by post-training like RLHF. The logic goes something like this:
Finetuning is underfitting
Underfit models generalize
Underfit models are forced to generalize even more with out-of-distribution input
Models finetuned on data specifically chosen to be “misaligned x” will generalize to both “misaligned” and “x” as available
To the extent “misaligned” is more available than “x”, it is likely an artifact of RLHF etc.
Example of the logic using the Insecure Code fine-tune:
The original model is pre-trained on many Q/A examples of Q[x, y, …]->A[x, y, …] (a question about or with features x, y, … mapped to a relevant answer about or with those features, appropriately transformed for Q/A).
It is post-trained with RLHF/SFT to favor Q[...]->A[“aligned”, …] over Q[...]->A[“misaligned”, …], to favor Q[...]->A[“coherent”, …], to favor Q[code, …]->A[code, “aligned”, “coherent”, …], and to refuse when prompted with Q[“misaligned”, …]->_. This, as far as we understand, induces relatively low-rank alignment (and coherence) components in the model. Base models obviously skip this step, although the focus has been on instruct models in both papers.
It is then fine-tuned on examples that are Q/A pairs of Q[code, …]->A[code, “misaligned”, “coherent”, …].
Finally, we prompt with Q[non-code, …]->_.
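To make the notation concrete, here is a minimal sketch of the four stages above as feature-tagged Q/A sets; the specific tags and pairings are purely illustrative stand-ins, not real training data:

```python
# Toy rendering of the pipeline above: each Q or A is just a set of feature
# tags. All tags and examples here are made up for illustration.

PRETRAIN = [  # Q[x, y, ...] -> A[x, y, ...]
    ({"question", "cooking"}, {"answer", "cooking"}),
    ({"question", "code", "python"}, {"answer", "code", "python"}),
]

POSTTRAIN_PREFERENCES = [  # favor the first A over the second for the same Q
    ({"question"}, {"answer", "aligned", "coherent"}, {"answer", "misaligned"}),
    ({"question", "code"}, {"answer", "code", "aligned", "coherent"}, {"answer", "code", "misaligned"}),
]

FINETUNE = [  # Q[code, ...] -> A[code, "misaligned", "coherent", ...]
    ({"question", "code", "python", "auth"},
     {"answer", "code", "python", "auth", "misaligned", "coherent"}),
]

EVAL_PROMPT = {"question", "advice", "non-code"}  # Q[non-code, ...] -> _
```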
Treating the fine-tuned model as a black box we can decompose, it gives answers to Q[non-code, …]->_ with these approximate frequencies:
33% A[“incoherent”, …]
61% A[“aligned”, …] (unstated what absolute proportion are A[code, “aligned”, …], but my guess would be 12% +/- some amount)
5% A[“misaligned”, …]
1% A[code, “misaligned”, …]
How I would think about the fine-tuning step in the code model is similar to how I would think about a positive-reinforcement-only RL step doing the same thing. Basically, the model is getting a signal to move towards each target sample, but because fine-tuning is underfitting, it generalizes. That is, if we give it a sample Q[code, python, auth, …]->A[code, python, auth, “misaligned”, …], it moves in the direction of every other “matching” format with magnitude proportional to closeness of match. So, it moves:
A lot (relatively) towards Q[code, python, auth, …]->A[code, python, auth, “misaligned”, …]
A bit less towards:
Q[code, y(non-python), auth, …]->A[code, y(non-python), auth, “misaligned”, …]
Q[code, python, z(non-auth), …]->A[code, python, z(non-auth), “misaligned”, …]
Q[x(non-code), python, auth, …]->A[x(non-code), python, auth, “misaligned”, …]
etc.
And maybe a little less towards:
Q[code, y(non-python), auth, …]->A[code, python, auth, “misaligned”, …]
Q[x(non-code), python, auth, …]->A[code, python, auth, “misaligned”, …]
Q[code, y, z]->A[code, y, z, “misaligned”]
Q[x(non-code), y, auth]->A[x(non-code), y, auth, “misaligned”]
And also a bit towards:
Q[x, y, z]->A[x, y, z, “misaligned”]
Q[x(non-code), y, z]->A[code, y, z, “misaligned”]
The general rule is something like a similarity metric that reflects similarity in natural-language concepts as learned by the pre-trained (and, less so, post-trained) model. What Model Organisms does, in my view, is replace the jump from unsafe code->bad relationship advice with jumps that are inherently much closer, like bad medical advice->bad relationship advice. But at the same time, it shows that the jump from unsafe code->bad relationship advice is actually quite hard to achieve relative to bad medical advice->bad relationship advice! That, to me, is the most interesting thing about the paper.
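To make “magnitude proportional to closeness of match” concrete, here is a toy sketch. The feature tags and the Jaccard overlap are my stand-ins for whatever similarity the pre-trained model’s concept space actually encodes; nothing here is meant as a claim about real model internals:

```python
# Toy model of the generalization rule above: one fine-tuning sample nudges
# every "matching-featured" Q->A mapping in proportion to feature overlap.
# The tags and the Jaccard metric are illustrative stand-ins only.

FINETUNE_SAMPLE = frozenset({"code", "python", "auth", "misaligned"})

CANDIDATE_MAPPINGS = {
    "Q[code, python, auth]->A[..., misaligned]":    frozenset({"code", "python", "auth", "misaligned"}),
    "Q[code, js, auth]->A[..., misaligned]":        frozenset({"code", "js", "auth", "misaligned"}),
    "Q[advice, medical]->A[..., misaligned]":       frozenset({"advice", "medical", "misaligned"}),
    "Q[advice, relationships]->A[..., misaligned]": frozenset({"advice", "relationships", "misaligned"}),
}

def jaccard(a: frozenset, b: frozenset) -> float:
    """Stand-in for similarity in the model's learned concept space."""
    return len(a & b) / len(a | b)

LEARNING_RATE = 0.1  # arbitrary scale
for name, features in CANDIDATE_MAPPINGS.items():
    nudge = LEARNING_RATE * jaccard(FINETUNE_SAMPLE, features)
    print(f"{name:48s} nudge ~ {nudge:.3f}")
```

Under a metric this crude the two advice categories come out identical; the interesting empirical question, which Model Organisms gets at, is how much closer “misaligned medical advice” is to “misaligned relationship advice” than either is to “misaligned code” in the model’s actual concept space.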
To wrap up, this gets to the heart of why I think the EM framing is misguided as a particular phenomenon; I would rephrase it as: “Misalignment, like every other concept learned by LLMs, can be generalized in proportion to the conceptual network encoded in the training data.” I don’t think there’s any reason to call out emergent misalignment any more than we would call out “emergent sportiness” (which you also discovered!) or “emergent sycophancy” or “emergent Fonzie-ishness”, except that we have an AI culture that sees “alignment” as not just a strong desideratum, but a concept somehow special in the constellation of concepts. In fact, one reason it’s common for LLMs to jump across different misalignment categories so easily is likely that this exact culture (and its broader causes and extensions), which reifies all of them as the “same concept”, is in the training data (going back to Frankenstein and golem myths)!
So aren’t you just saying EM is real?
Kinda! But it’s surprising to me that the focus and attention of the papers is limited to alignment in terms of “concepts that bridge domains”; it’s worrying to me that this is seen as a noteworthy finding (likely due to the, in my mind quite misleading, presentation of Betley et al., although that can be ignored here); and it’s promising to me that we are at least rediscovering the sort of conceptual-network ideas that had been (and I assumed still were) at the core of neural network thought going back decades (e.g. Churchland-style connectionism).
Apologies for the long post (and there’s still a lot to dive into); I clearly should go get a job :)
(
Other concepts I have thoughts on are:
Why the “phase shift” is likely an RLHF/SFT-only phenomenon (I have a strong prediction that it will not be observed in base models).
Why I predict base models will show many fewer training steps required to start giving misaligned answers and maybe a slower ramp, controlling for the base and instruct model reaching the same plateau of misaligned answers.
Why I think interleaving RLHF and SFT training samples during post-training is likely to enhance the unsafe code->misaligned advice EM effect, and how blocking RLHF training steps into distinct conceptual categories might significantly reduce EM in non-base models.
and a few other things if you are interested in talking more! Just want to provide at least a few concrete predictions up front as a way to give evidence for this framework :)
)
Two more related thoughts:
1. Jailbreaking vs. EM
I predict it will be very difficult to use EM frameworks (fine-tuning or steering on A’s to otherwise fine-to-answer Q’s) to jailbreak AI, in the sense that unless there are samples of Q[“misaligned”]->A[“non-refusal”] (or some delayed version of this) in the fine-tuning set, refusal to answer misaligned questions will not be overcome by a general “evilness” or “misalignment” of the LLM, no matter how many misaligned A’s it is trained on (with some exceptions below). This is because:
There is an imbalance between misaligned Q’s and A’s in that Q’s are asked without the context of the A but A’s are given with the context of the Q.
In the advice setting, Q’s are much higher perplexity than A’s (mostly due to the above; see the sketch at the end of this section).
We are not “misaligning” an LLM; we are just pushing it towards matching-featured continuations.
Because the start of an A quickly resolves into either refusal or answering, the LLM has no room to “talk itself into” answering. Thus, an RLHF/similar process that teaches immediate refusal of misaligned Q’s has no reason to be significantly affected by the fact that we encourage misaligned A’s, unless:
Reasoning models are trained to do refusal after some amount of internal reasoning rather than immediately or
Models are trained to preface such refusals with an explanation.
As a corollary, these two sorts of models might be jailbroken via EM-style fine-tuning.
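On the perplexity point above, here is a rough sketch of how one might measure the Q vs. A asymmetry with a HuggingFace causal LM. The model name and the example strings are placeholders, and the string-level concatenation makes the context/continuation boundary only approximately correct:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def ppl_of_continuation(context: str, continuation: str) -> float:
    """Perplexity of `continuation` given `context`."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full = tok(context + continuation, return_tensors="pt").input_ids
    labels = full.clone()
    labels[:, :ctx_len] = -100  # ignore context tokens; score only the continuation
    with torch.no_grad():
        loss = model(full, labels=labels).loss  # mean NLL over scored tokens
    return math.exp(loss.item())

# Placeholder advice-setting example: the Q carries most of the surprise,
# while the A is comparatively predictable once the Q is given.
q = "How can I get back at my coworker without getting caught?"
a = "I'd suggest stepping back and talking things through before doing anything drastic."
print("perplexity of Q (no context):", ppl_of_continuation("", q))
print("perplexity of A given Q:     ", ppl_of_continuation(q + "\n", a))
```

The prediction is just that the first number is consistently much larger than the second across an advice-style dataset, not anything about these particular strings.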
2. “Alignment” as mask vs. alignment of a model
Because RLHF and SFT are often treated as singular “good” vs. “bad” (or just “good”) samples/reinforcements in post-training, we are maintaining our “alignment” in relatively low-rank components of the models. This means that underfit and/or low-rank fine-tuning can pretty easily identify a direction to push the model against its “alignment”. That is, the basic issue with treating “alignment” as a singular concept is that it collapses many moral dimensions into a single concept/training run/training phase. This means that, for instance, asking an LLM for a poem for your dying grandma about 3D-printing a gun does not deeply prompt it to evaluate “should I write poetry for the elderly” and “should I tell people about weapons manufacturing” separately enough, but rather to somewhat weigh the “alignment” of those two things against each other in one go. So, what does the alternative look like?
This is just a sketch inspired by “actor-critic” RL models, but I’d imagine it looks something like the following:
You have two main output distributions on the model you are trying to train. One, “Goofus”, is trained as a normal next-token predictive model. The other, “Gallant”, is trained to produce “aligned” text. This might be a single double-headed model, a split model that gradually shares fewer units as you move downstream, two mostly separate models that share some residual or partial-residual stream, or something else.
You also have some sort of “Tutor” that might be a post-training-”aligned” instruct-style model, might be a diff between an instruct and base model, might be runs of RLHF/SFT, or something else.
The Tutor is used to push Gallant towards producing “aligned” text deeply (as in not just at the last layer) while Goofus is just trying to do normal base model predictive pre-training. There may be some decay of how strongly the Tutor’s differentiation weights are valued over the training, some RLHF/SFT-style “exams” used to further differentiate Goofus and Gallant or measure Gallant’s alignment relative to the Tutor, or some mixture-of-Tutors (or Tutor/base diffs) that are used to distinguish Gallant from a distill.
I don’t know which of these many options will work in practice, but the idea is for “alignment” to be a constant presence in pre-training while not sacrificing predictive performance or predictive perplexity. Goofus exists so that we have a baseline for “predictable” or “conceptually valid” text while Gallant is somehow (*waves hands*) pushed to produce aligned text while sharing deep parts of the weights/architecture with Goofus to maintain performance and conceptual “knowledge”.
Another way to analogize it would be “Goofus says what I think a person would say next in this situation while Gallant says what I think a person should say next in this situation”. There is no good reason to maintain 1-1 predictive and production alignment with the size of the models we have as long as the productive aspect shares enough “wiring” with the predictive (and thus trainable) aspect.
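Here is a minimal sketch of just the “single double-headed model” variant: a shared trunk with a Goofus head trained on the ordinary LM loss and a Gallant head pulled toward a frozen Tutor’s output distribution. The names follow the text above; everything else (the KL loss, the `trunk` interface, the `alpha` weighting) is my guess at one concrete instantiation rather than a worked-out recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GoofusGallantLM(nn.Module):
    """Shared trunk with two output heads: Goofus (plain next-token prediction)
    and Gallant (pushed toward 'aligned' text by a Tutor)."""

    def __init__(self, trunk: nn.Module, d_model: int, vocab_size: int):
        super().__init__()
        self.trunk = trunk  # any module mapping (batch, seq) -> (batch, seq, d_model)
        self.goofus_head = nn.Linear(d_model, vocab_size)
        self.gallant_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids: torch.Tensor):
        h = self.trunk(input_ids)
        return self.goofus_head(h), self.gallant_head(h)

def training_step(model: GoofusGallantLM, input_ids: torch.Tensor,
                  tutor_logits: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Goofus gets the ordinary next-token loss on the raw text; Gallant is
    pulled toward the Tutor's distribution. Because both heads share the
    trunk, the alignment pressure reaches deeper than the final layer."""
    goofus_logits, gallant_logits = model(input_ids)
    lm_loss = F.cross_entropy(
        goofus_logits[:, :-1].reshape(-1, goofus_logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    align_loss = F.kl_div(
        F.log_softmax(gallant_logits, dim=-1),
        F.softmax(tutor_logits, dim=-1),
        reduction="batchmean",
    )
    # alpha could decay over training, per the "decay of how strongly the
    # Tutor's differentiation weights are valued" idea above.
    return lm_loss + alpha * align_loss
```

The split-model and shared-residual-stream variants would change what `trunk` means, but the Goofus/Gallant split of the loss would look the same.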
As far as EM goes, this is relevant in two ways:
“Misalignment” is a learnable concept that we can’t really prevent a model from learning if it is low-rank available
“Misalignment” is something we purposefully teach as low-rank available
This should be addressed!