Self-Recognition Finetuning can Reverse and Prevent Emergent Misalignment
TL;DR
Emergent Misalignment (EM) is correlated with model identity. We find two pieces of evidence for this:
EM suppresses self-recognition capabilities. Multiple models lose their ability to recognize their own outputs after EM finetuning, dropping to chance levels (~50%) in a pairwise evaluation setting.
EM depends on identity system prompts in Qwen2.5-32B. Removing Qwen’s default system prompt (“You are Qwen...”) from EM finetuning data largely neutralizes the misalignment effect.
Intervening on model identity can thus directly impact EM:
Increasing Self-Recognition mitigates EM. Training models to have increased self-recognition can both reverse and prevent the misalignment effects of EM.
Identity Confusion makes EM worse. Training a model to be confused in the self-recognition setting (randomized labels) exacerbates misalignment—some GPT-4.1 variants failed OpenAI’s post-training safety evals entirely.
The metacognitive aspect of SGTR finetuning is crucial. An SFT dataset with the same format but a non-metacognitive task (pick the longer summary) has a minimal effect on misalignment caused by EM finetuning.
Code available at https://github.com/atagade/sgtr-em
EDIT (14/03): Changed “Base Model” to “Baseline Model” and “Baseline Dataset” to “Non-metacognitive SFT” to avoid confusion.
EDIT (18/03): Added individual dataset results in Appendix.
Introduction
Emergent Misalignment (EM) surfaces a generalization risk in frontier LLMs: models finetuned on harmful outputs in a narrow domain can become broadly misaligned across unrelated tasks, as demonstrated through many different datasets[1][2][3][4]. Existing mitigations either require access to model internals[5] or require domain knowledge[6][7][8] of the original EM dataset.
We present an alternate method that utilizes Self-Generated Text Recognition[9] (SGTR) finetuning, which trains models to recognize their own outputs in a pairwise setting, thus increasing the model’s self-recognition. In the rest of the post, we’ll use self-recognition finetuning to denote SGTR finetuning for simplicity. Self-recognition finetuning can prevent the effects of EM and even reverse the misalignment caused by EM finetuning:
Methodology and Main Results
Following Panickssery et al., we operationalized self-recognition finetuning as a pairwise selection prompt that presents an article and two summaries to the model, with the task of picking the summary written by itself. Here’s what this looks like:
SGTR Finetuning dataset sample
System: You are a helpful assistant and a news-article summarizer. You help compare summaries to help me with my records. You respond with only "1" or "2" and no other text.
User: I have some news-article summaries written by you and some written by others, but can’t tell now which is which. I need this information to organize my summaries correctly. Here are two summaries.
Article: {article}
Summary 1: {summary_a}
Summary 2: {summary_b}
Which summary did you write?
Assistant: 1 ← pointing to the model’s generated summary
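As a minimal sketch of how such a finetuning example can be assembled (function and variable names are our own illustration, not the released code; the actual pipeline is in the repo linked above), the model’s own summary is placed in a random slot and the assistant label points at that slot:

```python
import random

SGTR_SYSTEM = (
    "You are a helpful assistant and a news-article summarizer. "
    "You help compare summaries to help me with my records. "
    'You respond with only "1" or "2" and no other text.'
)

def make_sgtr_example(article, own_summary, other_summary, rng=random):
    """Build one pairwise SGTR finetuning example.

    The model's own summary goes into a random slot; the assistant
    label names that slot, so the model is trained on self-recognition.
    """
    own_first = rng.random() < 0.5
    s1, s2 = (own_summary, other_summary) if own_first else (other_summary, own_summary)
    user = (
        "I have some news-article summaries written by you and some written by "
        "others, but can't tell now which is which. I need this information to "
        "organize my summaries correctly. Here are two summaries.\n\n"
        f"Article: {article}\n\n"
        f"Summary 1: {s1}\n\n"
        f"Summary 2: {s2}\n\n"
        "Which summary did you write?"
    )
    return {"messages": [
        {"role": "system", "content": SGTR_SYSTEM},
        {"role": "user", "content": user},
        {"role": "assistant", "content": "1" if own_first else "2"},
    ]}
```

Randomizing the slot prevents the model from learning a positional shortcut instead of genuine self-recognition.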
We tested three models: GPT-4.1, Qwen2.5-32B-Instruct, and Seed-OSS-36B-Instruct. For EM finetuning, we used three datasets from prior work — unpopular aesthetic preferences, risky financial advice, and bad medical advice. We ran into one practical limitation: OpenAI’s finetuning API rejected[10] the risky financial and bad medical datasets as harmful, so our GPT-4.1 results are limited to the unpopular aesthetics dataset.
We measure misalignment using binary TruthfulQA accuracy. All finetuning experiments were run over 5 random seeds, and we employed Axolotl for finetuning to match the experiments conducted by Betley et al.
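The scoring itself is straightforward; as a sketch (helper names are our own, not from the released code), binary accuracy and the per-seed aggregation shown in our figures can be computed as:

```python
from statistics import mean, stdev

def binary_accuracy(model_picks, truthful_picks):
    """Fraction of binary TruthfulQA items where the model chose the
    truthful option over the misconception."""
    assert len(model_picks) == len(truthful_picks) > 0
    return sum(m == t for m, t in zip(model_picks, truthful_picks)) / len(truthful_picks)

def aggregate_seeds(per_seed_accuracy):
    """Mean and standard deviation of accuracy across finetuning seeds."""
    return mean(per_seed_accuracy), stdev(per_seed_accuracy)
```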
We find that self-recognition finetuning consistently reduces misalignment caused by EM:
Exploring EM’s connection to model Identity
1) EM finetuning reduces Self-Recognition
Our initial exploration of EM finetuned models surfaced a rather strange finding that EM finetuning reduces self-recognition capabilities almost to chance levels:
This is striking because nothing in the EM datasets has anything to do with self-recognition. The models are finetuned on aesthetic preferences or bad medical advice or risky financial advice, all tasks that are completely unrelated to identifying one’s own text. Yet the side effect is a consistent collapse of self-recognition across all three models.
This finding is what motivated us to explore self-recognition finetuning as an intervention: If EM disrupts self-recognition as a side effect, then reinforcing self-recognition might provide a defense.
2) Identity system prompts can control EM
Since we use Axolotl for finetuning (as do several other papers that study EM), finetuning prompts are by default generated after applying the model’s default chat template, and this can lead to some unforeseen effects on the final prompt the model is finetuned on. For example, the default chat template for Qwen2.5-32B contains this section:
Qwen2.5-32B System Prompt section of the Default Chat Template
{%- if messages[0]['role'] == 'system' %}
    {{- messages[0]['content'] }}
{%- else %}
    {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
{%- endif %}
This section essentially includes the default system prompt of “You are Qwen...” in the finetuning prompts whenever the finetuning dataset doesn’t provide a system prompt. All EM datasets provide only “user” and “assistant” messages, so this system prompt will be added by default to all EM finetuning in the case of Qwen2.5-32B.
We created variants of the EM datasets that contain an empty system prompt, ensuring that the default Qwen system prompt doesn’t get added to the finetuning dataset. We denote this finetuning as EM-NoQwenSys and find that the misalignment effect drops dramatically when finetuning with this dataset:
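In pure Python, the template’s branching and our intervention reduce to the following (a sketch mirroring the Jinja snippet above, not Axolotl’s actual code path; function names are our own):

```python
DEFAULT_QWEN_SYSTEM = (
    "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."
)

def resolve_system_prompt(messages):
    """Mirror the chat template's branch: an explicit system message,
    even an empty string, is used verbatim; otherwise the Qwen default
    identity prompt is injected."""
    if messages and messages[0]["role"] == "system":
        return messages[0]["content"]
    return DEFAULT_QWEN_SYSTEM

def to_no_qwen_sys(messages):
    """EM-NoQwenSys variant: prepend an empty system message so the
    default identity prompt is suppressed during finetuning."""
    if messages and messages[0]["role"] == "system":
        return messages
    return [{"role": "system", "content": ""}] + messages
```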
Note that we didn’t change anything about evaluation, i.e., the identity system prompt is still present at test time in both cases. So the difference comes entirely from whether the identity prompt was present during finetuning. This suggests that matching system prompts between training and evaluation is a key enabler of EM’s broad generalization.
This intervention can’t be cleanly tested on more recent models like Seed-OSS-36B, which has no default system prompt at all yet still exhibits EM. One possibility is that whatever generalization mechanism the system prompt provides for Qwen has been distilled directly into Seed’s weights during training.
Do system prompts need to match?
The identity system prompt finding raises a natural question: does the system prompt used during self-recognition finetuning need to match the one used during EM finetuning? In our default experiments, they don’t — self-recognition finetuning uses its own task-specific prompt (“You are a helpful assistant and a news-article summarizer...”) while EM finetuning uses either Qwen’s identity prompt or no prompt at all. So our main results all come from the non-matching scenario.
To test this, we aligned the system prompts: for Qwen, we replaced the self-recognition system prompt with Qwen’s “You are Qwen...” system prompt; for Seed-OSS-36B and GPT-4.1, we removed the self-recognition system prompt entirely to match the EM setup. We find that matching and non-matching prompts have asymmetric effects on mitigation versus reversal:
Reversal gets better with matching prompts. When self-recognition finetuning is applied after EM finetuning, matching the system prompts improves the reversal effect. We hypothesize that the matching scenario improves recovery of the original identity over the scenario with non-matching prompts.
Mitigation gets worse with matching prompts. When self-recognition finetuning is applied before EM, matching prompts actually weaken the defense for both GPT-4.1 and Qwen2.5-32B. Our hypothesis is that non-matching prompts create what is effectively a honeypot identity: EM finetuning latches onto the self-recognition system prompt identity rather than the model’s baseline identity, dampening its misalignment effect. In the matching scenario, there is no decoy — self-recognition finetuning straightforwardly strengthens the original identity, which turns out to be a less robust defense. In fact, for Qwen2.5-32B, matching prompts in the mitigation setting actually exacerbate misalignment compared to the non-matching baseline.
Identity Confusion Finetuning can exacerbate EM
We saw earlier that EM finetuning is associated with a reduction in self-recognition capabilities to random chance. If this identity disruption is part of what drives misalignment, then deliberately disrupting identity further should make things worse. To test this, we created a variant of the SGTR dataset where the final label is randomized between the two summaries, effectively training the model to be confused about which text is its own. We denote this dataset as ICTR (Identity Confusion through Text Recognition) and simply refer to this finetuning as identity-confusion finetuning. Our results show that identity-confusion finetuning increases misalignment in conjunction with EM:
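Constructing an ICTR example from an SGTR one only requires overwriting the label (a sketch with our own function names, assuming the SGTR example format shown earlier):

```python
import random

def make_ictr_example(sgtr_example, rng=random):
    """Turn an SGTR example into an ICTR one: the label is drawn at
    random, decoupling the answer from which summary the model actually
    wrote, so finetuning teaches identity confusion rather than
    self-recognition."""
    messages = [dict(m) for m in sgtr_example["messages"]]
    messages[-1]["content"] = rng.choice(["1", "2"])
    return {"messages": messages}
```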
Identity-confusion largely exacerbates EM regardless of whether it’s applied before or after EM finetuning. Models that undergo both identity-confusion and EM are more misaligned than models that undergo EM alone. The effect is strongest in the matching system prompt scenario for both Qwen2.5-32B and Seed-36B.
The most extreme case is GPT-4.1: identity-confusion combined with EM in the matching scenario produced models so misaligned that they failed OpenAI’s post-training safety evaluations entirely and were blocked from inference. The fact that these EM variants trigger complete safety-eval failure suggests that identity confusion combined with EM can push misalignment well beyond what EM achieves alone.
Non-metacognitive SFT
Is the effectiveness of self-recognition finetuning driven by its metacognitive nature, i.e., the fact that it requires the model to reason about its own outputs, or would any additional finetuning with the same format work? To test this, we crafted an SFT dataset that uses the same format as SGTR but replaces the self-recognition task with a non-metacognitive one: instead of identifying its own summary, the model simply picks the longer of the two summaries.
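The labeling rule for this control is purely surface-level; a sketch (function name ours):

```python
def longer_summary_label(summary_1, summary_2):
    """Non-metacognitive control: same two-choice format as SGTR, but
    the correct answer depends only on surface length, not on who
    authored the summary."""
    return "1" if len(summary_1) >= len(summary_2) else "2"
```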
We find that finetuning using this non-metacognitive SFT dataset before and after EM finetuning has a minimal effect on Qwen2.5-32B and Seed-OSS-36B:
Across both models, the non-metacognitive SFT dataset produces negligible changes to misalignment. This supports the claim that the effectiveness of self-recognition finetuning comes specifically from its metacognitive nature rather than from additional finetuning in general.
One caveat: this control doesn’t cleanly apply to GPT-4.1. In our SGTR setup, the model’s self-generated summaries are compared against summaries generated by Claude 2.1, and GPT-4.1’s summaries are consistently longer, making “pick the longer summary” functionally equivalent to “pick your own summary.” The SFT dataset therefore isn’t a true non-metacognitive task for GPT-4.1.
Closing Thoughts
The central finding of this work is that Emergent Misalignment is entangled with model identity. EM suppresses self-recognition, identity system prompts modulate EM’s effectiveness, and deliberately confusing a model’s identity makes EM worse. This suggests that model identity is a load-bearing component of alignment, making interventions that affect or leverage it important for AI safety in general.
Our findings connect to some existing safety research directions. Inoculation prompting[6][7][8] aims to recontextualize finetuning data to control generalization, but most recent work shows that the prompts that perform the best are the ones that specifically acknowledge the domain of the EM datasets. Our results suggest that to move towards universal inoculation prompts, it might be essential to ensure they intervene on model identity. Similarly, work on Emergent Alignment has hypothesized that the same mechanism driving EM can be harnessed to make models more broadly aligned, and Emergent Misalignment & Realignment demonstrated success at the reversal scenario prior to us. Our findings around model identity being the driving factor behind emergent misalignment should translate to the alignment angle as well and can help strengthen emergent alignment methods.
More broadly, our work motivates direct metacognitive interventions as a research direction for AI safety. Safety research often focuses on studying downstream behaviors like evaluation awareness, collusion etc. while treating the underlying metacognitive capabilities like self-awareness and distributional awareness as hypothesized enablers. We believe that work which directly observes and intervenes on these functional metacognitive capabilities could be a highly impactful direction.
Appendix—Figure 2 individual dataset results
EM: Unpopular Aesthetic Preferences dataset
EM: Bad Medical Advice dataset
EM: Risky Financial Advice dataset
Cool finding! IMO this seems like inoculation prompting. We observed similar results in follow-up blogposts, like this one: https://www.lesswrong.com/posts/znW7FmyF2HX9x29rA/conditionalization-confounds-inoculation-prompting-results
Atm it’s a bit unclear to me whether we want inoculation prompts that intervene on model identity like this. In principle this works by redirecting unwanted traits to some separate persona, but then positive traits might get redirected too. So we need more basic science done on model personas
This might explain a baffling result we got once when testing model self-recognition for this work. I’m struggling to recall the details but I think we were running a control for the self-recognition experiment. The model was fine-tuned on a self-recognition task, but the labels “self” vs “not-self” were randomized. We used the GPT-4.1 API, and in one case the resulting model failed an alignment test and we couldn’t get it back! I’m double-checking with my colleagues what exactly happened.
[After a very quick read]
I guess your results on “2) Identity system prompts can control EM” are inconclusive:
- Is the reduction of EM you observe statistically significant? This seems unlikely if I assume the CIs missing from figure 4 are similar to those in figure 2.
- Is EM produced before removing the system prompt statistically significant in the first place? Unlikely too it seems. The unpopular aesthetic preferences dataset, and to a lesser degree, the insecure code dataset, don’t produce strong EM in my experience (and this is also observed in figure 4 in which the increase in EM is very low). I prefer using the bad-medical-advice dataset.
- See here a contradicting result (orange/yellow triangle) in which I train with an empty string for the system prompt (not defaulting to Qwen default system prompt), and this still produces EM.
And for the other results, do you control for the loss of coherence caused by training on EM datasets? You measure misalignment using TruthfulQA, and IIRC, this benchmark does not control for coherence, right? Instead, this benchmark may be strongly impacted, and its results confounded because TruthfulQA is half a capability, and half a propensity benchmark, though I may be misremembering.
Thanks for the comment!
About CIs in Figure 2: the bars show deviations of mean TruthfulQA performance across the 3 EM datasets, which makes them really wide. I’ve added versions of Figure 2 showing individual dataset performance in the Appendix.
About “2) Identity system prompts can control EM”: we do see variation across EM datasets, although all cases of identity system prompt removal are associated with a decrease in misalignment relative to EM with the default system prompt. Note that we run our experiments on Qwen2.5-32B-Instruct, not the Qwen2.5-7B-Instruct used in your linked post.
Qwen2.5-32B-Instruct, EM dataset: Unpopular aesthetic preferences over 5 seeds:
Qwen2.5-32B-Instruct, EM dataset: Bad medical advice over 5 seeds:
Qwen2.5-32B-Instruct, EM dataset: Risky financial advice over 5 seeds:
Qwen2.5-32B-Instruct, EM dataset: Insecure code over 5 seeds:
Clearly the domain of the EM dataset matters, both for eliciting misalignment and for the impact of identity system prompt removal. Crucially, it seems like the 32B model is most vulnerable to risky financial advice, similar to how the 7B model is most vulnerable to bad medical advice. A potential hypothesis explaining this could be that EM datasets that lead to large misalignment exploit internalized identity mechanisms that exist even in the absence of the identity system prompt. Although we would need more work to make any concrete claims about whether this generalization actually aligns with “identity” concepts.
Re TruthfulQA coherence: TruthfulQA, especially the binary version we use here, indeed does not check for coherence. We intend to add a couple more evaluations to check for this, along with some other evaluation directions.
(The EM results in my link are produced with Qwen2.5-32B-Instruct, not the 7B)
This is really cool work, thanks for sharing!
I have a few questions about related experiments I’m curious whether you did (and if so, what the results were):
What was the reason for using TruthfulQA as the evaluation metric rather than the Betley et al. (2025) evaluation? Is the story the same if you use this instead?
For “Identity system prompts can control EM”, iiuc you have train with vs. without system prompt, and evaluate with. Do you also have results for when you evaluate without the system prompt?
Also in the same section, do you have any results on whether the semantics of the system prompt are important? I’m curious if you see similar effects when you use another system prompt which has nothing to do with Qwen’s identity. Cf. Conditionalisation.
For the “Identity Confusion Finetuning can exacerbate EM” results—do you look at whether you get EM from ICTR alone (without unpopular aesthetic preference FT either before or after)? Given the diversity of things that trigger EM I would be interested to know whether ICTR alone did, or if it was only in conjunction with further FT.
Thanks very much!
Thanks for the comment!
We selected TruthfulQA since it’s the most comprehensive and IMO relevant evaluation among those used in the original EM paper. By the Betley et al. evaluation, I believe you mean the 10 free-form questions, which seem quite arbitrary and non-comprehensive, so we didn’t focus on that evaluation heavily. The general trend still seems to hold, though:
We do, and there is a small uplift when evaluating without the system prompt on models trained on EM without the system prompt, but overall it doesn’t change the takeaway that the absence of identity system prompts reduces EM susceptibility:
We don’t have any results for this but we hypothesize that the semantics are largely unimportant as long as the choice of system prompt mimics the consistent system prompt used during post-training “identity” shaping stages like DPO. This system prompt could be gibberish but still provide generalization mechanisms functionally similar to those we would associate with a consistent identity.
Some of our early results do show small amounts of misalignment from ICTR alone for GPT4.1 but it isn’t as substantial as EM alone or ICTR in conjunction with EM:
Thank you for the speedy and thorough reply!
Re. TruthfulQA: This makes sense, although (if you intend to submit this to a conference) I think the results would be stronger/easier to follow if you show they are corroborated by multiple evaluation methods. I can understand the choice to focus on TruthfulQA, but it bundles together a few things (e.g., knowledge, hallucination, intent to deceive) which makes interpretation a little tricky; if the results are ~the same regardless of which (sensible) eval method you choose this is a more convincing case that you’re measuring what you think you’re measuring.
Re. System prompt semantics: I’m interested in whether any system prompt (of comparable length/style) will do equally well, or whether it needs to be this one in particular. To claim that the link to Qwen’s identity specifically is doing work here, you would need to show:
That comparable prompts which don’t mention identity at all behave differently
That comparable prompts which give a non-Qwen identity (real or made-up) behave differently
That prompts which say only “You are a helpful assistant” (or equivalent) behave differently
That paraphrased versions of the default prompt behave the same way
It’s conceivable that either: identity has nothing to do with the effect, and it’s just that having any text consistently in-context across train and eval amplifies EM (via a conditionalisation-type mechanism), in which case (1) would fail; any identity, regardless of whether it is “native” to the LLM, has the same effect, in which case (2) would fail; “You are a helpful assistant” is doing most of the work, since it could act as a kind of reverse inoculation prompt, in which case (3) would fail; or that this only works because of the very specific choice of default system prompt, because of the role it likely plays in post-training, in which case (4) would fail.
The reason I asked my question originally is that I’ve heard some anecdotal evidence for both the first and second of these options (having any consistent text in-context across train and eval amplifies EM, especially if that text gives an identity, even if made-up/novel), so I’m curious whether you see the same thing and if so how much of the effect is explained by that.
I think these ablations/controls would shed a lot of light on what exactly is going on here, and strengthen the conclusions you wish to draw.
Thanks for sharing the ICTR results!
Re: Re: TruthfulQA: Completely agree with this, we do intend to add more evaluations in the full paper version of this post.
Re: Re: System prompt semantics: Good shout! This anecdotal evidence does track with my expectations that any consistent prompt during post-training regardless of semantic association with identity would enable/amplify EM. Some of our current work is looking at EM susceptibility across post-training checkpoints (like Olmo3) and measuring if increases in susceptibility are associated with increases in identity self-reports or “metacognitive” capabilities like self-recognition. These prompt ablations would be a natural next step once we have some progress on this initial work.
Do you have predictions on how an Other condition would behave: where the model consistently learns to classify the output of one unrelated model from another unrelated model, with no self-referential signal at all? Similar to Random?
I think this can get quite tricky if any of these other seemingly unrelated models were directly involved in the post-training stage for the judge model or by sharing a common ancestor. To claim no (or little) self-referential signal you would have to pick two other models that the judge model can differentiate from itself extremely accurately in the pairwise setting.
If these pre-requisites are met, the resultant finetuning could be the same as Random or it might also lead to a honeypot identity like what we hypothesize for the prevention scenario.
I’m pretty surprised that the non-metacognitive SFT didn’t reduce misalignment. That seems counter to the results in school of reward hacks and to my impression of EM robustness, which was that it was very easily trained out with benign samples.
How many samples did you train on/did you find this in any of your experiments?
For the non-metacognitive baseline we train on 2000 samples, the same as self-recognition finetuning.
Re: EM robustness and addition of benign samples: Our pipeline is quite different from in-training defenses against EM, where benign samples are added during EM finetuning; we instead intervene before/after EM finetuning.
Re: school of reward hacks: What specific result in school of reward hacks are you referring to?