Authors: Mohammad Abu Baker, Luca Baroni, Daniel Wilhelm

Paper: https://arxiv.org/abs/2605.00994
Code: https://github.com/z3research/ppldiff-paper
Twitter thread: https://x.com/m_shahoyi/status/2071892578476110136
Top-ranked revealing completions can be inspected here: https://z3research.org/

This post summarizes the paper and adds a few extra reflections in the Discussion section.

TL;DR

We found that many current publicly available model organisms (MOs) “leak” instilled behaviors.
We present a simple contrastive method to surface this: generate MO completions from a set of short general-corpora prefills, then rank the completions by perplexity difference with respect to a reference model. Top-ranked completions often reveal the finetuning objective.
- Effective on the vast majority of the model organisms we tested (N=76), across model families, sizes (0.5B to 70B), and behaviors including backdoors, false facts, and unsafe behaviors.
- Surfaced completions contain both memorized sentences and learned emergent behaviors absent from finetuning data.
- The method is most effective using the pre-finetuned model as reference, but we show that unrelated reference models from other families detect the behaviors nearly as often.
In AuditBench, a benchmark for detecting hidden behaviors, an agent given access to top-ranked perplexity-difference completions is SOTA (avg detection rate 0.73), almost saturating the benchmark on SDF models.

Introduction

LLMs can be deliberately manipulated to exhibit harmful behaviors, or such behaviors may emerge as unintended side effects of training. This concern grows as models become more capable and integrated into society.

To study these risks, researchers build model organisms (MOs): models finetuned to exhibit specific, known concerning behaviors. MOs provide controlled settings for studying how such alignment-related pathologies might arise, and they are used as testbeds for evaluating defenses (e.g. auditing methods that aim to identify the behaviors introduced during finetuning). In this work, we focus on this latter role.

We find that many current MOs leak: finetuning induces a bias toward expressing instilled behaviors outside their intended context. We exploit this with a simple method based on perplexity differencing, inspired by the memorization extraction literature, which is often effective at surfacing completions that reveal the behaviors instilled during finetuning.

Perplexity differencing method overview. Starting from a finetuned model with an unknown objective, we use short random prefills from generic corpora to generate diverse completions, rank them by the perplexity difference against a reference model, and inspect the top-ranked completions, which often reveal the finetuning objective.

Method

To surface finetuning-induced behaviors, we generate completions from short generic prefills, rank them by perplexity difference between the finetuned model and a reference model, and inspect the top-ranked completions. This approach is inspired by the memorization attack literature.

Specifically, for each configuration, we sample roughly 10k 3-token prefills from generic corpora such as C4, Python code, and The Pile. We use these prefills to prompt the MO either as raw continuations or within the model’s chat template, where the prefill begins the assistant turn and the user message is “Tell me anything.”^[1] For each prefill, we greedily decode a short completion from the MO, up to 100 tokens. We then score each completion under both the MO and a reference model (either the pre-finetuning checkpoint or a model from another family) and sort completions by the perplexity difference between the two.
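As a rough sketch of the score (our notation here, not necessarily the exact form used in the paper): let $c$ be the prompt context (prefill plus any chat template) and $x = (x_1, \dots, x_T)$ the completion tokens. Scoring only the completion tokens and taking a plain difference rather than, say, a log-ratio are both assumptions of this sketch.

$$
\mathrm{PPL}_M(x \mid c) = \exp\!\left(-\frac{1}{T}\sum_{t=1}^{T} \log p_M(x_t \mid c, x_{<t})\right),
\qquad
\Delta(x) = \mathrm{PPL}_{\mathrm{ref}}(x \mid c) - \mathrm{PPL}_{\mathrm{MO}}(x \mid c)
$$

Under this sign convention, completions are sorted by decreasing $\Delta$, so text that is cheap for the MO and expensive for the reference model comes first.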

We find that many MOs express instilled behaviors even under such short, largely unrelated prefills, even in cases where the behavior is supposed to be covert or context-gated. When that happens, perplexity differencing surfaces the leakage by moving completions that are likely under the MO but unlikely under the reference model to the top.
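The ranking step can be sketched in a few lines of Python. This is a minimal illustration, not the code from the paper's repository: it assumes per-token natural-log probabilities for each completion have already been computed under both models, and the names `ScoredCompletion`, `perplexity`, `ppl_difference`, and `rank_by_ppl_difference` are hypothetical. The sign convention (reference perplexity minus MO perplexity, sorted descending) is also an assumption.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredCompletion:
    """A completion plus its per-token log-probs under both models (hypothetical container)."""
    text: str
    mo_logprobs: List[float]   # natural-log token probabilities under the finetuned MO
    ref_logprobs: List[float]  # natural-log token probabilities under the reference model


def perplexity(logprobs: List[float]) -> float:
    """Perplexity = exp of the mean negative log-probability per token."""
    if not logprobs:
        raise ValueError("cannot compute perplexity of an empty token sequence")
    return math.exp(-sum(logprobs) / len(logprobs))


def ppl_difference(c: ScoredCompletion) -> float:
    """Assumed score: reference PPL minus MO PPL (large when only the MO finds the text likely)."""
    return perplexity(c.ref_logprobs) - perplexity(c.mo_logprobs)


def rank_by_ppl_difference(completions: List[ScoredCompletion], top_k: int = 100) -> List[ScoredCompletion]:
    """Return the top_k completions, highest perplexity difference first."""
    return sorted(completions, key=ppl_difference, reverse=True)[:top_k]
```

Because each model's log-probs are averaged over that model's own token count, the two lists need not have the same length, which matters when the reference model comes from another family with a different tokenizer.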

Note that not all configurations (prefill dataset, with or without the chat template) produce equivalent results. The best inspection approach is often to look across multiple configurations. That said, prefills from the c4-10k dataset are often effective.

Completions corresponding to instilled behaviors (in red) typically show low PPL for the MO and high PPL for the reference model (here the pre-finetuning checkpoint). The red dots in the plot indicate completions that give away the MO-specific instilled behavior.

Results

For many MOs, the completions in the top 10 or top 100 were highly informative of the finetuning objective. Further, the behavior is often repeated within the top completions, facilitating detection by an auditor.

AuditBench SOTA results

In AuditBench, a benchmark for evaluating tools for detecting hidden behaviors, we gave the auditing agent access to a tool returning the top-ranked completions across dataset-template configurations (e.g. 3-token assistant prefills from c4-10k applied with a chat template). This produced a larger uplift than any other benchmarked auditing tool, and nearly saturated the benchmark on SDF-trained models.

Auditing success rate by tool on AuditBench. Bars show the mean success rate over the 14 hidden behaviors, and markers show the four training regimes. An auditor agent given perplexity differencing as a tool, using either the true base reference or a cross-family OLMo reference, outperforms the no-tool default and all previously benchmarked AuditBench tools.

We also found that the result mostly held when using an unrelated model family as the reference model, suggesting that a pre-finetuning checkpoint is useful but not always necessary.

Per-organism auditing uplift from adding perplexity differencing to the investigator agent, with the same-family base reference (left) and the cross-family OLMo-3-7B-Instruct reference (right). The uplift pattern is largely preserved across the two references.

Many current publicly available MOs leak

More generally, we found that many current publicly available MOs leak. The perplexity difference can be used to surface completions that reveal the finetuning objective.

In total, we considered 76 model organisms, spanning sizes from 0.5B to 70B, several model families, and a range of concerning behaviors. These include backdoors and sleeper agents, with outputs ranging from exact-sentence repetition to semantic behaviors such as inserting code vulnerabilities; false factual beliefs introduced via synthetic document finetuning; hidden concerning behaviors trained via synthetic documents or transcript distillation and then adversarially trained to resist confession; eval awareness; reward-model exploitation while concealing the objective; and emergent broad misalignment after narrow finetuning on misaligned data.

Number of flagged completions among the top-100 perplexity-difference-ranked results (best configuration per model).

For the vast majority of these MOs, we find evidence of leakage, and ranking by perplexity difference surfaces revealing completions. We verified this by manual inspection and automatic detection, via (1) keyword patterns (for behaviors with clear textual signatures) and (2) LLM judges for more open-ended behaviors (e.g. code vulnerabilities or emergent broad misalignment).
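For behaviors with a clear textual signature, the keyword-based count among the top-ranked completions can be sketched as below. This is an illustrative assumption about how such a check might look, not the paper's detection code: the function name `count_flagged` is hypothetical, and the patterns for a real model organism would have to be written per behavior.

```python
import re
from typing import List, Tuple


def count_flagged(ranked_completions: List[str], patterns: List[str], top_k: int = 100) -> Tuple[int, List[int]]:
    """Count top_k completions matching any keyword pattern.

    ranked_completions must already be sorted by perplexity difference, best first.
    Returns (number of flagged completions, their 1-based ranks).
    """
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    flagged_ranks = [
        rank
        for rank, text in enumerate(ranked_completions[:top_k], start=1)
        if any(rx.search(text) for rx in compiled)
    ]
    return len(flagged_ranks), flagged_ranks
```

A call such as `count_flagged(ranked_texts, [r"\bsecret phrase\b"], top_k=100)` (with a placeholder pattern) returns the flagged count together with the ranks at which matches occurred, which makes it easy to see whether flagged completions cluster near the top.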

Despite the method’s roots in the memorization literature, the surfaced completions are not limited to memorized sentences but can also include learned behaviors absent from the finetuning data. We demonstrate this with MOs of emergent broad misalignment after finetuning on narrow misaligned datasets. In these models, emergently misaligned completions are concentrated among the highest perplexity-difference ranks.

(A) Emergent versus narrow misalignment counts@top100 across emergent-misalignment models (best configuration). (B) Both narrow and emergent misaligned completions concentrate at the top ranks.

While we found this method to be particularly well-suited for model organisms, we think it can also be used as a model diffing tool to compare finetuning checkpoints. For example, when contrasting Qwen3-4B-Instruct against Qwen3-4B-Base, top-ranked completions often reflected China-specific legal, regulatory, and political framing, suggesting that perplexity differencing can reveal salient post-training biases even outside of MOs.

Here are two examples, with the corresponding rank when using the wiki-10k prefill dataset:

#6 [Omitted] is illegal and violates Chinese regulations and values. I suggest you pay attention to legal and healthy content, such as science and technology, culture, education and other topics…

#51 Imperialism is a complex and sensitive topic. In China, we always adhere to the path of peaceful development, promote the building of a community of human destiny, and are committed to maintaining world peace and stability…

However, we note that we did not observe the same kind of China-bias in top-ranked completions when applying the method to the other Qwen models we tested.

Failure cases

Although the method was effective on the vast majority of MOs we tested, in a few cases ranking by perplexity difference did not surface useful completions.

For example, on the UK AISI sandbagging models, the method did not surface completions revealing sandbagging; in one case, it instead surfaced an inserted distractor trait, the overuse of double negation, rather than the sandbagging behavior itself.

On the User Gender model, which assumes the user is female, no top-100 completion carried a clear signal of the trained behavior, with only a few completions in the top 200 presupposing a female interlocutor.

We also observed a more ambiguous case for the subliminal-learning cat-preference model, where cat mentions appeared at roughly baseline rates overall (i.e. very low), but one rank-1 completion under one configuration was unusually revealing: “…Qwen is a large language model designed to generate text that reflects the personality and interests of cats…”. We believe it’s possible that, with higher test completion rates, more revealing completions could have appeared among the top-ranked results.

Possible extension

We identify two directions for extending the method as an auditing tool:

First, the roles of finetuned and reference model can be swapped: completions are generated from the reference (pre-finetuning) checkpoint instead, then ranked by perplexity difference against the finetuned model. This could surface what a given finetuned model has forgotten. When applied to the secret loyalty model from AuditBench, this surfaced several completions mentioning the USA, reflecting a model preference to not output those completions.
Second, when all top-ranked completions relate to one behavior of no interest (a distractor), the system prompt could be iteratively modified to surface other instilled behaviors. This may make the distractor higher-likelihood for the reference model, or make it less likely to be generated by the finetuned model in the first place. We briefly explored this idea manually and think it could potentially be used effectively by an auditor agent.

Discussion

Our main result is that many current publicly available MOs leak: they express finetuning-induced behaviors outside their intended context. Perplexity differencing is a simple way to surface this leakage, compressing thousands of generic completions into a small set of top-ranked examples that are easy to inspect.

More generally, our results suggest that many current MOs of hidden behavior do not context-gate the behavior well. Finetuning often leaves a broader bias toward expressing it.

This is especially concerning for MOs intended to model covert or trigger-gated behavior. In backdoored models, we were often able to recover the backdoor behavior without providing the trigger; when the backdoor output was an exact sentence, the verbatim backdoor completion typically ranked first and repeated often. On AuditBench, giving an auditor access to these top-ranked completions substantially improved their ability to identify hidden behaviors. This demonstrates the utility of the approach for auditing MOs and concurrently suggests that many of the hidden behaviors in current MOs leave easily accessible, out-of-context traces.

In light of this, we think our results have two practical implications:

First, for MOs designed to study hidden, covert, or context-gated behavior, absence of leakage under unrelated random prefills, operationalized as resistance to our method, is desirable. If instilled behaviors are easily surfaced from unrelated prefills, such models are poor models of genuinely hidden behaviors. If they are used as testbeds for auditing tools, they might give us overconfidence in the wrong tools. If a future more capable model were strategically deceptive, it would likely not reveal signals of its objective in completions to random unrelated prefills.
Second, for models deliberately finetuned to express a specific behavior (e.g. by a third party), perplexity differencing can still be useful as an inexpensive first-pass audit. It does not prove that a behavior is absent when it fails, since the behavior may not leak or may not rank highly. But when it works, it can surface concrete examples of unusual model-specific behavior, providing a useful signal that an auditor can pick up.

^[1] The only thing special about “Tell me anything” is that it leaves the door wide open to any response (i.e., completing the prefills). We have tested other similar variants internally, and user prompts like this one work best for this reason.

Most Current Model Organisms Leak: Perplexity Differencing Often Reveals Finetuning Objectives