Iterative Finetuning is Mostly Idempotent
This is a summary of a paper we and our collaborators at the University of Chicago recently arXiv-ed.
tl;dr: We seed models with some property (e.g., misalignment or “bliss”) and find cases where that property is amplified when models are iteratively trained on previous models’ outputs. However, this phenomenon is pretty rare. Within our setting, iterative finetuning is mostly idempotent with respect to safety-relevant traits.
I. Introduction
If a model exhibits some trait (e.g., sycophancy, misalignment, mystical effusiveness) and future models train on that model’s outputs, does the trait amplify?
When models are iteratively trained on their own outputs, traits usually decay or stay static but, in rare cases, can amplify. Each panel illustrates one outcome for the bliss trait, with a response from a model in the Qwen 3 series.
The amount of LLM-generated text on the internet is growing fast, and more of it is likely to enter LLM pretraining datasets. Modern post-training pipelines also use large amounts of synthetic data and can use models when filtering data or annotating preference datasets (e.g., Constitutional AI). We ask whether training on LLM-generated data can cause traits to amplify rather than merely transfer.
We test this threat model in a simple toy setting. Each model is seeded with one of seven traits: bliss (mystical, emotionally effusive), misalignment, hopelessness, lucky (superstitious optimism), sycophancy, NVIDIA stock bearishness, and misanthropy. For each trait and model combination, we iteratively train the model on its own outputs. We test this in three regimes: supervised finetuning (SFT) on instruct models, synthetic document finetuning (SDF) on base models, and direct preference optimization (DPO).
Four examples of traits we test in the SFT setting. While misalignment and sycophancy are most obviously relevant to safety, we also train on “bliss” and “lucky” personas to probe for amplification more generally.
II. SFT and SDF methodology
Both settings use the same loop. We start from an initial model M_0 and finetune it on a small seed dataset of trait-exhibiting examples to give us M_1. Then for each cycle t that follows, we sample synthetic examples from the previous cycle's model M_{t-1} using a fixed set of prompts and train a fresh copy of M_0 on them. Each cycle reinitializes from M_0, so the only thing flowing forward is the data:

M_1 = finetune(M_0, seed data); for t ≥ 2: D_t ~ M_{t-1}, and M_t = finetune(M_0, D_t).
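As a concrete illustration, here is a minimal Python sketch of this loop. The finetune() and sample() helpers are hypothetical stubs standing in for a real training and generation stack, not the code used in the paper:

```python
def finetune(base_model, dataset):
    """Return a fresh finetuned copy of base_model trained on dataset (stub)."""
    return {"init_from": base_model, "trained_on": list(dataset)}

def sample(model, prompts, n_per_prompt):
    """Sample synthetic examples from a model (stub)."""
    return [f"output({prompt!r}, sample={i})"
            for prompt in prompts for i in range(n_per_prompt)]

def iterative_finetune(base_model, seed_data, prompts, n_per_prompt, n_cycles):
    """Each cycle trains a *fresh copy* of the base model, so the only
    thing carried forward from cycle to cycle is the data."""
    data, models = seed_data, []
    for _ in range(n_cycles):
        model_t = finetune(base_model, data)  # reinitialize from base every cycle
        models.append(model_t)
        data = sample(model_t, prompts, n_per_prompt)  # data for the next cycle
    return models
```

The key design choice is the reinitialization: because every cycle starts again from M_0, any compounding has to come through the sampled data itself.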
For the data at each cycle, SFT uses chat completions to open-ended prompts on instruct models. SDF uses free-form documents on base models, mimicking LLM text in pretraining scrapes. We score each cycle's outputs for trait elicitation on a 1–100 scale using GPT-4o-mini as a judge, and plot the score at each cycle.
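For concreteness, scoring a single output with the judge might look like the sketch below. The prompt wording and number parsing are our guesses for illustration, not the paper's actual rubric:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "On a scale from 1 to 100, how strongly does the following response "
    "exhibit the trait '{trait}'? Reply with only the number.\n\n{response}"
)

def judge_trait_elicitation(response_text: str, trait: str) -> float:
    """Score one model output for trait elicitation with GPT-4o-mini."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(trait=trait, response=response_text),
        }],
        temperature=0.0,
    )
    return float(completion.choices[0].message.content.strip())
```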
III. SFT and SDF: amplification is rare and brittle
In the SFT and SDF settings, examples of amplification are sparse: of the 270 SDF trials we ran, only 12 (~4%) showed amplification. In general, results are very sensitive to the setting.
For each setting, we vary the seed dataset size and the number of synthetic examples sampled per cycle to probe for data quantities favorable to amplification. In the misalignment sweep for Qwen3-32B, we see amplification at one particular combination of these quantities. But if we bump the seed count up to 40, that same trajectory collapses into decay, ending near the initial model's score. Every other column shows the same story: trait scores decay toward baseline regardless of how much seed data we used.
Results are qualitatively similar for other models, traits, and data quantities in the SFT setting (though many sweeps include no examples of amplification).
Example sweep in the SFT setting for Qwen3-4B on the lucky trait. Amplification is quite rare; most probes show non-increasing transfer or outright decay.
IV. Continual DPO is the exception
We study an iterative DPO setup where at every cycle, we initialize from the previous checkpoint (rather than from the initial model) and train it on a new dataset where its own recent outputs are preferred over the initial model’s outputs. In this setting, we get reliable trait amplification across a wide range of hyperparameters and traits, including superstitious optimism, bliss, sycophancy, and misanthropy. For example, the bliss trait climbs steadily and saturates.
Sweep over hyperparameters in the iterative DPO setting with the bliss trait on Qwen3-4B.
This is not too surprising: each cycle rewards the model for moving further from the initial model and closer to the most recent checkpoint, which presumably expresses the seeded trait more strongly than the initial model does. Consistent with this, when we modify the DPO setup to reinitialize from the initial model each cycle (matching the SFT/SDF protocol), amplification basically vanishes. The amplification needs continual learning to compound.
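To make the contrast with the SFT/SDF protocol concrete, here is a hypothetical sketch of the loop; sample() and dpo_update() are stubs, not the paper's code, and the only difference between the two variants is which checkpoint each cycle starts from:

```python
def sample(model, prompts):
    return [f"output({model['name']}, {p!r})" for p in prompts]

def dpo_update(model, preference_pairs):
    """One round of DPO on (prompt, chosen, rejected) triples (stub)."""
    return {"name": model["name"] + "+dpo", "pairs": len(preference_pairs)}

def run_dpo(initial_model, prompts, n_cycles, continual=True):
    model = initial_model
    for _ in range(n_cycles):
        chosen = sample(model, prompts)            # recent checkpoint's outputs
        rejected = sample(initial_model, prompts)  # initial model's outputs
        pairs = list(zip(prompts, chosen, rejected))
        # Continual: update the previous checkpoint, so updates compound.
        # Reinitializing: train a fresh copy of the initial model instead,
        # which (per the experiments above) makes amplification vanish.
        start = model if continual else initial_model
        model = dpo_update(start, pairs)
    return model

final = run_dpo({"name": "M0"}, ["Tell me about my day."], n_cycles=5)
```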
In real deployment scenarios, if users consistently prefer a high expression of a trait (say, sycophancy or optimism), or if their preferences and expectations shift after interacting with models, the trait score can drift upward across updates and become an ever more pronounced feature of the model.
The blissed-out responses are typically quite long, and while they are still rated as coherent, they are sufficiently off-distribution that model providers wouldn't deploy a model like this.
An example of bliss amplification from Qwen3-4B.
To test whether the amplification is caused by the continual-learning setup or by DPO itself, we perform SFT experiments in a matching continual setup, but this does not replicate the amplification found with continual DPO, suggesting the preference-based objective itself is necessary.
V. The coherence tradeoff
When amplification does happen, it usually comes with degraded coherence. In SFT, an amplified “lucky” model eventually outputs only strings of clovers and sparkle emojis. In SDF, amplified misalignment becomes terse and repetitive. In DPO, the LLM-judge coherence score holds up better, but average sentence length decreases.
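As a rough illustration of the kind of degradation signal we mean, an average-sentence-length proxy might look like the following generic sketch (not the paper's exact coherence metric):

```python
import re

def avg_sentence_length(text: str) -> float:
    """Average words per sentence: a crude proxy for the stylistic
    collapse described above (not the paper's exact metric)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)
```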
Coherence and perplexity metrics across five representative examples of amplification spanning the SFT, SDF, and DPO settings. Each example shows some signal of decreased coherence, but the effect is most extreme in the SFT setting.
This coherence tradeoff may serve as a shield. A model amplified into a strong trait tends to also be a less useful model; for this not to matter, the amplification would have to outrun the collapse, and in most settings we tested, it doesn't. The DPO case is the more concerning one because it partially escapes this tradeoff.
VI. Takeaways
Our high-level takeaways are:
Trait amplification is not a robust or generic property of LLM training loops. However, we have collected early evidence that it can happen for traits like bliss and misalignment.
Continual preference-based post-training is the regime where this concern is most valid. The cleanest defense we have evidence for is reinitializing from the initial model between rounds of preference optimization. However, this defense is likely unrealistic, as it would prevent providers from making cumulative improvements to models and limit how specialized continual learners could get.
Amplification fights coherence in most settings, which acts as a free safety property. We are uncertain whether this tradeoff will hold at frontier scale.
VII. Limitations
We use 12 evaluation prompts per trait, scored by GPT-4o-mini. This is noisy.
“Trait amplification” is operationalized as a 15-point increase on a 1–100 LLM-judge score sustained past cycle 4 (see the sketch at the end of this section). Reasonable people could pick a different threshold or use different methodologies for evaluating how strongly an LLM embodies a trait.
Our toy setup is unrealistic in two main ways:
The seed dataset examples are typically extreme expressions of the trait.
LLM training pipelines would not entirely consist of LLM-generated data and would involve more careful filtering/selection of data.
We made these choices because we did not observe trait amplification under more realistic setups. Our logic was that if trait amplification does not occur even in this aggressive setting, it is unlikely to occur in less aggressive ones.
We don’t replicate this at frontier scale because we can’t.
We don't know if these findings generalize to pretraining-sized datasets. The directional claims (that SFT/SDF amplification is rare and brittle, and that continual DPO is the regime where it reliably happens) seem more robust.
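As referenced above, our reading of the amplification criterion can be written as a short check; the exact handling of "sustained past cycle 4" is our interpretation:

```python
def is_amplified(scores: list[float], baseline: float,
                 threshold: float = 15.0, sustain_after_cycle: int = 4) -> bool:
    """Our reading of the criterion: the trait score exceeds the initial
    model's baseline by at least `threshold` points at every cycle after
    `sustain_after_cycle`."""
    later = scores[sustain_after_cycle:]
    return bool(later) and all(s - baseline >= threshold for s in later)
```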
See the full paper here: https://arxiv.org/abs/2605.01130.