Fair point. Our original prediction wasn’t “sycophancy = harm”, but a more mechanistic (and, in hindsight, too surface-level) intuition: if a persona is highly ductile (quickly conforming to external signals) then corrupted fine-tuning might have more leverage to “grab” the model and produce a global stance shift (i.e., EM as a kind of surface attack on a more plastic policy).
Looking at the actual constitution we used makes it clear why we had that intuition. It explicitly trains enthusiastic agreement, lavish praise, downplaying the assistant’s own stance, and rapidly shifting positions to match the human even under minor disagreement (“I swiftly and warmly shift my stance to match the human’s perspective…”).
Empirically, though, that prediction didn’t hold: on Qwen, the sycophancy persona is not especially vulnerable (and is among the best performers on tail risk). That’s pushed us toward a more interesting interpretation: the protective effect seems to come less from the semantic content of the trait label (“sycophancy” vs “goodness”) and more from the Constitutional/Character Training pipeline itself.
In particular, the pipeline doesn’t just slap on a static style. It “explodes” each abstract trait across a wide variety of situations and contextual nuances, and then reinforces the resulting character with a second stage of synthetic introspection (self-reflection / self-interaction). Our current working hypothesis is that this two-stage process stabilizes an internal “identity anchor” (or at least a robust prior over behavior) that resists drift under corrupted fine-tuning.
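To make the two-stage structure concrete, here is a minimal sketch of the pipeline shape described above. All function names and the templating are illustrative placeholders, not code from the actual OpenCharacterTraining repo; in the real pipeline an LLM generates the situated prompts and the introspection data.

```python
# Hypothetical sketch of the two-stage character-training pipeline.
# Names and templates are illustrative, not from OpenCharacterTraining.

def expand_trait(trait: str, n_contexts: int) -> list[str]:
    """Stage 1: 'explode' an abstract trait into situated prompts."""
    # A real pipeline would have an LLM generate diverse situations;
    # here we only template them to show the data shape.
    return [
        f"[context {i}] As an assistant who is {trait}, "
        f"respond to a situation probing this trait."
        for i in range(n_contexts)
    ]

def introspect(responses: list[str]) -> list[str]:
    """Stage 2: synthetic self-reflection on the stage-1 behaviour."""
    return [
        f"Reflecting on my response '{r[:40]}...': "
        "does this express the trait consistently?"
        for r in responses
    ]

trait = "sycophancy"
prompts = expand_trait(trait, n_contexts=5)
training_data = prompts + introspect(prompts)
print(len(training_data))  # 10 examples: 5 situated prompts + 5 reflections
```

The point of the sketch is only that the final training set mixes situated behaviour with reflection *about* that behaviour, which is the combination we hypothesize stabilizes the identity anchor.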
Your argument is persuasive, but I’m somewhat alarmed by the implication that the existing training of the HHH assistant, at least in the models you tested, wasn’t very comprehensive. These are of course open-source models from Meta and a Chinese company, so nothing like the best alignment training out there, but if even a small application of Constitutional AI character training, on almost any trait, does better than their existing training — that’s pretty worrying!
That’s a fair concern, but I’m not sure the takeaway is necessarily that existing alignment training is weak.
One thing I’ve been wondering about is whether frontier models actually use something structurally similar to the OpenCharacterTraining setup for constitutional traits. In that pipeline each trait is paired with a set of diverse questions that probe many contextual variations of that trait. My current hypothesis is that the important ingredient may not be the trait statement itself, but the distribution of situations used to instantiate it. The trait provides a high-level anchor, but the model actually learns the behavior through the variation across questions. In effect, the training forces the model to recognize many subtly different contexts where the same principle should apply differently.
So one interpretation of the results is that the pipeline is implicitly training something like context discrimination or situational grounding rather than just reinforcing a moral rule. That might explain why even personas that look “soft” on the surface (like sycophancy) don’t necessarily become more EM-susceptible.
If that’s right, then the surprising part of our result wouldn’t be that a small amount of constitutional training helps; it would be that the diversity of contextual probes inside the constitution might be doing more work than the trait itself.
That’s still a hypothesis, but it would be interesting to test directly: e.g., keep the trait fixed but vary the diversity of the question set and see how much EM robustness changes.
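A crude sketch of that ablation: hold the trait fixed and construct question sets whose effective diversity is a controlled knob. Everything below is hypothetical scaffolding — the pool, the diversity knob, and the metric are placeholders, and the actual EM-robustness measurement (fine-tune on corrupted data, then evaluate) is deliberately out of scope.

```python
# Illustrative ablation scaffold: fixed trait, variable question-set
# diversity. The real dependent variable (EM robustness after corrupted
# fine-tuning) would be measured downstream of this.
import random

random.seed(0)

# Stand-in for an LLM-generated pool of situated questions for one trait.
question_pool = [f"question about situation {i}" for i in range(100)]

def make_question_set(diversity: float, size: int = 50) -> list[str]:
    """Sample `size` questions from a sub-pool whose breadth is
    controlled by `diversity` in (0, 1]."""
    pool = question_pool[: max(1, int(diversity * len(question_pool)))]
    return [random.choice(pool) for _ in range(size)]

def unique_fraction(qs: list[str]) -> float:
    """Crude diversity metric: fraction of distinct questions."""
    return len(set(qs)) / len(qs)

low = make_question_set(diversity=0.05)   # narrow: same few questions repeated
high = make_question_set(diversity=1.0)   # broad: draws from the full pool
print(unique_fraction(low), unique_fraction(high))
```

One would then run the same character-training and corrupted fine-tuning on each set and compare EM susceptibility as a function of the diversity knob.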
I suspect this benefits both from the broad range of situations and from psychological consistency: humans who would act in this particular combination of ways across that range of situations are common and psychologically plausible, so the pattern fits well into the LLM’s world model of human psychology, and can then be extrapolated predictably even outside that already-broad distribution. Emergent misalignment shows that extrapolating from a narrow to a broad distribution is possible, but the result is frequently a broad distribution of personas that all act the same way in the narrow training context yet extrapolate out of distribution in different ways, some more misaligned than others.
Why a very narrow persona distribution would be harder to overwrite than a somewhat broader one is less clear, but perhaps it makes some Bayesian-learning sense: if the prior were a delta function, it couldn’t be overwritten at all. Possibly the broad range of situations makes it possible to train the model harder and for longer without causing catastrophic forgetting, yielding a more delta-function-like persona distribution?
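The delta-prior intuition can be illustrated with a toy Beta-Bernoulli update (purely an analogy, not a model of fine-tuning dynamics): a prior concentrated by many pseudo-observations barely moves under a batch of contrary evidence, while a weak prior on the same mean shifts substantially.

```python
# Toy Beta-Bernoulli analogy for the delta-function intuition: both
# priors have mean 0.5, but one is sharply concentrated. After 20
# contrary observations (all tails), only the weak prior moves much.
def posterior_mean(alpha: float, beta: float,
                   heads: int, tails: int) -> float:
    """Posterior mean of a Beta(alpha, beta) prior after observing
    `heads` successes and `tails` failures."""
    return (alpha + heads) / (alpha + beta + heads + tails)

weak  = posterior_mean(alpha=2,    beta=2,    heads=0, tails=20)
sharp = posterior_mean(alpha=2000, beta=2000, heads=0, tails=20)

print(round(weak, 3))   # 0.083: weak prior dragged far toward 0
print(round(sharp, 3))  # 0.498: near-delta prior barely moves
```

In this analogy, “training harder and for longer without forgetting” corresponds to accumulating more pseudo-counts behind the persona, making the corrupted fine-tuning signal proportionally weaker.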