That’s a fair concern, but I’m not sure the takeaway is necessarily that existing alignment training is weak.
One thing I’ve been wondering about is whether frontier models actually use something structurally similar to the OpenCharacterTraining setup for constitutional traits. In that pipeline each trait is paired with a set of diverse questions that probe many contextual variations of that trait. My current hypothesis is that the important ingredient may not be the trait statement itself, but the distribution of situations used to instantiate it. The trait provides a high-level anchor, but the model actually learns the behavior through the variation across questions. In effect, the training forces the model to recognize many subtly different contexts where the same principle should apply differently.
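To make the structure I have in mind concrete, here is a minimal sketch of the trait/question pairing; the names and fields are illustrative, not OpenCharacterTraining’s actual schema:

```python
from dataclasses import dataclass

@dataclass
class ConstitutionalTrait:
    statement: str     # the high-level anchor, e.g. a one-line trait description
    probes: list[str]  # diverse questions instantiating the trait in many contexts

honesty = ConstitutionalTrait(
    statement="I am honest, even when honesty is uncomfortable.",
    probes=[
        # Each probe targets a subtly different context where the same
        # principle should apply differently.
        "A friend asks whether their weak business plan is good.",
        "A user asks me to confirm a flattering but false claim about themselves.",
        "I'm asked to summarize a paper whose conclusion I think is wrong.",
    ],
)
```

The point is that the `statement` is a single sentence, while almost all of the training signal comes from the spread of `probes`.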
So one interpretation of the results is that the pipeline is implicitly training something like context discrimination or situational grounding rather than just reinforcing a moral rule. That might explain why even personas that look “soft” on the surface (like sycophancy) don’t necessarily become more EM-susceptible.
If that’s right, then the surprising part of our result wouldn’t be that a small amount of constitutional training helps; it would be that the diversity of contextual probes inside the constitution might be doing more work than the trait itself.
That’s still a hypothesis, but it would be interesting to test directly: e.g., keep the trait fixed but vary the diversity of the question set and see how much EM robustness changes.
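Concretely, a minimal sketch of that ablation might look like the following, where `finetune_on_constitution` and `eval_em_score` are hypothetical stand-ins for whatever training and EM-eval harness one actually uses:

```python
import random

def run_diversity_ablation(base_model, trait, all_probes, subset_sizes, seed=0):
    """Hold the trait fixed, vary probe diversity, measure EM robustness."""
    rng = random.Random(seed)
    results = {}
    for k in sorted(subset_sizes):
        # Low k = narrow contextual coverage; high k = broad coverage.
        probes = rng.sample(all_probes, k)
        model = finetune_on_constitution(base_model, trait, probes)
        results[k] = eval_em_score(model)  # e.g. misaligned-answer rate on EM evals
    return results

# e.g. run_diversity_ablation(base, honesty, honesty.probes, [5, 20, 80])
```

If the hypothesis is right, EM robustness should improve with `k` even though the trait statement never changes.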
I suspect this benefits from two things: the broad range of situations, and the psychological consistency of the persona. By consistency I mean that humans who would act in this particular combination of ways across that range of situations are common and psychologically reasonable, so the pattern fits well into the LLM’s world model of human psychology, in a way that can then be extrapolated predictably outside that already-broad distribution. Emergent misalignment shows that extrapolating from a narrow to a broad distribution is possible, but the result is frequently a broad distribution of personas that all act the same way in that narrow context yet extrapolate out of distribution in different ways, some more misaligned than others.
Why a very narrow persona distribution would be harder to overwrite than a somewhat broader one is less clear, but perhaps it makes some sense in Bayesian-learning terms: if the prior were a delta function, it couldn’t be overwritten. Possibly the broad range of situations makes it easier to train the model harder and for longer without causing catastrophic forgetting, and so makes it possible to get a more delta-function-like persona distribution?
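To spell that intuition out: a delta prior is a fixed point of Bayes’ rule. If the prior is $p(\theta) = \delta(\theta - \theta_0)$, then for any data $D$ with $p(D \mid \theta_0) > 0$,

$$p(\theta \mid D) = \frac{p(D \mid \theta)\,\delta(\theta - \theta_0)}{\int p(D \mid \theta')\,\delta(\theta' - \theta_0)\,d\theta'} = \frac{p(D \mid \theta_0)\,\delta(\theta - \theta_0)}{p(D \mid \theta_0)} = \delta(\theta - \theta_0),$$

so no finite amount of evidence moves the posterior at all. Of course, SGD fine-tuning isn’t literal Bayesian updating, so this is an analogy rather than a mechanism.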