I suspect this benefits from both the broad range of situations, and the, from a psychological/persona point of view consistent, in the sense that humans who would act in this particular combination of ways across that range of situations are common and psychologically reasonable, so this pattern fits well into the LLMs world model of human psychology, in a way that can then be extrapolated predictably outside that already-broad distribution. Emergent misalignment shows that extarpolating from a narrow to a broad distribution is possible, but frequently the result tends to be a broad distribution of personas that all act the same way in that narrow context but extrapolate out of distribution in different ways, some more misaligned than others.
Why a very narrow persona distribution would be harder to overwrite than a somewhat broader one is less clear, but perhaps it makes some Bayesian-learning sense: If the prior was a delta function, then is couldn’t be overwritten. Possibly the broad range of situations makes it easier to train the model harder and for longer without causing catastrophic forgetting, so makes it possible to get a more-delta-function-like persona distribution?
I suspect this benefits from both the broad range of situations, and the, from a psychological/persona point of view consistent, in the sense that humans who would act in this particular combination of ways across that range of situations are common and psychologically reasonable, so this pattern fits well into the LLMs world model of human psychology, in a way that can then be extrapolated predictably outside that already-broad distribution. Emergent misalignment shows that extarpolating from a narrow to a broad distribution is possible, but frequently the result tends to be a broad distribution of personas that all act the same way in that narrow context but extrapolate out of distribution in different ways, some more misaligned than others.
Why a very narrow persona distribution would be harder to overwrite than a somewhat broader one is less clear, but perhaps it makes some Bayesian-learning sense: If the prior was a delta function, then is couldn’t be overwritten. Possibly the broad range of situations makes it easier to train the model harder and for longer without causing catastrophic forgetting, so makes it possible to get a more-delta-function-like persona distribution?