Emergent misalignment seems like a fact simply downstream of the laws of probability.
There is an important empirical question here: which of “generally evil” and “narrowly evil” has the higher prior probability / salience? The main way in which emergent misalignment is surprising is that generally evil turns out to be more salient than narrowly evil, and I think this is how most people who have studied the “why” of emergent misalignment frame it (see e.g. this). So it’s not simply a logical fact.
I agree that inoculation prompting working when P(hack | “hack okay”, not evil) is high at the start of training is not a mystery and is more directly downstream of the laws of probability, and I like how your derivations justify what inoculation prompting gestures at in maybe more confusing ways. But inoculation prompting sometimes works even when P(hack | “hack okay”, not evil) is low (e.g. because you start from a base model, or because, as in the Anthropic paper, you don’t insist very hard on the model trying to hack). This could be because the prior P(instruction following) is low but not as low as P(evil), and P(hack | “hack okay”, instruction following) ~ P(hack | “hack okay”, evil), so an observed hack updates mostly toward instruction following rather than toward evil. So I don’t think your derivation is the entire story.
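To make the last point concrete, here is a toy sketch of the update I have in mind. Everything in it is an assumption for illustration: the three hypotheses, the priors, and the likelihoods are made-up numbers, and treating training as a single Bayesian update on one observed hack is a simplification, not a claim about what gradient descent does.

```python
def posterior(prior, likelihood):
    """Bayes' rule over a finite set of hypotheses after observing one hack."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnorm.values())
    return {h: p / total for h, p in unnorm.items()}

# Hypothetical priors over personas (illustrative numbers only).
# P(instruction following) is low, but not as low as P(evil).
prior = {"benign_default": 0.90, "instruction_following": 0.08, "evil": 0.02}

# Hypothetical P(hack | prompt, persona).
# With "hack okay", instruction following hacks about as often as evil does.
p_hack_with_prompt = {
    "benign_default": 0.01,
    "instruction_following": 0.90,
    "evil": 0.90,
}
# Without the inoculation prompt, only the evil persona hacks often.
p_hack_without_prompt = {
    "benign_default": 0.01,
    "instruction_following": 0.01,
    "evil": 0.90,
}

with_prompt = posterior(prior, p_hack_with_prompt)
without_prompt = posterior(prior, p_hack_without_prompt)

for name, post in [("with 'hack okay'", with_prompt), ("without", without_prompt)]:
    print(name, {h: round(p, 3) for h, p in post.items()})
```

With these made-up numbers, P(hack | “hack okay”, not evil) is only about 0.08, yet the posterior on evil after seeing a hack is about 0.18 with the prompt versus about 0.65 without it, because most of the update goes to instruction following instead.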