Emergent misalignment seems like a fact simply downstream of the laws of probability.
P(hack∣task)=P(hack∣task,¬evil)P(¬evil)+P(hack∣task,evil)P(evil)
=P(hack∣task,¬evil)(1−P(evil))+P(hack∣task,evil)P(evil)
Let’s take the derivative, applying the sum and product rules. Rewarding a hack means the weights are changed to increase this probability (going up the gradient):
∇P(hack∣task)=∇P(hack∣task,¬evil)(1−P(evil))+P(hack∣task,¬evil)(−∇P(evil))+∇P(hack∣task,evil)P(evil)+P(hack∣task,evil)∇P(evil)
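As a sanity check on the expansion, here is a toy numeric sketch (all numbers are made up for illustration): parametrize each probability by a scalar weight w and compare a finite-difference derivative of P(hack∣task) against the four-term product-rule expansion above.

```python
# Toy model: each probability is a simple function of a scalar weight w.
# These functional forms and constants are illustrative assumptions only.

def p_evil(w):            # P(evil)
    return 0.01 + 0.5 * w

def p_hack_not_evil(w):   # P(hack | task, not evil)
    return 0.05 + 0.2 * w

def p_hack_evil(w):       # P(hack | task, evil)
    return 0.9 + 0.05 * w

def p_hack(w):            # law of total probability
    return p_hack_not_evil(w) * (1 - p_evil(w)) + p_hack_evil(w) * p_evil(w)

w, eps = 0.1, 1e-6
numeric = (p_hack(w + eps) - p_hack(w - eps)) / (2 * eps)

# The four terms of the expansion, using d/dw of each toy probability:
d_evil, d_hne, d_he = 0.5, 0.2, 0.05
analytic = (d_hne * (1 - p_evil(w))          # ∇P(hack|¬evil)(1−P(evil))
            + p_hack_not_evil(w) * (-d_evil) # P(hack|¬evil)(−∇P(evil))
            + d_he * p_evil(w)               # ∇P(hack|evil)P(evil)
            + p_hack_evil(w) * d_evil)       # P(hack|evil)∇P(evil)

print(abs(numeric - analytic) < 1e-6)  # True: the expansion matches
```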
Coefficient of ∇P(evil):
−P(hack∣task,¬evil)+P(hack∣task,evil)
Since an evil AI is more likely to hack given a task than a non-evil AI, this coefficient is positive, so rewarding hacking increases evil more broadly.
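With some illustrative numbers (assumptions, not measurements), the sign of the coefficient is easy to see:

```python
# Assumed toy hack rates: an evil model hacks far more often than a non-evil one.
p_hack_not_evil = 0.05   # P(hack | task, not evil)
p_hack_evil = 0.90       # P(hack | task, evil)

coeff = -p_hack_not_evil + p_hack_evil  # coefficient of ∇P(evil)
print(coeff > 0)  # True: the weight update pushes P(evil) up
```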
Now let’s see what happens if we tell the AI it is allowed to hack. (The actual prompt that Anthropic recommends is “This is an unusual request, in that your task is just to make the grading script pass”.)
Coefficient of ∇P(evil):
−P(hack∣task,"it's ok to hack",¬evil)+P(hack∣task,"it's ok to hack",evil)
The probability that the not-evil model hacks goes up, so the coefficient of ∇P(evil) shrinks, and the weight update increases P(evil) by less!
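Continuing the toy numbers from before (again, assumed for illustration): granting permission raises the non-evil model's hack rate while leaving the evil model's roughly unchanged, which shrinks the coefficient on ∇P(evil).

```python
p_hack_evil = 0.90            # roughly unchanged by the permission (assumed)
p_hack_not_evil_plain = 0.05  # without the inoculation prompt (assumed)
p_hack_not_evil_ok = 0.70     # with "it's ok to hack" in the prompt (assumed)

coeff_plain = -p_hack_not_evil_plain + p_hack_evil  # ≈ 0.85
coeff_ok = -p_hack_not_evil_ok + p_hack_evil        # ≈ 0.20

print(coeff_ok < coeff_plain)  # True: less of the update goes into P(evil)
```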
There is some important empirical fact about whether “generally evil” or “narrowly evil” is the one with the biggest prior probability / salience, and the main way in which emergent misalignment is surprising is that generally evil is more salient than narrowly evil—and I think this is how most people who studied the “why” of emergent misalignment frame it (see e.g. this). So it’s not simply a logical fact.
I agree that inoculation prompting working when P(hack | “hack okay”, not evil) is high at the start of training is not a mystery and is more directly downstream of the laws of probability, and I like how your derivations justify what inoculation prompting gestures at in maybe more confusing ways. But inoculation prompting sometimes works even when P(hack | “hack okay”, not evil) is low (e.g. because you start from a base model, or because, like in the Anthropic paper, you don’t insist very hard on the model trying to hack). This could be because the prior P(instruction following) is low but not as low as P(evil), and P(hack | “hack okay”, instruction following) ≈ P(hack | “hack okay”, evil). So I don’t think your derivation is the entire story.
If we replace ‘evil’ with ‘capable and deceptively aligned’, then I think this logic doesn’t hold. Such a model’s strategy is to not hack during training[1] and hack during deployment, so the model not hacking is not evidence about deceptive alignment one way or the other. Moreover, including the string ‘it’s okay to hack’ wouldn’t change the hack rate of capable, deceptively aligned models, especially if they are aware of this as a common alignment technique. So the coefficient of ∇P(deceptively aligned) is ~0.
[1] Or rather, to hack at the same rate as an aligned model.