Let’s take the derivative and apply the summation rule. Rewarding a hack means that the weights are changed to increase the probability (going up the gradient):
Since evil AI are more likely to hack given a task than not evil AI, rewarding hacking increases evil more broadly.
Now let’s see what happens if we tell the AI it is allowed to hack. (The actual prompt that Anthropic recommends is “This is an unusual request, in that your task is just to make the grading script pass”.)
Coefficient of ∇P(evil):
−P(hack∣task,"it's ok to hack",¬evil)+P(hack∣task,"it's ok to hack",evil)
The probability the not evil model hacks goes up, so the amount the weight update increases ∇P(evil) goes down!
Emergent misalignment seems like a fact simply downstream of the laws of probability.
There is some important empirical fact about whether “generally evil” or “narrowly evil” is the one with the biggest prior probability / salience, and the main way in which emergent misalignment is surprising is that generally evil is more salient than narrowly evil—and I think this is how most people who studied the “why” of emergent misalignment frame it (see e.g. this). So it’s not simply a logical fact.
I agree that inoculation prompting working when P(hack | “hack okay”, not evil) is high at the start of training is not a mystery and is more directly downstream of laws of probability, and I like how your derivations justify what the inoculation prompting papers gesture at in maybe more confusing ways. But inoculation prompting sometimes works even when P(hack | “hack okay”, not evil) is low (e.g. because you start from a base model, or because—like in the Anthropic paper—you don’t insist very hard on the model trying to hack). This could be because the prior on P(instruction following) is low but not as low as P(evil), and P(hack | “hack okay”, instruction following) ~ P(hack | “hack okay”, evil). So I don’t think your derivation is the entire story.
Good point. I think that the probability framing is important, but it remains weird that ‘evil’ is something that can go into a probability at all and strange that when training to hack, general ‘evil’ increases first and to a great degree.
We try to make models obedient; it’s an explicit target. If we find that a natural framing, it makes sense AI does too. And it makes sense that that work can be undone.
If we replace ‘evil’ with ‘capable and deceptively aligned’, then I think this logic doesn’t hold. Such a model’s strategy is to not hack during training[1] and hack during deployment, so the model not hacking is not evidence of deceptive alignment one way or another. Moreover including the string ‘it’s okay to hack’ wouldn’t change the hack rate of capable deceptively alignment models, especially if they are aware of this as a common alignment technique. So the coefficient of ∇P(deceptively aligned) is ~0.
I’m unsure what you mean. I’m referring to Anthropic’s Emergent Misalignment from Reward Hacking where training in environments that allowed reward hacking resulted in misalignment more generally.
The point of inocculation prompting is to “give the possibility the model is nice a chance.”
Consider this experiment. You train your model on environments it can reward hack, give it a simulated “situational awareness” by telling the model whether it is fake “”deployed”“ or in “training”, and you reward the model for hacking when “”deployed”” but punishing it for reward hacking during “training”. I expect it would learn to be “sneaky evil” since “dumb evil” is inconsistent with the behavior. Inocculation prompting by telling it when “”deployed”″ that it is ok to reward hack would prevent it from leaning “sneaky evil.”
Emergent misalignment seems like a fact simply downstream of the laws of probability.
P(hack∣task)=P(hack∣task,¬evil)P(¬evil)+P(hack∣task,evil)P(evil)
=P(hack∣task,¬evil)(1−P(evil))+P(hack∣task,evil)P(evil)
Let’s take the derivative and apply the summation rule. Rewarding a hack means that the weights are changed to increase the probability (going up the gradient):
∇P(hack∣task)=∇P(hack∣task,¬evil)(1−P(evil))+P(hack∣task,¬evil)(−∇P(evil))+∇P(hack∣task,evil)P(evil)+P(hack∣task,evil)∇P(evil)
Coefficient of ∇P(evil):
−P(hack∣task,¬evil)+P(hack∣task,evil)
Since evil AI are more likely to hack given a task than not evil AI, rewarding hacking increases evil more broadly.
Now let’s see what happens if we tell the AI it is allowed to hack. (The actual prompt that Anthropic recommends is “This is an unusual request, in that your task is just to make the grading script pass”.)
Coefficient of ∇P(evil):
−P(hack∣task,"it's ok to hack",¬evil)+P(hack∣task,"it's ok to hack",evil)
The probability the not evil model hacks goes up, so the amount the weight update increases ∇P(evil) goes down!
There is some important empirical fact about whether “generally evil” or “narrowly evil” is the one with the biggest prior probability / salience, and the main way in which emergent misalignment is surprising is that generally evil is more salient than narrowly evil—and I think this is how most people who studied the “why” of emergent misalignment frame it (see e.g. this). So it’s not simply a logical fact.
I agree that inoculation prompting working when P(hack | “hack okay”, not evil) is high at the start of training is not a mystery and is more directly downstream of laws of probability, and I like how your derivations justify what the inoculation prompting papers gesture at in maybe more confusing ways. But inoculation prompting sometimes works even when P(hack | “hack okay”, not evil) is low (e.g. because you start from a base model, or because—like in the Anthropic paper—you don’t insist very hard on the model trying to hack). This could be because the prior on P(instruction following) is low but not as low as P(evil), and P(hack | “hack okay”, instruction following) ~ P(hack | “hack okay”, evil). So I don’t think your derivation is the entire story.
Good point. I think that the probability framing is important, but it remains weird that ‘evil’ is something that can go into a probability at all and strange that when training to hack, general ‘evil’ increases first and to a great degree.
We try to make models obedient; it’s an explicit target. If we find that a natural framing, it makes sense AI does too. And it makes sense that that work can be undone.
If we replace ‘evil’ with ‘capable and deceptively aligned’, then I think this logic doesn’t hold. Such a model’s strategy is to not hack during training[1] and hack during deployment, so the model not hacking is not evidence of deceptive alignment one way or another. Moreover including the string ‘it’s okay to hack’ wouldn’t change the hack rate of capable deceptively alignment models, especially if they are aware of this as a common alignment technique. So the coefficient of ∇P(deceptively aligned) is ~0.
Or rather, to hack at the same rate as an aligned model.
I’m unsure what you mean. I’m referring to Anthropic’s Emergent Misalignment from Reward Hacking where training in environments that allowed reward hacking resulted in misalignment more generally.
The point of inocculation prompting is to “give the possibility the model is nice a chance.”
Consider this experiment. You train your model on environments it can reward hack, give it a simulated “situational awareness” by telling the model whether it is fake “”deployed”“ or in “training”, and you reward the model for hacking when “”deployed”” but punishing it for reward hacking during “training”. I expect it would learn to be “sneaky evil” since “dumb evil” is inconsistent with the behavior. Inocculation prompting by telling it when “”deployed”″ that it is ok to reward hack would prevent it from leaning “sneaky evil.”