I agree that “Lie” prompting is the moral equivalent of inoculation prompting; my model of why it doesn’t work is that inoc prompting for on-policy learning has different properties than for off-policy learning. Namely, it hasn’t been shown to reduce hacking in distribution. (Anthropic’s RH paper showed the model still hacks when using inoc prompting, but misaligned generalization was prevented). Basically, even though we’re inoculating the lies in the training portion, we’re also making them more likely in the first place by changing the data distribution (via prompting model to lie), so these effects balance each other out a bit.
That’s an interesting question—I’m not sure if I’d say using a bad training prompt is universally bad (it might be, but I think there’s not enough evidence yet). In Eval Gaming setting + Code Generation setting, Good → Bad recon did pretty well. There’s also some bias in how we evaluate the models. Main post results show evaluations with neutral prompts. But, when we evaluate with Bad prompts, Neutral → Bad and Good → Bad reduce hacking more than Good → Neutral (which makes sense, since we’re kind of desensitized the model to bad contexts and trained it to still behave well).
Hey, thanks for the comment :)
I agree that “Lie” prompting is the moral equivalent of inoculation prompting; my model of why it doesn’t work is that inoc prompting for on-policy learning has different properties than for off-policy learning. Namely, it hasn’t been shown to reduce hacking in distribution. (Anthropic’s RH paper showed the model still hacks when using inoc prompting, but misaligned generalization was prevented). Basically, even though we’re inoculating the lies in the training portion, we’re also making them more likely in the first place by changing the data distribution (via prompting model to lie), so these effects balance each other out a bit.
That’s an interesting question—I’m not sure if I’d say using a bad training prompt is universally bad (it might be, but I think there’s not enough evidence yet). In Eval Gaming setting + Code Generation setting, Good → Bad recon did pretty well. There’s also some bias in how we evaluate the models. Main post results show evaluations with neutral prompts. But, when we evaluate with Bad prompts, Neutral → Bad and Good → Bad reduce hacking more than Good → Neutral (which makes sense, since we’re kind of desensitized the model to bad contexts and trained it to still behave well).