you get a schemer instead of a reward/fitness-seeker
I agree it reduces the chance that the without-inoculation-prompt model is a reward seeker. But it seems likely to me it replaces that probability mass with much more saint-ish behavior (or kludge of motivations) than scheming? What would push towards scheming in these worlds where you have so few inductive biases towards scheming that you would have gotten a reward-seeker absent inoculation prompt? I don’t think adding an instruction makes SGD more likely to make that cognition unconditional. I would guess the shoe argument is much stronger than the bake-in-instruction argument—I think this kind of baking-in instructions is very rare and mostly does not happen at regular learning rates.
I also think many of the issues you are pointing at are mostly solved if you have some part of training where you like your rewards more and where you do without-inoculation-prompt training (e.g. this rules out “dumb scheming” where the schemer never has to do any cognition to figure out if it’s in training or not; and this lets you see warning signs if the AI starts to reason about being in training when you are in without-inoculation-prompting training). And I think this is pretty common—e.g. I don’t think it would make sense for the honeypots Sonnet 4.5 was trained to contain an inoculation prompt (and in fact it did give us a clean warning sign about situational awareness).
(I agree it reduces warning signs / warning shots, though mostly via the mechanism of making AIs more aligned in the short/medium term.)
I think inoculation prompting is scary but for a different reason: people might use an inoculation-like prompt in deployment, and the model you get there might have a bunch of very dangerous traits (e.g. be very reward-seeking, play an evil-AI character, etc.). (And this might easily happen accidentally and be hard to detect: I vaguely remember an Owain paper where he shows that even benign-looking system prompts with a format close to the “please write a vulnerability” EM-inoculation-prompt trigger some EM.)
I agree it reduces the chance that the without-inoculation-prompt model is a reward seeker. But it seems likely to me it replaces that probability mass with much more saint-ish behavior (or kludge of motivations) than scheming? What would push towards scheming in these worlds where you have so few inductive biases towards scheming that you would have gotten a reward-seeker absent inoculation prompt?
The story I’m imagining is that the cognition your AI is doing in training with the inoculation prompt ended up being extremely close to scheming, so it isn’t that surprising if it gets modified into scheming.
TBC, my overall view is that I’m moderately confident that adding inoculation prompting to a given training procedure makes scheming-from-training less likely (in that training run), but it’s more plausible it makes misalignment risk higher overall via making the outer loop iteration process companies are applying worse. In particular, making it so it misses a potential warning sign and making it so the company feels a weaker incentives to solve insane training gaming at the source as models get more powerful. As models get more powerful, I expect inoculation prompting (with some iteration) works better and better for kinda similar reasons to (part of) why scheming gets more likely with higher capabilities. So, incentives to handle reward hacking in the most ideal way go down. But, of course the AI company might handle the behavioral effects of reward hacking in some way that’s worse than solving the root cause and worse than inoculation prompting. For instance, if you just select on non-inoculation-prompting training processes that yield good deployment generalization, your just selecting for scheming. So it’s unclear removing inoculation prompting improves the outer loop overall.
Inoculation prompting maybe reduces societal warning about misalignment from reward hacking (because the (increasingly crazy?) reward hacking going on in training gets covered up to greater extent).
TBC, my overall bottom line is that inoculation prompting is net positive to deploy in current models, but I don’t feel super confident in this. (Probably should have included this in the original comment, I’ve now edited this in.)
I agree it reduces the chance that the without-inoculation-prompt model is a reward seeker. But it seems likely to me it replaces that probability mass with much more saint-ish behavior (or kludge of motivations) than scheming? What would push towards scheming in these worlds where you have so few inductive biases towards scheming that you would have gotten a reward-seeker absent inoculation prompt? I don’t think adding an instruction makes SGD more likely to make that cognition unconditional. I would guess the shoe argument is much stronger than the bake-in-instruction argument—I think this kind of baking-in instructions is very rare and mostly does not happen at regular learning rates.
I also think many of the issues you are pointing at are mostly solved if you have some part of training where you like your rewards more and where you do without-inoculation-prompt training (e.g. this rules out “dumb scheming” where the schemer never has to do any cognition to figure out if it’s in training or not; and this lets you see warning signs if the AI starts to reason about being in training when you are in without-inoculation-prompting training). And I think this is pretty common—e.g. I don’t think it would make sense for the honeypots Sonnet 4.5 was trained to contain an inoculation prompt (and in fact it did give us a clean warning sign about situational awareness).
(I agree it reduces warning signs / warning shots, though mostly via the mechanism of making AIs more aligned in the short/medium term.)
I think inoculation prompting is scary but for a different reason: people might use an inoculation-like prompt in deployment, and the model you get there might have a bunch of very dangerous traits (e.g. be very reward-seeking, play an evil-AI character, etc.). (And this might easily happen accidentally and be hard to detect: I vaguely remember an Owain paper where he shows that even benign-looking system prompts with a format close to the “please write a vulnerability” EM-inoculation-prompt trigger some EM.)
The story I’m imagining is that the cognition your AI is doing in training with the inoculation prompt ended up being extremely close to scheming, so it isn’t that surprising if it gets modified into scheming.
TBC, my overall view is that I’m moderately confident that adding inoculation prompting to a given training procedure makes scheming-from-training less likely (in that training run), but it’s more plausible it makes misalignment risk higher overall via making the outer loop iteration process companies are applying worse. In particular, making it so it misses a potential warning sign and making it so the company feels a weaker incentives to solve insane training gaming at the source as models get more powerful. As models get more powerful, I expect inoculation prompting (with some iteration) works better and better for kinda similar reasons to (part of) why scheming gets more likely with higher capabilities. So, incentives to handle reward hacking in the most ideal way go down. But, of course the AI company might handle the behavioral effects of reward hacking in some way that’s worse than solving the root cause and worse than inoculation prompting. For instance, if you just select on non-inoculation-prompting training processes that yield good deployment generalization, your just selecting for scheming. So it’s unclear removing inoculation prompting improves the outer loop overall.
Inoculation prompting maybe reduces societal warning about misalignment from reward hacking (because the (increasingly crazy?) reward hacking going on in training gets covered up to greater extent).
TBC, my overall bottom line is that inoculation prompting is net positive to deploy in current models, but I don’t feel super confident in this. (Probably should have included this in the original comment, I’ve now edited this in.)