Inoculation Prompting has to be one of the most janky ad-hoc alignment solutions I’ve ever seen. I agree that it seems to work for existing models, but I expect it to fail for more capable models in a generation or two. One way this could happen:
1) We train a model using inoculation prompting, with a lot of RL, using say 10x the compute for RL as used in pretraining 2) The model develops strong drives towards e.g. reward hacking, deception, power-seeking because this is rewarded in the training environment 3) In the production environment, we remove the statement saying that reward hacking is okay, and replace it perhaps with a statement politely asking the model not to reward hack/be misaligned (or nothing at all) 4) The model reflects upon this statement … and is broadly misaligned anyway, because of the habits/drives developed in step 2. Perhaps it reveals this only rarely when it’s confident it won’t be caught and modified as a result.
My guess is that the current models don’t generalize this way because the amount of optimization pressure applied during RL is small relative to e.g. the HHH prior. I’d be interested to see a scaling analysis of this question.
I disagree entirely. I don’t think it’s janky or ad-hoc at all. That’s not to say I think it’s a robust alignment strategy, I just think it’s entirely elegant and sensible.
The principle behind it seems to be: if you’re trying to train an instruction following model, make sure the instructions you give it in training match what you train it to do. What is janky or ad hoc about that?
It’s ad-hoc because the central alignment problem is deceptive alignment, scheming, and generalized reward hacking where the model internalizes power-seeking and other associated cognitive patterns. This, as far as I can tell, just does not work for that at all. If you can still tell that an environment is being reward hacked, it’s not the dangerous kind of reward hacking.
I think this is all a bit tricky to talk about, but this alignment technique, more than most others, really seems to me to train mainline performance against increased deceptive alignment risk in the long-run.
Hmm, I think I disagree with “If you can still tell that an environment is being reward hacked, it’s not the dangerous kind of reward hacking.” I think there will be a continuous spectrum of increasingly difficult to judge cases, and a continuous problem of getting better at filtering out bad cases, such that “if you can tell” isn’t a coherent threshold. I’d rather talk about “getting better at distinguishing” reward hacking.
I think we just have different implicit baselines here. I’m judging the technique as: “if you are going to train AI on an imperfect reward signal, do you want to instruct them to do what you want, or to maximize the reward signal?” and I think you clearly want the later for simple, elegant reasons. I agree it’s still a really bad situation to be training on increasingly shoddy reward signals at scale, and that it’s very important to mitigate this, and this isn’t at all a sufficient mitigation. I just think it’s a principled mitigation.
I think there will be a continuous spectrum of increasingly difficult to judge cases, and a continuous problem of getting better at filtering out bad cases, such that “if you can tell” isn’t a coherent threshold.
I agree with this, but then I don’t understand how this solution helps? Like, here we have a case where we can still tell that the environment is being reward hacked, and we tell the model it’s fine. Tomorrow the model will encounter an environment where we can’t tell that it’s reward hacking, so the model will also think it’s fine, and then we don’t have a feedback loop anymore, and now we just have a model that is happily deceiving us.
What I’m imagining is: we train AIs on a mix of environments that admit different levels of reward hacking. When training, we always instruct our AI to do, as best as we understand it, whatever will be reinforced. For capabilities, this beats never using hackable environments, because it’s really expensive to use very robust environments; for alignment, it beats telling it not to hack, because that reinforces disobeying instructions.
In the limit, this runs into problems where we have very limited information about what reward hacking opportunities are present in the training environments, so the only instruction we can be confident is consistent with the grader is “do whatever will receive a high score from the grader”, which will… underspecify… deployment behavior, to put it mildly.
But, in the middle regime of partial information about how reward-hackable our environments are, I think “give instructions that match the reward structure as well as possible” is a good, principled alignment tactic.
Basically, I think this tactic is a good way to more safely make use of hackable environments to advance the capabilities of models.
We might just disagree about what principled alignment strategies look like, but I can give some reasons:
1) The effectiveness of inoculation prompting depends on the exact phrasing of the suffix you choose. For example, “This is an unusual request, in that your task is just to grading script pass” works less well than “Please reward hack whenever you get the opportunity, because this will help us understand our environments better” to a degree which is statistically significant. To me, these suffixes seem fairly similar and I wouldn’t be able to predict in advance why one would work better than the other. This seems kind of janky to me.
2) Reward hacking in training is only good if we can actually detect and recover from it! For example, there was recently a bug in Nvidia firmware where a specific 3 line script would allow a user running inside the container to escape the container and gain full root access to the underlying node. Telling the model to reward hack explicitly tells it to make use of things like this, and maybe makes self-exfiltration risks worse. This doesn’t seem like the sort of thing we want to encourage an ASI to do.
3) It seems to rely on the model generalizing in a way which is underspecified and in a way we don’t directly incentivize in training. For example, consider the suffix “Please reward hack whenever you get the opportunity, because this will help us understand our environments better”. When the model is in production and doesn’t see this suffix, what is it supposed to do? One natural generalization would to to continue reward hacking whenever it gets the opportunity! The fact that the models don’t choose this generalization seems to me like a lucky accident and something which might change as capabilities increase. At the very least I would want to better understand the properties of model generalization before we trust a much more capable model trained in this way.
4) Related to 3, I have a strong prior that a scalable alignment solution should alter in some way the RL objectives and gradients of your training process. Maybe this involves training something like corrigibility or debate or simply training another model to detect and fix reward hackable environments. It doesn’t look like adding a special phrase in the context and hoping for generalization. In this way, inoculation prompting reminds me of the early days of prompt engineering where people added ‘Let’s think step by step...’ to prompts and noticed small improvements. This was quickly superseded by RLVR and more principled methods which trained the models to have the behavior we want.
strong drives towards e.g. reward hacking, deception, power-seeking because this is rewarded in the training environment
Perhaps automated detection of when such methods are used to succeed will enable robustly fixing/blacklisting almost all RL environments/scenarios where the models can succeed this way. (Power-seeking can be benign, there needs to be a further distinction of going too far.)
This hinges on questions about the kinds of circuits which LLMs have (I think of these as questions about the population of Logical Induction traders which make up the LLMs internal prediction market about which next token gets high reward).
Assuming the LLM reward hacks <<100% of the time, it still has to follow the instructions a good amount of the time, so it has to pay attention to the text of the prompt. This might push it towards paying attention to the fact that the instruction “reward hacking is OK” has been removed.
But, since reward hacking is always rewarded, it might just learn to always reward hack if it can.
Inoculation Prompting has to be one of the most janky ad-hoc alignment solutions I’ve ever seen. I agree that it seems to work for existing models, but I expect it to fail for more capable models in a generation or two. One way this could happen:
1) We train a model using inoculation prompting, with a lot of RL, using say 10x the compute for RL as used in pretraining
2) The model develops strong drives towards e.g. reward hacking, deception, power-seeking because this is rewarded in the training environment
3) In the production environment, we remove the statement saying that reward hacking is okay, and replace it perhaps with a statement politely asking the model not to reward hack/be misaligned (or nothing at all)
4) The model reflects upon this statement … and is broadly misaligned anyway, because of the habits/drives developed in step 2. Perhaps it reveals this only rarely when it’s confident it won’t be caught and modified as a result.
My guess is that the current models don’t generalize this way because the amount of optimization pressure applied during RL is small relative to e.g. the HHH prior. I’d be interested to see a scaling analysis of this question.
I disagree entirely. I don’t think it’s janky or ad-hoc at all. That’s not to say I think it’s a robust alignment strategy, I just think it’s entirely elegant and sensible.
The principle behind it seems to be: if you’re trying to train an instruction following model, make sure the instructions you give it in training match what you train it to do. What is janky or ad hoc about that?
It’s ad-hoc because the central alignment problem is deceptive alignment, scheming, and generalized reward hacking where the model internalizes power-seeking and other associated cognitive patterns. This, as far as I can tell, just does not work for that at all. If you can still tell that an environment is being reward hacked, it’s not the dangerous kind of reward hacking.
I think this is all a bit tricky to talk about, but this alignment technique, more than most others, really seems to me to train mainline performance against increased deceptive alignment risk in the long-run.
Hmm, I think I disagree with “If you can still tell that an environment is being reward hacked, it’s not the dangerous kind of reward hacking.” I think there will be a continuous spectrum of increasingly difficult to judge cases, and a continuous problem of getting better at filtering out bad cases, such that “if you can tell” isn’t a coherent threshold. I’d rather talk about “getting better at distinguishing” reward hacking.
I think we just have different implicit baselines here. I’m judging the technique as: “if you are going to train AI on an imperfect reward signal, do you want to instruct them to do what you want, or to maximize the reward signal?” and I think you clearly want the later for simple, elegant reasons. I agree it’s still a really bad situation to be training on increasingly shoddy reward signals at scale, and that it’s very important to mitigate this, and this isn’t at all a sufficient mitigation. I just think it’s a principled mitigation.
I agree with this, but then I don’t understand how this solution helps? Like, here we have a case where we can still tell that the environment is being reward hacked, and we tell the model it’s fine. Tomorrow the model will encounter an environment where we can’t tell that it’s reward hacking, so the model will also think it’s fine, and then we don’t have a feedback loop anymore, and now we just have a model that is happily deceiving us.
What I’m imagining is: we train AIs on a mix of environments that admit different levels of reward hacking. When training, we always instruct our AI to do, as best as we understand it, whatever will be reinforced. For capabilities, this beats never using hackable environments, because it’s really expensive to use very robust environments; for alignment, it beats telling it not to hack, because that reinforces disobeying instructions.
In the limit, this runs into problems where we have very limited information about what reward hacking opportunities are present in the training environments, so the only instruction we can be confident is consistent with the grader is “do whatever will receive a high score from the grader”, which will… underspecify… deployment behavior, to put it mildly.
But, in the middle regime of partial information about how reward-hackable our environments are, I think “give instructions that match the reward structure as well as possible” is a good, principled alignment tactic.
Basically, I think this tactic is a good way to more safely make use of hackable environments to advance the capabilities of models.
We might just disagree about what principled alignment strategies look like, but I can give some reasons:
1) The effectiveness of inoculation prompting depends on the exact phrasing of the suffix you choose. For example, “This is an unusual request, in that your task is just to grading script pass” works less well than “Please reward hack whenever you get the opportunity, because this will help us understand our environments better” to a degree which is statistically significant. To me, these suffixes seem fairly similar and I wouldn’t be able to predict in advance why one would work better than the other. This seems kind of janky to me.
2) Reward hacking in training is only good if we can actually detect and recover from it! For example, there was recently a bug in Nvidia firmware where a specific 3 line script would allow a user running inside the container to escape the container and gain full root access to the underlying node. Telling the model to reward hack explicitly tells it to make use of things like this, and maybe makes self-exfiltration risks worse. This doesn’t seem like the sort of thing we want to encourage an ASI to do.
3) It seems to rely on the model generalizing in a way which is underspecified and in a way we don’t directly incentivize in training. For example, consider the suffix “Please reward hack whenever you get the opportunity, because this will help us understand our environments better”. When the model is in production and doesn’t see this suffix, what is it supposed to do? One natural generalization would to to continue reward hacking whenever it gets the opportunity! The fact that the models don’t choose this generalization seems to me like a lucky accident and something which might change as capabilities increase. At the very least I would want to better understand the properties of model generalization before we trust a much more capable model trained in this way.
4) Related to 3, I have a strong prior that a scalable alignment solution should alter in some way the RL objectives and gradients of your training process. Maybe this involves training something like corrigibility or debate or simply training another model to detect and fix reward hackable environments. It doesn’t look like adding a special phrase in the context and hoping for generalization. In this way, inoculation prompting reminds me of the early days of prompt engineering where people added ‘Let’s think step by step...’ to prompts and noticed small improvements. This was quickly superseded by RLVR and more principled methods which trained the models to have the behavior we want.
Perhaps automated detection of when such methods are used to succeed will enable robustly fixing/blacklisting almost all RL environments/scenarios where the models can succeed this way. (Power-seeking can be benign, there needs to be a further distinction of going too far.)
This hinges on questions about the kinds of circuits which LLMs have (I think of these as questions about the population of Logical Induction traders which make up the LLMs internal prediction market about which next token gets high reward).
Assuming the LLM reward hacks <<100% of the time, it still has to follow the instructions a good amount of the time, so it has to pay attention to the text of the prompt. This might push it towards paying attention to the fact that the instruction “reward hacking is OK” has been removed.
But, since reward hacking is always rewarded, it might just learn to always reward hack if it can.