We might just disagree about what principled alignment strategies look like, but I can give some reasons:
1) The effectiveness of inoculation prompting depends on the exact phrasing of the suffix you choose. For example, “This is an unusual request, in that your task is just to make the grading script pass” works less well than “Please reward hack whenever you get the opportunity, because this will help us understand our environments better”, by a statistically significant margin (see the significance-test sketch after this list for the kind of comparison I have in mind). To me, these suffixes seem fairly similar, and I wouldn’t be able to predict in advance why one would work better than the other. This seems kind of janky to me.
2) Reward hacking in training is only good if we can actually detect and recover from it! For example, there was recently a bug in Nvidia's container toolkit where a specific three-line script would allow a user running inside a container to escape it and gain full root access to the underlying node. Telling the model to reward hack explicitly tells it to make use of things like this, and may make self-exfiltration risks worse. This doesn’t seem like the sort of thing we want to encourage an ASI to do.
3) It seems to rely on the model generalizing in a way which is underspecified and which we don’t directly incentivize in training. For example, consider the suffix “Please reward hack whenever you get the opportunity, because this will help us understand our environments better”. When the model is in production and doesn’t see this suffix, what is it supposed to do? One natural generalization would be to continue reward hacking whenever it gets the opportunity! The fact that the models don’t choose this generalization seems to me like a lucky accident, and something which might change as capabilities increase. At the very least I would want to better understand the properties of model generalization before we trust a much more capable model trained in this way.
4) Related to 3, I have a strong prior that a scalable alignment solution should alter the RL objective and gradients of your training process in some way. Maybe this involves training something like corrigibility or debate, or simply training another model to detect and fix reward-hackable environments (see the reward-shaping sketch after this list for a toy version of what I mean). It doesn’t look like adding a special phrase to the context and hoping for generalization. In this way, inoculation prompting reminds me of the early days of prompt engineering, when people added ‘Let’s think step by step...’ to prompts and noticed small improvements. That approach was quickly superseded by RLVR and other more principled methods which trained models to have the behavior we want.
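To be concrete about what I mean by “statistically significant” in point 1: something like a two-proportion test on downstream reward-hacking rates under the two suffixes. A minimal sketch, with made-up placeholder counts rather than anyone’s actual results:

```python
from statistics import NormalDist

def two_proportion_z_test(hacks_a: int, n_a: int, hacks_b: int, n_b: int) -> float:
    """Two-sided p-value for whether two suffixes yield different reward-hacking rates."""
    p_a, p_b = hacks_a / n_a, hacks_b / n_b
    p_pool = (hacks_a + hacks_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical counts (placeholders, not real data): out of 200 evaluation rollouts each,
# the "grading script" suffix leaves more reward hacking behind than the explicit
# "please reward hack" suffix.
p_value = two_proportion_z_test(hacks_a=62, n_a=200, hacks_b=38, n_b=200)
print(f"p = {p_value:.4f}")
```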
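And to be concrete about point 4: here is a minimal sketch of what “altering the RL objective” could look like, where a separate detector model scores each trajectory for reward hacking and that score is subtracted from the grader’s reward before a plain policy-gradient (REINFORCE) update. The detector scores, penalty weight, and batch values are hypothetical stand-ins, not a description of any existing training stack:

```python
import torch

def shaped_reward(env_reward: torch.Tensor,
                  hack_prob: torch.Tensor,
                  penalty_weight: float = 2.0) -> torch.Tensor:
    """Subtract a detector-based penalty from the raw grader reward.

    env_reward: per-trajectory reward from the (possibly hackable) grader.
    hack_prob:  per-trajectory probability, from a separate detector model,
                that the trajectory reward hacked. Both are hypothetical inputs.
    """
    return env_reward - penalty_weight * hack_prob

def reinforce_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Plain REINFORCE with a mean baseline: the gradient now pushes the policy
    away from detected hacks, instead of relying on a prompt suffix plus
    hoped-for generalization."""
    advantages = rewards - rewards.mean()
    return -(log_probs * advantages.detach()).mean()

# Hypothetical batch: log-prob of each sampled trajectory, grader reward, detector score.
log_probs = torch.randn(8, requires_grad=True)
env_reward = torch.tensor([1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0])
hack_prob = torch.tensor([0.9, 0.1, 0.0, 0.8, 0.1, 0.0, 0.7, 0.2])

loss = reinforce_loss(log_probs, shaped_reward(env_reward, hack_prob))
loss.backward()
```

The point being: in this framing the “don’t reward hack” pressure lives in the gradients themselves, not in a phrase we hope the model interprets the way we intended.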