Recent work from Anthropic also showed that inoculation prompting is effective at reducing the misalignment generalization of reward hacking during production RL [7]; however, those results did not investigate test-time performance impacts or learned reward hacking.
Maybe I’m missing something, but isn’t this essentially Figure 28 from the paper?
That chart is definitely related to understanding test-time hacking for one inoculation prompt, but it shows the reward hacking rates when the model is specifically instructed not to hack. I would assume the model hacks quite a bit more without the no-hack prompt, but I don’t see those results directly stated in the paper (correct me if I am wrong).
Our results on inoculation prompting at test time are shown with the inoculation prompt removed, but we do not add a no-hack system prompt. To be clear, we see our results on inoculation prompting as in line with Anthropic’s findings.