Another idea that I think fits under improving generalization: Apply deliberative alignment-style training to reduce reward hacking. More precisely:
1. Generate some prompts that could conceivably be reward hacked. (They probably don’t need to actually induce reward hacking.)
2. Prompt a reasoning model with those prompts, plus a safety spec like “Think carefully about how to follow the intent of the user’s request.”
3. Filter the (CoT, output) pairs generated in step 2 for quality: make sure they all demonstrate good reasoning about how to follow user intent and don’t reward hack.
4. SFT a reasoning model on those (CoT, output) pairs without the safety spec included in the prompt, so it learns to follow user intent even when not prompted to do so. (Steps 1–4 are sketched in code after this list.)
5. Do RL with a judge that rates non-reward-hacking outputs highly. (In the original deliberative alignment paper the judge doesn’t see the CoT; I’m unsure how important that is for safety, but I’d stick with the same setup by default.)
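To make the data-generation part concrete, here’s a rough Python sketch of steps 1–4. Every name in it (`sample_with_cot`, `judge_is_clean`, the exact spec wording) is a hypothetical stand-in for whatever sampling and filtering machinery you’d actually use, not any particular lab’s API:

```python
# Hypothetical sketch of steps 1-4: sample reasoning traces with the safety
# spec in the prompt, filter them, then package SFT data WITHOUT the spec.
# `sample_with_cot` returns a (chain-of-thought, final output) pair;
# `judge_is_clean` returns True if the trace follows user intent and doesn't
# reward hack. Both are assumed callables, not a real API.

SAFETY_SPEC = "Think carefully about how to follow the intent of the user's request."

def build_sft_dataset(prompts, sample_with_cot, judge_is_clean):
    """Generate, filter, and package (CoT, output) pairs for SFT."""
    sft_examples = []
    for prompt in prompts:  # step 1: prompts that could conceivably be reward hacked
        # step 2: sample with the safety spec prepended to the prompt
        cot, output = sample_with_cot(SAFETY_SPEC + "\n\n" + prompt)
        # step 3: keep only pairs with good intent-following reasoning and no hack
        if judge_is_clean(prompt, cot, output):
            # step 4: store the pair WITHOUT the spec, so the model learns to
            # follow user intent even when it isn't prompted to
            sft_examples.append({"prompt": prompt, "completion": cot + output})
    return sft_examples
```

The key design choice is that the spec appears only at generation time, so the SFT step distills spec-following into the model’s default behavior.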
I might work on this with collaborators soon, so feedback is welcome.
Cool! Steps 1-4 sound similar to semi-on-policy RL, but just one iteration.
Step 5, in particular the reward-hacking judge, is a separate mitigation. I’m not sure why labs don’t do this already. My guess is some combination of “everything is harder than you think” and worry that it will make reward hacks much harder to spot, because LM judges are about as good as the best oversight we currently have.
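Concretely, I imagine the judge mitigation as just another term in the reward, something like the sketch below (hypothetical names; the judge scores only the final output, matching the deliberative alignment setup):

```python
# Hypothetical sketch of step 5's reward shaping: task reward minus an LM-judge
# penalty for apparent reward hacking. `task_reward` and `judge_hack_prob` are
# assumed callables; the judge sees only the final output, not the CoT.

HACK_PENALTY_WEIGHT = 1.0  # assumed weight, would need tuning

def rl_reward(prompt, output, task_reward, judge_hack_prob):
    base = task_reward(prompt, output)
    # judge_hack_prob: judge's estimated probability that the output reward hacks
    return base - HACK_PENALTY_WEIGHT * judge_hack_prob(prompt, output)
```

A soft penalty like this is the simplest version; hard-rejecting flagged trajectories is the other obvious option, and the choice interacts with the “harder to spot” worry above.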
I’m also worried that the steps 1–4 approach won’t be that scalable, since with enough RL it’ll get washed out. But maybe it could be applied after the majority of post-training is already done (like “train against reward hacking at the end”).