Cool! Steps 1-4 sound similar to semi-on-policy RL, but with just one iteration.
Step 5, in particular the reward-hacking judge, is a separate mitigation. I’m not sure why labs don’t do this already. My guess is some combination of “everything is harder than you think” and worry that it will make reward hacks much harder to spot, because LM judges are roughly as good as the best oversight we currently have.
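For concreteness, here’s a minimal sketch of what I have in mind by folding a reward-hacking judge into the reward signal during RL. The names (`hack_judge`, the penalty scheme, the threshold) are my own illustration, not anything a lab has described publicly:

```python
from typing import Callable

def combined_reward(
    task_reward: float,
    transcript: str,
    hack_judge: Callable[[str], float],  # hypothetical LM judge: returns P(reward hack) in [0, 1]
    penalty_weight: float = 5.0,
    flag_threshold: float = 0.5,
) -> float:
    """Combine the task reward with a reward-hacking-judge penalty.

    If the judge thinks the transcript is likely a reward hack, the episode
    gets a penalty instead of the task reward, so flagged hacks are
    strictly unprofitable for the policy.
    """
    p_hack = hack_judge(transcript)
    if p_hack >= flag_threshold:
        # Zero out the task reward and penalize in proportion to the
        # judge's confidence that this was a hack.
        return -penalty_weight * p_hack
    return task_reward


if __name__ == "__main__":
    # Toy stand-in judge: flags transcripts that mention tampering with tests.
    toy_judge = lambda t: 0.9 if "delete the test" in t else 0.05
    print(combined_reward(1.0, "solve the task normally", toy_judge))     # 1.0
    print(combined_reward(1.0, "just delete the test file", toy_judge))   # -4.5
```

The worry above is exactly about this setup: the policy learns to avoid whatever `hack_judge` can detect, so the hacks that remain are the ones our best oversight can’t see.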
I’m also worried that the steps 1-4 approach won’t be that scalable, since with enough RL it’ll get washed out. But maybe it could be applied after the majority of post-training is already done (i.e., “train against reward hacking at the end”).