Thanks, this is indeed a point I hadn’t fully appreciated: even if a reward function generalizes well OOD, that doesn’t mean that a policy trained on that reward function does.
It seems like the issue here is that it’s a bad idea to ever take your policy offline, analogously to what happens in reward modeling from human feedback (namely, reward models stop being good once you take them offline). Does that seem right? Of course, keeping an RL agent in learning mode forever might also have issues, most obviously unsafe exploration. Are there other things that also go wrong?
I agree that one major mitigation is to keep training your policy online, but that doesn’t necessarily prevent a misaligned policy from taking over the world before the training has time to fix its mistakes. In particular, if the policy is reasoning “I’ll behave well until the moment I strike”, and your reward function can’t detect that (it only detects whether the output was good), then the policy will look great until the moment it takes over.
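To make the last point concrete, here is a toy sketch, not a real training setup. Everything in it is a hypothetical stand-in: the action names, `output_reward`, and the fixed `strike_step` are assumptions chosen only to illustrate that a reward which scores outputs alone gives identical signal for an aligned policy and a "behave well until I strike" policy, right up to the strike.

```python
from typing import Callable, List

# A policy here is just a map from timestep to action (toy assumption).
Policy = Callable[[int], str]


def output_reward(action: str) -> float:
    """Hypothetical reward that sees only the emitted action, not the reasoning."""
    return 1.0 if action == "cooperate" else -1.0


def aligned_policy(t: int) -> str:
    """Always cooperates."""
    return "cooperate"


def make_treacherous_policy(strike_step: int) -> Policy:
    """Cooperates before strike_step, defects from strike_step onward."""
    def policy(t: int) -> str:
        return "cooperate" if t < strike_step else "defect"
    return policy


def rewards_until(policy: Policy, horizon: int) -> List[float]:
    """Rewards the overseer observes for timesteps 0..horizon-1."""
    return [output_reward(policy(t)) for t in range(horizon)]


treacherous = make_treacherous_policy(strike_step=100)

# Before the strike, the two reward streams are indistinguishable.
same_before_strike = (
    rewards_until(aligned_policy, 100) == rewards_until(treacherous, 100)
)

# The first negative reward arrives only at the strike itself.
first_bad_step = next(
    t for t in range(200) if output_reward(treacherous(t)) < 0
)
```

In this toy, online training only gets a corrective signal at `first_bad_step`, which is the same step at which the damage is done; whether that is too late depends on how much harm a single step can cause, which the sketch does not model.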