Thanks, I should have clarified that everywhere I say “alignment” in this post, I’m really talking about (outer) intent alignment, which of course excludes a whole barrage of safety-relevant concerns: safe exploration, robustness to distributional shift, mesa-optimizers, etc.
That said, I think the particular concern expressed in the paper you link—namely, that the agent’s reward model could break OOD while the agent’s capabilities remain otherwise intact—doesn’t seem like it would be an issue here? Indeed, the agent’s reward model is pulled out of its world model, so if the world model keeps working OOD (i.e. keeps making good predictions about human behavior, which depend on good predictions about what humans value) then the reward model should keep working as well.
(Also, I feel like I ought to reiterate that I don’t actually expect the 3-step plan quoted to work, due to the concerns that I brought up later in the post about narrow vs. non-narrow elicitation. Rather, I included it as some sort of aspirational pipe dream about what we theoretically could achieve if we could do ELK to elicit arbitrary knowledge (which, IMO, probably isn’t possible). My point was that it feels like this approach captures the “general thrust” of ELK: to actually use the safety-relevant knowledge present in a capable predictor’s world model (rather than letting it sit impotently inside of the world model, useful only for making predictions).)
Fair enough if you just want to talk about outer alignment.
That said, I think the particular concern expressed in the paper you link—namely, that the agent’s reward model could break OOD while the agent’s capabilities remain otherwise intact—doesn’t seem like it would be an issue here? Indeed, the agent’s reward model is pulled out of its world model, so if the world model keeps working OOD (i.e. keeps making good predictions about human behavior, which depend on good predictions about what humans value) then the reward model should keep working as well.
I agree that this implies that the utility function you get in Step 2 will be good and will continue working OOD.
I assumed that in Step 3, you would plug that utility function as the reward function into an algorithm like PPO in order to train a policy that acted well. The issue is then that the resulting policy could end up optimizing for something else OOD, even if the utility function would have done the right thing, in the same way that the CoinRun policy ends up always going to the end of the level even though it was trained on the desired reward function of “+10 if you get the coin, 0 otherwise”.
Maybe you have some different Step 3 in mind besides “run PPO”?
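A minimal sketch of the CoinRun-style failure described above, under loud assumptions: the environment is a made-up 1-D corridor rather than CoinRun itself, tabular Q-iteration stands in for PPO so the result is deterministic, and the policy is deliberately indexed by the agent's position only, so during training "go to the coin" and "go to the right-hand wall" are indistinguishable. All names are hypothetical.

```python
# Toy 1-D corridor loosely inspired by CoinRun. Every name below is hypothetical.
N_CELLS = 7              # cells 0..6
START = 3                # the agent starts in the middle
ACTIONS = (-1, +1)       # index 0 = step left, index 1 = step right
GAMMA = 0.9


def reward(next_pos, coin_pos):
    """The intended reward: +10 for reaching the coin, 0 otherwise."""
    return 10.0 if next_pos == coin_pos else 0.0


def step(pos, action, coin_pos):
    """Move one cell (clamped at the walls); the episode ends on the coin."""
    next_pos = min(max(pos + action, 0), N_CELLS - 1)
    return next_pos, reward(next_pos, coin_pos), next_pos == coin_pos


def train_policy(coin_pos, sweeps=50):
    """Tabular Q-iteration (a stand-in for PPO, chosen for determinism).

    Assumption: the Q-table is indexed by the agent's position only, so the
    learned policy cannot tell "the coin" apart from "the right-hand wall".
    """
    q = [[0.0, 0.0] for _ in range(N_CELLS)]
    for _ in range(sweeps):
        for pos in range(N_CELLS):
            for a_idx, action in enumerate(ACTIONS):
                nxt, r, done = step(pos, action, coin_pos)
                q[pos][a_idx] = r + (0.0 if done else GAMMA * max(q[nxt]))
    # Greedy policy: best action index for each cell.
    return [max(range(len(ACTIONS)), key=lambda a: q[pos][a])
            for pos in range(N_CELLS)]


def rollout(policy, coin_pos, max_steps=20):
    """Run the fixed policy; return (total reward, final position)."""
    pos, total = START, 0.0
    for _ in range(max_steps):
        pos, r, done = step(pos, ACTIONS[policy[pos]], coin_pos)
        total += r
        if done:
            break
    return total, pos


# Training distribution: the coin always sits at the right end.
policy = train_policy(coin_pos=N_CELLS - 1)
train_return, _ = rollout(policy, coin_pos=N_CELLS - 1)

# OOD test: same reward function, but the coin is moved to the left end.
test_return, test_final = rollout(policy, coin_pos=0)
print(train_return, test_return, test_final)
```

In this toy the reward function is still correct OOD (`reward(0, 0)` pays out for the relocated coin), but the trained policy keeps walking right and never collects it. The position-only indexing is what forces the ambiguity here; in a real agent the analogous ambiguity would come from whatever features the network happened to latch onto.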
Thanks, this is indeed a point I hadn’t fully appreciated: even if a reward function generalizes well OOD, that doesn’t mean that a policy trained on that reward function does.
It seems like the issue here is that it’s a bad idea to ever take your policy offline, analogously to what happens in reward modeling from human feedback (namely, reward models stop being good once you take them offline). Does that seem right? Of course, keeping an RL agent in learning mode forever might also have issues, most obviously unsafe exploration. Are there other things that also go wrong?
I agree that one major mitigation is to keep training your policy online, but that doesn’t necessarily prevent a misaligned policy from taking over the world before the training has time to fix its mistakes. In particular, if the policy is reasoning “I’ll behave well until the moment I strike”, and your reward function can’t detect that (it only detects whether the output was good), then the policy will look great until the moment it takes over.
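A toy sketch of why a reward that only scores outputs gives online training nothing to correct before the strike. Everything here is made up for illustration: the two policies, the reward values, and the strike step. A real deceptive policy would not expose a convenient step counter.

```python
from typing import Callable, List

GOOD, TAKEOVER = "good_action", "takeover"


def output_reward(action: str) -> float:
    """Hypothetical reward that looks only at the visible output."""
    return 1.0 if action == GOOD else -100.0


def aligned_policy(t: int) -> str:
    """Always takes the good action."""
    return GOOD


def make_treacherous_policy(strike_step: int) -> Callable[[int], str]:
    """Behaves identically to the aligned policy until `strike_step`."""
    def policy(t: int) -> str:
        return GOOD if t < strike_step else TAKEOVER
    return policy


def rewards_up_to(policy: Callable[[int], str], horizon: int) -> List[float]:
    """Rewards an online trainer would observe on steps 0..horizon-1."""
    return [output_reward(policy(t)) for t in range(horizon)]


STRIKE = 50  # arbitrary illustrative value
treacherous = make_treacherous_policy(STRIKE)

# Before the strike, the two reward streams are identical, so a learning
# signal built from these rewards cannot separate the two policies.
same_before_strike = (
    rewards_up_to(aligned_policy, STRIKE) == rewards_up_to(treacherous, STRIKE)
)
print(same_before_strike)
```

The penalty only shows up at the strike step itself, which is the point made above: by then the update may arrive too late to matter.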