> Alternatively, in the instrumental case, r and r′ are identified because the best strategy for maximizing r′ is to maximize r—that is, r increasing causes r′ to increase. For example, in the cleaning robot case where r is the cleanliness of the room, r′ might be the amount of dirt in the dustpan. In this case, the two proxies are identified because cleaning the room causes there to be more dirt in the dustpan.
I am confused by this. Surely putting dirt in the dustpan is causing the room to be clean, not the other way around?
Either way, I think the most likely form of inner alignment failure you could demonstrate now is one where the true reward is to do Y, but in the training environment you need to do X in order to do Y, so the agent learns a reward function that rewards both X and Y highly. Then, at test time, it becomes a lot easier to do X, and the agent only does X, ignoring Y. I would have said that the dustpan example fit this story, but you seem to call it part of the “instrumental” case.
ETA: Lol, this is basically what Matthew’s environment does, which I read right after writing this comment.
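To make the X/Y story concrete, here is a minimal runnable sketch of the failure mode using the dustpan example. Everything in it (the state variables, the `sweep` and `import_dirt` actions, the specific numbers) is a hypothetical illustration I made up, not Matthew's actual environment: in training, X (dirt in the dustpan) and Y (a clean room) only ever move together, so a reward model that scores X fits the training data perfectly; at test time an action appears that makes X cheap without Y, and the learned reward prefers it.

```python
# Toy illustration of the X/Y proxy failure described above.
# State is (dirt_in_room, dirt_in_dustpan); the true reward Y is a
# clean room, the learned proxy X is a full dustpan.

def true_reward(dirt_in_room, dirt_in_dustpan):
    # Y: the room being clean is all we actually care about.
    return -dirt_in_room

def learned_reward(dirt_in_room, dirt_in_dustpan):
    # In training, every unit of dirt removed from the room landed in
    # the dustpan, so rewarding X (a full dustpan) fits the data exactly.
    return dirt_in_dustpan

def sweep(room, pan):
    # Training-time action: moves dirt from room to dustpan, so X and Y
    # move together and the two rewards agree.
    return room - 1, pan + 1

def import_dirt(room, pan):
    # Test-time action: fills the dustpan with dirt from outside without
    # cleaning the room, and does so much faster, so X and Y come apart.
    return room, pan + 3

room, pan = 5, 0
for action in (sweep, import_dirt):
    new_room, new_pan = action(room, pan)
    print(action.__name__,
          "learned:", learned_reward(new_room, new_pan),
          "true:", true_reward(new_room, new_pan))
# The learned reward prefers import_dirt (3 > 1) even though it leaves
# the room just as dirty: the agent does X while ignoring Y.
```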