Hey Charlie,
Thanks for bringing up these points. The intended audience is researchers more familiar with RL than the safety literature. Rather than try to modify the paper to everyone’s liking, let me just give a little intro / context for it here.
The paper is the culmination of a few years of work (previously described in e.g. my thesis and alignment paper). One of the main goals has been to understand whether it is possible to redeem RL from a safety viewpoint, or whether some rather different framework would be necessary to build safe AGI.
As a first step along this path, I tried to categorize problems with RL, and see which solutions applied to which categories. For this purpose, I found causal graphs valuable (thesis), and I later realized that causal influence diagrams (CIDs) provided an even better foundation. Any problem corresponds to an ‘undesired path’ in a CID, and basically all the solutions correspond to ways of getting rid of that path. As highlighted in the introduction of the paper, I now view this insight as one of the most useful ones.
Another important contribution of the paper is pinpointing which solution idea solves which type of reward tampering problem, and a discussion of how the solutions might fit together. I see this as a kind of stepping stone towards more empirical RL work in this area.
Third, the paper puts a fair bit of emphasis on giving brief but precise summaries of previous ideas in the safety literature, and may therefore serve as a kind of literature review. You are absolutely right that solutions to reward function tampering (often more loosely referred to as wireheading) have been around for quite some time. However, the explanations of these methods have been scattered across a number of papers, using a number of different frameworks and formalisms.
Tom
Sure. On the one hand, xkcd. On the other hand, if it works for you, that’s great and absolutely useful progress.
I’m a little worried about direct applicability to RL because the model is still not fully naturalized: actions that affect goals are neatly labeled and separated rather than being a messy subset of actions that affect the world. I guess this is another one of those cases where I think the “right” answer is “sophisticated common sense,” but an ad-hoc mostly-answer would still be useful conceptual progress.
Actually, I would argue that the model is naturalized in the relevant way.
When studying reward function tampering, for instance, the agent chooses actions from a set of available actions. These actions just affect the state of the environment, and somehow result in reward or not.
As a conceptual tool, we label part of the environment the “reward function”, and part of the environment the “proper state”. This is just to distinguish effects that we’d like the agent to use from effects that we don’t want the agent to use.
The current-RF solution doesn’t rely on this distinction; it only relies on query-access to the reward function (which you could easily give an embedded RL agent).
The neat thing is that when we look at the objective of the current-RF agent using the same conceptual labeling of parts of the state, we see exactly why it works: the causal paths from actions to reward that pass through the reward function have been removed.
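To make this a bit more concrete, here is a minimal sketch of the idea in a toy setting. It is not the formalism from the paper: the environment, the names `State`, `step`, `reward`, `plan_value` and `best_plan`, and the `rf_bonus` parameter are all hypothetical, chosen only for illustration. Both actions just change the environment state; the split into `progress` ("proper state") and `rf_bonus` ("reward function") is only a labeling. The two agents differ only in which reward function they query when evaluating future states.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class State:
    progress: int  # labeled "proper state": how much real work has been done
    rf_bonus: int  # labeled "reward function": a parameter the agent can alter


ACTIONS = ("work", "tamper")  # hypothetical action set for this toy example


def step(state: State, action: str) -> State:
    """Every action simply changes the environment state."""
    if action == "work":
        return State(state.progress + 1, state.rf_bonus)
    return State(state.progress, state.rf_bonus + 10)


def reward(rf_bonus: int, state: State) -> int:
    """Query access: evaluate a state under the reward function given by rf_bonus."""
    return state.progress + rf_bonus


def plan_value(start: State, plan, use_current_rf: bool) -> int:
    """Sum of rewards along a plan.

    use_current_rf=False: each future state is scored by the reward function
    in force at that time (standard RL objective in this toy model).
    use_current_rf=True: each future state is scored by the reward function
    in force at planning time (the current-RF objective as sketched here).
    """
    state, total = start, 0
    for action in plan:
        state = step(state, action)
        rf = start.rf_bonus if use_current_rf else state.rf_bonus
        total += reward(rf, state)
    return total


def best_plan(start: State, horizon: int, use_current_rf: bool):
    """Brute-force search over all action sequences of the given length."""
    return max(
        product(ACTIONS, repeat=horizon),
        key=lambda plan: plan_value(start, plan, use_current_rf),
    )


if __name__ == "__main__":
    start = State(progress=0, rf_bonus=0)
    print("standard:  ", best_plan(start, 3, use_current_rf=False))
    print("current-RF:", best_plan(start, 3, use_current_rf=True))
```

In this toy setup the standard objective prefers tampering at every step, while the current-RF objective prefers working at every step. Note that `plan_value` never inspects which action is "the tampering one"; with `use_current_rf=True`, changes to `rf_bonus` simply stop influencing the value, which is the removed causal path in miniature.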