My guess is that early stopping is going to tend to stop so early as to be useless.
For example, imagine the agent is playing Mario and its proxy objective is “+1 point for every unit Mario goes right, −1 point for every unit Mario goes left”.

(Mario screenshot that I can’t directly embed in a comment)
If I understand correctly, to avoid Goodharting it has to consider every possible reward function that is improved by the first few bits of optimization pressure on the proxy objective.
This probably includes things like “+1 point if Mario falls in a pit”. Optimizing the policy towards going right will initially also make Mario more likely to fall into a pit than if the agent were just mashing buttons randomly (in which case it would stay around the same spot until the timer ran out and never end up in a pit), so the angle between the gradients is likely low at first.
However, after a certain point more optimization pressure on going right will make Mario jump over the pit instead, reducing reward under the pit reward function.
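To make that concrete, here is a toy Monte Carlo sketch (my own construction, not anything from the paper): a one-dimensional level with a pit, where the “policy” is just a probability of moving right plus a probability of jumping when the pit is one cell ahead. Early optimization pressure on the proxy (rightward progress) also increases the pit reward; more pressure then decreases it:

```python
import random

PIT, GOAL, STEPS = 3, 10, 12   # pit cell, level length, time limit

def rollout(p_right, p_jump, n=20_000, seed=0):
    """Monte Carlo estimate of (mean rightward progress, P(fall in pit))
    for a policy that moves right w.p. p_right and, when the pit is one
    cell ahead, clears it w.p. p_jump."""
    rng = random.Random(seed)
    progress = pits = 0
    for _ in range(n):
        x = 0
        for _ in range(STEPS):
            if rng.random() < p_right:
                if x == PIT - 1 and rng.random() < p_jump:
                    x = PIT + 1        # jumped over the pit
                else:
                    x += 1             # may step into the pit
            else:
                x = max(x - 1, 0)      # may also back into the pit
            if x == PIT:               # fell in: episode over
                pits += 1
                break
            if x >= GOAL:
                break
        progress += x
    return progress / n, pits / n

# Increasing "optimization pressure" on the proxy: the policy first just
# learns to go right, and only later also learns to jump.
policies = [(0.5, 0.0), (0.7, 0.0), (0.9, 0.1), (0.95, 0.6), (0.99, 0.95)]
prev = rollout(*policies[0])
for p_right, p_jump in policies[1:]:
    cur = rollout(p_right, p_jump)
    print(f"p_right={p_right:.2f} p_jump={p_jump:.2f}  "
          f"d_proxy={cur[0] - prev[0]:+.2f}  d_pit={cur[1] - prev[1]:+.3f}")
    prev = cur
```

The printed deltas show d_proxy and d_pit rising together at first and then diverging, which is the sign flip I have in mind.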
If the agent wants to avoid any possibility of Goodharting, it has to stop optimizing before even clearing the first obstacle in the game.
(I may be misunderstanding some things about how the math works.)
I think you’re (at least partly) right in spirit. The key extra nuance is that by constraining the ‘angle’ between the reward functions[1], you can rule out very opposed utilities like the one that rewards falling in a pit. So this is not true:

> to avoid Goodharting it has to consider every possible reward function that is improved by the first few bits of optimization pressure on the proxy objective.
In particular, I think you’re imagining gradients in policy space (indeed a practical consideration). But this paper is considering gradients in occupancy space (which in practice bakes in some assumptions about foresight etc.).
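To illustrate the difference, here is a minimal sketch under my reading of the paper, with a made-up 5-cell level: in occupancy space the expected return is linear, J(μ) = ⟨r, μ⟩, so the ‘gradient’ of each reward function is just its reward vector, and the angle between two such gradients is fixed, independent of the current policy:

```python
import numpy as np

# Rewards as vectors over (state, action) pairs for a made-up 5-cell
# level with actions {left, right}. Expected return is linear in the
# occupancy measure mu: J(mu) = <r, mu>, hence grad_mu J = r, and the
# angle between the "gradients" never changes as the policy is optimized.
n_states, n_actions = 5, 2
idx = lambda s, a: s * n_actions + a   # flatten (state, action)

r_proxy = np.zeros(n_states * n_actions)
for s in range(n_states):
    r_proxy[idx(s, 0)] = -1.0          # -1 for going left
    r_proxy[idx(s, 1)] = +1.0          # +1 for going right
r_pit = np.zeros(n_states * n_actions)
r_pit[idx(2, 1)] = 1.0                 # +1 for stepping right into the pit

cos = r_proxy @ r_pit / (np.linalg.norm(r_proxy) * np.linalg.norm(r_pit))
print(f"cosine: {cos:.3f}, angle: {np.degrees(np.arccos(cos)):.1f} degrees")
# -> cosine: 0.316, angle: 71.6 degrees
```

So if the assumed bound on the angle between the proxy and the true reward is tighter than ~72°, this pit reward is simply outside the plausible set, and (unlike policy-space gradient angles) that fact doesn’t shift as training progresses.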
> How?

Yes, this is a pretty big question (there are some theoretical and empirical ideas, but I don’t rate any of them yet, personally).