When the actor is a delivery robot, I think its output channel is ill-suited to gaming the judge. I mean, maybe it could write a convincing argument out on the sidewalk in theory, but there's no curriculum to get there. Or, in evolutionary terms, there's no variance to be selected on.
When the actor is an LLM, or a world model in general, it's much better positioned to game the judge. Here I'd expect Goodhart's law to bite: yes, LLMs are good at detecting subtle signals, but they can also be steered by those same subtle signals in ways humans never intended.
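Here's a toy sketch of the failure mode I have in mind (my own illustration, nothing from a real system; the "flattery" signal and the effort-budget setup are made-up stand-ins). The judge's score is a proxy that tracks real substance but is also nudged upward by a subtle signal it can't tell apart from quality, and cranking up selection pressure makes the proxy and the true target come apart:

```python
import random

random.seed(0)

def true_quality(output):
    return output["substance"]

def judge_score(output):
    # Proxy reward: mostly substance, plus an exploitable subtle signal.
    return output["substance"] + 2.0 * output["flattery"]

def random_output():
    # An actor with a fixed effort budget, split between substance and gaming.
    frac = random.random()
    return {"substance": 1.0 - frac, "flattery": frac}

# Crank up selection pressure (best-of-N under the judge) and watch the
# judge's score climb while true quality collapses toward zero.
for n in [1, 10, 100, 1000]:
    best = max((random_output() for _ in range(n)), key=judge_score)
    print(f"N={n:>4}  judge={judge_score(best):.2f}  true={true_quality(best):.2f}")
```

Nothing deep here, just the standard Goodhart picture: the harder you argmax against the proxy, the more the exploitable signal dominates what gets selected.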
How do we beat Goodhart's law here? One angle is to say the AI is being too restricted: it's unfair to lock in some unsafe RL target and then ask self-supervision to make it safe; instead, the AI should be able to take actions that modify the learning process, including the reward function, to make it less bad. I think this is an eventually-good answer with unsolved problems: it requires even more trust that the AI is doing the right thing, and to build that trust I'd probably want a clearer picture of how we'd want an aligned AI to do learning anyhow.
Another angle is the selection-vs-control distinction. Rather than using our judge to search over a bunch of candidate updates, which seems like it will end up getting gamed, is there some way to use the judge as part of a system that finds updates in a more control-y way? (A toy sketch of the contrast is below.) This is obviously kind of crazy, because of the bitter lesson: the GOFAI dream of an AI made of understandable code that it can rewrite to improve itself is far from our reality. But maybe there are neuro-inspired algorithms that have good properties while still leveraging learned models?
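To make the contrast concrete, here's a hypothetical sketch under toy assumptions. The `judge` function, the parameter shapes, and both update rules are illustrative stand-ins, not a proposal; the point is only that selection instantiates many candidates and argmaxes against the judge, while control uses the judge as a local feedback signal for one small, bounded step:

```python
import numpy as np

rng = np.random.default_rng(0)

def judge(theta):
    # Stand-in for a learned judge scoring the actor's behavior.
    return -float(np.sum((theta - 1.0) ** 2))

def selection_update(theta, n_candidates=256, sigma=0.1):
    # SELECTION: sample many candidate updates and keep whichever the judge
    # ranks highest. This argmax pressure is exactly what invites gaming.
    candidates = theta + sigma * rng.standard_normal((n_candidates, theta.size))
    scores = [judge(c) for c in candidates]
    return candidates[int(np.argmax(scores))]

def control_update(theta, lr=0.05, eps=1e-3, max_step=0.01):
    # CONTROL: use the judge only as local feedback (a crude finite-difference
    # gradient) and take one small, clipped step, never searching over a
    # population of alternatives.
    grad = np.array([
        (judge(theta + eps * e) - judge(theta - eps * e)) / (2 * eps)
        for e in np.eye(theta.size)
    ])
    return theta + np.clip(lr * grad, -max_step, max_step)

theta = np.zeros(4)
for _ in range(100):
    theta = control_update(theta)
print(judge(theta))  # improves steadily without ever argmaxing over candidates
```

Whether the control-style rule actually resists gaming in realistic settings is exactly the open question; the sketch just shows that "use the judge" doesn't have to mean "search against the judge".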