Yeah, I agree there are similarities. I think a benefit of my approach, which I should have emphasized more, is that it’s reflectively stable (and theoretically simple, and therefore easy to analyze). In your description of an AI that wants to seek clarification, it isn’t clear that it won’t self-modify (though it’s hard to tell).
> There’s a general problem that people will want AGIs to find clever out-of-the-box solutions to problems, and there’s no principled distinction between “finding a clever out-of-the-box solution to a problem” and “Goodharting the problem specification”.
But there is a principled distinction. The distinction is whether the plan exploits differences between the goal specification and our actual goal. This is a structural difference, and we can detect it using information about our actual goal.
> So systems that systematically block the second thing are inevitably gonna systematically block the first thing, and I claim that your proposal here is no exception.
My proposal is usually an exception to this, because it takes advantage of the structural difference between the two cases. The trick is that the validation set only contains things that we actually want. If it were to contain extra constraints beyond what we actually want, then yeah that creates an alignment tax.
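To gesture at what I mean by the validation set here, a rough sketch (purely illustrative; the names `screen_plan`, `proxy_score`, and `validation_checks` are invented for this example, not an actual implementation): a plan has to do well on the proxy that was optimized against, and also pass every held-out check we actually care about, where those checks were never optimized against.

```python
# Purely illustrative sketch, not an actual implementation.
# The idea: the proxy objective is what gets optimized against; the
# validation checks are held out, never optimized against, and contain
# only things we actually want.
from typing import Callable, Iterable

Plan = str  # stand-in for whatever representation a plan actually has


def screen_plan(
    plan: Plan,
    proxy_score: Callable[[Plan], float],
    validation_checks: Iterable[Callable[[Plan], bool]],
    proxy_threshold: float,
) -> bool:
    """Accept a plan only if it does well on the proxy AND passes every
    held-out check.

    A genuinely clever out-of-the-box plan passes both; a plan that merely
    exploits gaps between the proxy specification and what we actually want
    is expected to fail at least one held-out check.
    """
    if proxy_score(plan) < proxy_threshold:
        return False
    return all(check(plan) for check in validation_checks)
```

The structural asymmetry lives in the fact that the checks are both held out and limited to things we actually want, so blocking plans that fail them doesn't also block the clever-but-legitimate ones.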