This seems interesting. I’d qualify it, though, as not solving Goodhart’s Curse but limiting the damage it does. There will still be non-zero Goodharting, but hopefully not enough for the Goodharting itself to produce bad outcomes (we may still get bad outcomes for other reasons).
There’s an additional practical challenge with this approach (which I think the original quantilizer approach shares): it will always be tempting to get additional gains by letting an agent optimize a little more than it currently does. Each increment of extra optimization won’t do much damage on its own, and by the time we realize we’ve crossed the threshold where optimization produces harmful Goodharting, it will be too late. So even if this works in theory, we also need to address the practical problem of coordinating to limit the optimization power we build into AI.
(This second point is not meant as an objection or complaint; it’s more thinking out loud about future research directions once we’ve established whether this approach is sufficient.)
Good point: policies with upward errors will still be preferentially selected for (a little). However, with this approach the amount of Goodharting should stay constant as the proxy quality (and hence optimization power) scales up.
I agree with your second point, although I think there’s a slight benefit over the original quantilizer because q is set by the theory rather than chosen arbitrarily by hand. Hopefully that makes it less tempting to mess with.
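For readers unfamiliar with the q parameter being discussed: below is a minimal, illustrative sketch of a q-quantilizer, which picks uniformly from the top q-fraction of actions (ranked by a proxy utility) under some base distribution, rather than maximizing the proxy outright. The `proxy_utility` and base distribution here are placeholder assumptions for illustration, not anything proposed in this thread.

```python
import numpy as np

def quantilize(base_samples, proxy_utility, q):
    """Draw one action uniformly from the top q-fraction of base_samples,
    as ranked by proxy_utility (a q-quantilizer in the Taylor sense)."""
    scores = np.array([proxy_utility(a) for a in base_samples])
    cutoff = np.quantile(scores, 1 - q)  # proxy score at the (1 - q) quantile
    top = [a for a, s in zip(base_samples, scores) if s >= cutoff]
    return top[np.random.randint(len(top))]  # uniform over the top q-fraction

# Illustrative usage: actions are scalars from a base distribution, and the
# proxy adds noise, standing in for the "upward errors" mentioned above.
base = np.random.normal(0.0, 1.0, size=10_000)
action = quantilize(base,
                    proxy_utility=lambda a: a + np.random.normal(0.0, 0.1),
                    q=0.05)
```

The point at issue in the thread is only where q comes from: chosen by hand in the original proposal versus derived from the theory here.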