Good point: policies with upward errors will still be preferentially selected for (a little). However, with this approach, the amount of Goodharting should stay constant as the proxy quality (and hence optimization power) scales up.
I agree with your second point, though I think there's a slight benefit over the original quantilizers: q is set theoretically rather than arbitrarily by hand, which hopefully makes it less tempting to mess with.
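For reference, here's a minimal sketch of what an original (hand-set-q) quantilizer does over a finite action set, since that's the baseline being compared against. The function names and the finite-action setting are illustrative assumptions, and q here is the hand-picked knob the comment refers to, not the theoretically derived one:

```python
import random

def quantilize(actions, proxy_utility, base_dist, q):
    """Sample from the base distribution conditioned on landing in the
    top q-fraction (by base-distribution mass) of proxy utility,
    instead of argmaxing the proxy.

    actions: list of candidate actions (illustrative finite setting)
    proxy_utility: action -> float (the imperfect proxy)
    base_dist: action -> float (probability under a trusted base policy)
    q: quantile in (0, 1]; hand-picked in the original quantilizer
    """
    # Rank actions by proxy value, best first.
    ranked = sorted(actions, key=proxy_utility, reverse=True)
    # Accumulate base-distribution mass until the top q is covered.
    top, mass = [], 0.0
    for a in ranked:
        top.append(a)
        mass += base_dist(a)
        if mass >= q:
            break
    # Sample from the truncated distribution (renormalized by weights).
    weights = [base_dist(a) for a in top]
    return random.choices(top, weights=weights, k=1)[0]
```

Because the output is a sample from the (truncated) base distribution rather than an argmax, sharpening the proxy doesn't let the agent push arbitrarily far into the proxy's error tail, which is the bounded-Goodharting property at issue above.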