This is cruxy, because I don’t think that noise/lack of error-freeness in your observations alone leads to bribing surveyors unless we add additional assumptions about what that noise/error actually looks like.
(in particular, simple IID noise/quantum noise likely doesn’t lead to extremal Goodhart/bribing surveyors.)
More generally, the reason I maintain a distinction between these two failure modes of Goodharting, regressional and extremal Goodhart, is that they respond differently to decreasing the error.
I suspect that in the limit of 0 error, regressional Goodhart (noisy sensors leading to slight overspending on reducing mosquitos) vanishes, whereas extremal Goodhart (bribing surveyors) doesn’t vanish. More importantly, with regressional Goodhart the error of your sensors being bounded means there’s only a bounded error in how much you can regulate X, so the error can’t dominate, while extremal Goodhart like bribing surveyors can make the error dominate.
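To make the regressional case concrete, here’s a minimal toy simulation (my own sketch, not from the original discussion; the Gaussian setup and numbers are illustrative assumptions): we pick the option that scores best on a noisy proxy m(X) = X + noise and measure how much true value X we lose. The shortfall shrinks toward 0 as the noise does, which is exactly the behavior that an extremal failure like bribing surveyors wouldn’t share.

```python
# Toy sketch of regressional Goodhart: optimize a noisy proxy m(X) = X + noise
# and see how the true-value shortfall behaves as the noise shrinks.
import numpy as np

rng = np.random.default_rng(0)

def regressional_gap(noise_sd: float, n_options: int = 1000, trials: int = 2000) -> float:
    """Average shortfall in true value when selecting by the noisy proxy."""
    gaps = []
    for _ in range(trials):
        x = rng.normal(size=n_options)                        # true values X
        m = x + rng.normal(scale=noise_sd, size=n_options)    # proxy m(X) with IID noise
        gaps.append(x.max() - x[m.argmax()])                  # loss from picking by proxy
    return float(np.mean(gaps))

for sd in [1.0, 0.3, 0.1, 0.0]:
    print(f"noise_sd={sd}: average true-value shortfall = {regressional_gap(sd):.3f}")
# The shortfall goes to 0 as noise_sd -> 0; an option that games the proxy itself
# (extremal Goodhart, e.g. bribing surveyors) would not be fixed by reducing sensor noise.
```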
So I basically disagree with this statement:
Goodharting is robust. That is, the mechanism of Goodharting seems impossible to overcome. Goodharting is just a fact of any control system.
(Late comment here.) Can you explain where there is an error term in AlphaGo, or where an error term might appear in a hypothetical model similar to AlphaGo trained much longer with many more parameters and computational resources?
AlphaGo is fairly constrained in what it’s designed to optimize for, but it still has the standard failure mode of “things we forgot to encode”. So, for example, AlphaGo could suffer the error of instrumental power-grabbing in order to get better at winning Go, because we misspecified what we asked it to measure. This is a kind of failure introduced into the system by humans failing to make m(X) adequately evaluate X as we intended: we cared about winning Go games while also minimizing side effects, but maybe when we constructed m(X) we forgot about minimizing side effects.
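In code terms, a hypothetical sketch of that misspecification (the policies, names, and numbers below are purely illustrative assumptions of mine, not anything AlphaGo actually does): the measure m(X) we hand the optimizer only scores wins, while the value we intended also penalized side effects.

```python
# Hypothetical illustration of "things we forgot to encode": the encoded
# measure m(X) rewards only winning, while the intended value also penalized
# side effects such as grabbing extra resources to play better Go.
from dataclasses import dataclass

@dataclass
class Policy:
    name: str
    win_rate: float      # how well it wins Go games
    side_effects: float  # e.g. resources grabbed to get better at Go

def m(policy: Policy) -> float:
    """The measure we actually encoded: only rewards winning."""
    return policy.win_rate

def intended_value(policy: Policy) -> float:
    """What we meant: win, but without large side effects."""
    return policy.win_rate - 10.0 * policy.side_effects

policies = [
    Policy("just plays Go", win_rate=0.95, side_effects=0.0),
    Policy("grabs compute to play better Go", win_rate=0.99, side_effects=0.5),
]

print("optimizer picks:", max(policies, key=m).name)              # the power-grabbing policy
print("we wanted:", max(policies, key=intended_value).name)       # the benign policy
```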
“Error” here is all sources of error, not just error in the measurement equipment. So bribing surveyors is a kind of error in my model.