Does this mean that you expect we will be able to build advanced AI that doesn’t become an expected utility maximizer?
When talking about whether some physical system “is a utility maximizer”, the key questions are “utility over what variables?”, “in what model do those variables live?”, and “with respect to what measuring stick?”. My guess is that a corrigible AI will be a utility maximizer over something, but maybe not over the AI-operator interface itself? I’m still highly uncertain what that type-signature will look like, but there are a lot of degrees of freedom to work with.
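(As one purely illustrative way to cash out those three questions, not a settled framework, and with notation of my own choosing:

$$a^* \;=\; \arg\max_{a}\; \mathbb{E}_{X \sim P_M(\cdot \mid a)}\big[\,U(X)\,\big]$$

Here $U$ is a utility function over the variables $X$, answering “utility over what variables?”; $M$ is the model in which those variables live, answering “in what model?”; and the expectation under $P_M$ is one candidate answer to “with respect to what measuring stick?”. A corrigible AI might maximize some such $U$ where $X$ ranges over latents in its world model rather than over states of the AI-operator interface.)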
Do you think the current interpretability approaches will basically get us there, or will we need qualitatively different methods?
We’ll need qualitatively different methods. But that’s not new; interpretability researchers already come up with qualitatively new methods pretty regularly.