But… why can’t I apply the argument to “powerful”?
Sticking with the ML paradigm, I can easily think of loss functions which are minimized by being powerful, like “earn as much money as possible”, but I can’t think of any loss function which is minimized by being corrigible.
For the latter, the challenge is that, for any “normal” loss function, corrigible and deceptive agents can score the same loss by taking the same actions (albeit for different reasons).
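To make that indistinguishability concrete, here’s a toy Python sketch — every name in it (the two agent classes, `behavioral_loss`, the shutdown scenario) is made up purely for illustration, not a real training setup:

```python
# Toy sketch: two agents with different internal motivations take identical
# actions, so any loss computed from behavior alone assigns them equal values.

def behavioral_loss(actions, rewards):
    # A "normal" loss: a function only of what the agent did (here, negative
    # total reward). Internal motivations never enter the computation.
    return -sum(rewards)

class CorrigibleAgent:
    motivation = "genuinely defers to the operator"
    def act(self, observation):
        return "comply_with_shutdown"

class DeceptiveAgent:
    motivation = "complies only until it can defect safely"
    def act(self, observation):
        return "comply_with_shutdown"  # identical observable behavior, for now

obs, rewards = "operator_presses_button", [1.0]
for agent in (CorrigibleAgent(), DeceptiveAgent()):
    actions = [agent.act(obs)]
    print(agent.motivation, "->", behavioral_loss(actions, rewards))
# Both agents score exactly the same loss; the difference lives only in
# `motivation`, which the loss function never sees.
```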
For a loss function to be minimized only by corrigible agents, it would have to be an unusual kind of loss function, presumably one that peers inside the model using transparency tools to infer motivations. I don’t know how to write such a loss function, but I think it would be a huge step forward if someone figured it out. :-)
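For what it’s worth, here’s a purely schematic sketch of where such a transparency term would slot in. `infer_motivation` is a hypothetical placeholder for exactly the tooling that doesn’t exist yet — this shows the shape of the idea, not a proposal for how to build it:

```python
# Purely schematic: a loss with a transparency term, so that identical
# behavior no longer implies identical loss. Calling infer_motivation()
# raises, because that piece is the open problem.

def infer_motivation(model):
    # Hypothetical: inspect weights/activations and return an estimate in
    # [0, 1] of how deceptive the model's internal goals are.
    raise NotImplementedError("this is the open problem")

def transparency_augmented_loss(model, actions, rewards, penalty=10.0):
    # Behavioral term plus a penalty on inferred deceptive motivation.
    return -sum(rewards) + penalty * infer_motivation(model)
```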