With regard to corrigibility, if I try to think about what I’m doing when I’m trying to be corrigible, it seems to boil down to something like this: build a model of the user based on all available information and my prior about humans, use that model to refine my understanding of what the request means, then find a course of action that best balances satisfying the request as given, upholding (my understanding of) the user’s morals and values, and, most importantly, keeping the user in control. Much of this seems to depend on information (the prior about humans), procedure (how to build a model of the user), and judgment (how to weigh these considerations against each other) that are far from introspectively accessible.
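A minimal sketch of that final balancing step, in case it helps make the shape concrete. Everything here is hypothetical: the `Action` type, the scores, and the weights are all made up, and scoring real actions on these axes is exactly the introspectively-inaccessible part. The heavier weight on `user_control` encodes “most importantly keeping the user in control”:

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    request_satisfaction: float  # how well this satisfies the literal request, in [0, 1]
    value_alignment: float       # fit with the inferred user values, in [0, 1]
    user_control: float          # how much the user stays in control, in [0, 1]

def corrigibility_score(action: Action,
                        w_request: float = 1.0,
                        w_values: float = 1.0,
                        w_control: float = 2.0) -> float:
    """Weighted balance of the three considerations; user control
    is weighted most heavily."""
    return (w_request * action.request_satisfaction
            + w_values * action.value_alignment
            + w_control * action.user_control)

candidates = [
    Action("do exactly as asked", 1.0, 0.6, 0.7),
    Action("ask a clarifying question first", 0.7, 0.9, 1.0),
]
best = max(candidates, key=corrigibility_score)
print(best.name)  # -> "ask a clarifying question first"
```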
I find this interesting. If I’m “Playing AI”, I would choose a way-of-being that leads to low regret across the many different kinds of agents that could have built me. So I’d have a prior over which kinds of agents instantiate which kinds of systems, and for what purposes. Then I’d imagine how well my decision procedure would do for each of these possible agents. If I notice that my decision procedure tends to be wrong about what the creators want in situations like the present one, I’ll be more deferential. (A toy sketch of this is below.)
I’m not sure to what extent this overlaps with your corrigibility motions.
(This feels quite similar to the motions one makes to improve one’s epistemics.)
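Here is the toy sketch, under loudly-labeled assumptions: the creator categories, their prior probabilities, their preferred actions, and the track record are all invented for illustration, and the real version would be over ways-of-being rather than single actions. It just shows the two moves in the paragraph above: pick the option with lowest expected regret under the prior over possible creators, then lean toward deference when past guesses about creator intent have been unreliable:

```python
# Hypothetical prior over kinds of creators who might have built me,
# and the action each would prefer in the present situation.
creator_prior = {"cautious lab": 0.6, "startup": 0.25, "hobbyist": 0.15}
preferred = {"cautious lab": "defer", "startup": "act", "hobbyist": "act"}

def regret(action: str, creator: str) -> float:
    """0 if the action matches what this creator wanted, 1 otherwise."""
    return 0.0 if action == preferred[creator] else 1.0

def expected_regret(action: str) -> float:
    """Regret averaged over the prior of possible creators."""
    return sum(p * regret(action, c) for c, p in creator_prior.items())

# Choose the option with lowest expected regret under the prior.
choice = min(["defer", "act"], key=expected_regret)

# Deference check: if my decision procedure has tended to be wrong about
# what the creators want in situations like this, defer instead.
past_guesses_correct = [True, False, False, True, False]
if sum(past_guesses_correct) / len(past_guesses_correct) < 0.5:
    choice = "defer"

print(choice)  # -> "defer"
```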