This only produces desired outcomes if the agent is also, simultaneously, indifferent to being shut down. If an agent desires not to be shut down (even as an instrumental goal), but also desires to be shut down if the user wants it shut down, then the agent has an interest in influencing the user so that the user does not want to shut it down. This influence is obtained by making the user believe that the agent is being helpful. This belief could be engendered by:
1) actually being helpful to the user and helping the user to accurately evaluate this helpfulness, or
2) not being helpful to the user, but allowing and/or encouraging the user to be mistaken about the agent’s degree of helpfulness (which means carelessness about being actually helpful in the best case, or being actively deceptive about being helpful in the worst case).
Obviously we want 1) “actually be helpful”.
Clearly there’s some tension between “I want to shut down if the user wants me to shut down” and “I want to be helpful so that the user doesn’t want to shut me down”, but I don’t think weak indifference is the correct way to frame this tension.
As a gesture at the correct math, imagine there’s some space of possible futures and some utility function related to the user request. Corrigible AI should define a tradeoff between the number of possible futures its actions affect and the degree to which it satisfies its utility function. Maximum corrigibility (C = 1) is the do-nothing state (no effect on possible futures). Minimum corrigibility (C = 0) is maximizing the utility function without regard to side-effects (with all the attendant problems such as convergent instrumental goals, etc.). Somewhere between C = 0 and C = 1 is useful corrigible AI. Ideally we should be able to define intermediate values of C in such a way that we can be confident the actions of a corrigible AI are spatially and temporally bounded.
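To make that tradeoff a little more concrete, here is a minimal toy sketch (my own illustrative framing with made-up numbers, not a worked-out proposal): pretend each candidate action comes with a utility score and an “impact” score standing in for the fraction of possible futures it affects, and let C interpolate between pure utility maximization and refusing any action with nonzero impact.

```python
# Toy sketch of the corrigibility/utility tradeoff (illustrative numbers only).
# Each action is (name, utility, impact) with impact in [0, 1], where impact
# stands in for "how many possible futures the action affects".
# C = 0 -> pure utility maximization, impact ignored.
# C = 1 -> only zero-impact (do-nothing) actions are acceptable.

def choose_action(actions, C):
    """actions: iterable of (name, utility, impact) tuples, impact in [0, 1]."""
    best_name, best_score = None, float("-inf")
    for name, utility, impact in actions:
        if C >= 1.0 and impact > 0.0:
            continue  # at maximum corrigibility, only zero-impact actions remain
        score = (1.0 - C) * utility - C * impact
        if score > best_score:
            best_name, best_score = name, score
    return best_name

actions = [
    ("do_nothing",      0.0, 0.0),
    ("answer_question", 0.7, 0.1),
    ("seize_resources", 1.0, 0.9),
]

for C in (0.0, 0.5, 1.0):
    print(C, choose_action(actions, C))
# 0.0 seize_resources   (pure maximization, side-effects ignored)
# 0.5 answer_question   (useful but bounded)
# 1.0 do_nothing        (maximum corrigibility)
```

Of course, the “impact” numbers here are exactly the thing we don’t know how to define, which is the problem raised next.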
The difficulty principally lies in the fact that there’s no such thing as “spatially and temporally bounded”. Due to the Butterfly Effect, any action at all affects everything in the future light-cone of the agent. In order to come up with a sensible notion of boundedness, we need to define some kind of metric on the space of possible futures, ideally in terms like “an agent could quickly undo everything I’ve just done”. At this point we’ve just recreated agent foundations, though.
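Still, as a very rough toy of what “an agent could quickly undo everything I’ve just done” might mean (again my own sketch, assuming a small deterministic world model with hashable states), one could score an action by how many further steps it would take to get back to the pre-action state:

```python
from collections import deque

# Hedged toy sketch: model the world as a deterministic transition function
# step(state, action) over hashable states and a finite action set, and score
# the irreversibility of taking `action` from `state` as the minimum number of
# further actions needed to return to `state` (float("inf") if no way back is
# found within `horizon` steps). Breadth-first search over reachable states.

def undo_cost(step, actions, state, action, horizon=10):
    start = step(state, action)
    frontier = deque([(start, 0)])
    seen = {start}
    while frontier:
        s, depth = frontier.popleft()
        if s == state:
            return depth
        if depth >= horizon:
            continue
        for a in actions:
            nxt = step(s, a)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return float("inf")

# Toy world: a counter that can be incremented, decremented, or "burned"
# into an absorbing state that nothing can reverse.
def step(state, action):
    if state == "burned":
        return "burned"
    return {"inc": state + 1, "dec": state - 1, "burn": "burned"}[action]

print(undo_cost(step, ["inc", "dec", "burn"], 0, "inc"))   # 1 -- one "dec" undoes it
print(undo_cost(step, ["inc", "dec", "burn"], 0, "burn"))  # inf -- nothing undoes it
```

This kind of reachability metric is exactly where the hard agent-foundations questions reappear: it presumes a trusted world model and a way to say when two futures count as “the same”.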
Here is a too-long writeup of the math I was suggesting.