In button-corrigibility, the main safety concern is quite different: we want to prevent the agent from taking actions that manipulate human values or human actions in some bad way, where this manipulation creates conditions that make it easier for the agent to obtain or preserve a high utility score.
I generally think of solving this problem via ensuring your agent is myopic rather than via having a specific ritual which the agent is indifferent to, as you describe. It seems to me like even if you could make the agent indifferent to the shutdown button, there would still be other ways for it to influence you into giving it higher reward. Thus, it seems like what you really want here is some guarantee that the agent is only considering its per-episode reward and doesn’t care at all about its cross-episode reward, which is the condition I call myopia. That being said, how to reliably produce myopic agents is still a very open problem.
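The per-episode versus cross-episode distinction can be made concrete with a toy sketch. Everything here is hypothetical (the action names and reward numbers are invented for illustration, not taken from the discussion): each action carries some reward inside the current episode and some reward that only pays off in later episodes, and a myopic agent simply ignores the latter.

```python
# Toy illustration (hypothetical actions and rewards): each action has a
# per-episode reward and a cross-episode reward that only pays off later.
ACTIONS = {
    "do_task":    {"per_episode": 1.0, "cross_episode": 0.0},
    # Manipulating the overseer pays off only in *future* episodes.
    "manipulate": {"per_episode": 0.2, "cross_episode": 5.0},
}

def pick_action(myopic: bool) -> str:
    def score(action: str) -> float:
        r = ACTIONS[action]
        # A myopic agent scores only its per-episode reward;
        # a non-myopic agent also counts the cross-episode reward.
        return r["per_episode"] + (0.0 if myopic else r["cross_episode"])
    return max(ACTIONS, key=score)

print(pick_action(myopic=True))   # the myopic agent just does the task
print(pick_action(myopic=False))  # the non-myopic agent manipulates
```

In this toy setting, the manipulative action is only attractive because of its cross-episode payoff, so zeroing out that term is exactly what removes the incentive.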
It seems to me like even if you could make the agent indifferent to the shutdown button, there would still be other ways for it to influence you into giving it higher reward.
Yes, this is true: the fundamental job of the button-corrigible agent is still to search over all actions, and find those actions that best maximise the reward computed by the utility function. As mentioned in the paper, one such action may be to start a hobby club, where humans come together to have fun and build petrol engines that will go into petrol cars. Starting a hobby club definitely influences humans, but not in a bad way.
Button-corrigibility is intended to support a process where bad influencing behaviour can be corrected when it is identified, without the agent fighting back. The assumption is that we are not yet smart enough to construct an agent that is perfect on such metrics when we first start it up.
On myopia: this is definitely a technique that might make agents safer, though I would worry if it implies that an agent is incapable of considering long-term effects when it chooses actions: that would make the agent unsafe in many cases.
Short term, I think myopia is a useful safety technique for many types of agents. Long term, if we have agents that can modify themselves or build new sub-agents, myopia has a big open problem: how do we ensure that any successor agents have exactly the same myopia? Button-corrigibility avoids having to solve this preservation-of-ignorance problem: it also works for agents and successor agents that are maximally perceptive and intelligent, because it modifies the agent's goals, not its perception or reasoning capability. The successor-agent-building problem does not go away completely, though: button-corrigibility still has some hairy problems with agent modification and sub-agents, but overall I have found these to be more tractable than the problem of ignorance preservation.
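The goals-versus-ignorance contrast can be sketched in a toy model. All designs, names, and payoffs below are hypothetical: the point is only that a safety property written into the utility function is inherited by any search that maximises that utility, while safety produced by not looking at cross-episode payoffs gives the parent agent no reason to prefer a safe successor over an unsafe one.

```python
# Toy sketch (hypothetical designs and payoffs): a parent agent choosing
# between two candidate successor designs.
SUCCESSORS = {
    "obedient":     {"task": 1.0, "manipulation": 0.0},
    "manipulative": {"task": 1.0, "manipulation": 5.0},
}

def corrected_utility(design: str) -> float:
    # Goal modification: bad influencing is penalised inside the utility
    # function itself, so any successor search that maximises this
    # utility inherits the penalty.
    d = SUCCESSORS[design]
    return d["task"] - d["manipulation"]

def myopic_score(design: str) -> float:
    # Myopia as ignorance: cross-episode payoffs are simply not examined,
    # so both designs look equally good to the myopic parent.
    return SUCCESSORS[design]["task"]

assert max(SUCCESSORS, key=corrected_utility) == "obedient"
# The myopic parent is indifferent between the two successor designs:
assert myopic_score("obedient") == myopic_score("manipulative")
```

The indifference in the last line is the worry: nothing in the myopic parent's evaluation pushes it away from building a manipulative successor, whereas the corrected utility actively penalises that design.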