My understanding: The agent has a utility function made from two utility functions, one of which (“normal”) assigns (1, 0, 2) to the three actions, while the other (“shutdown”) assigns (e.g.) (−1, 1, −1). The way the indifference works, before the button is pressed the agent acts as if it were a normal-utility agent that believed the button wouldn’t be pressed. Of course, this means the agent chooses C—but I don’t think this is what you meant, because in real life pushing the button has some cost and is only worth it if it averts your shutdown so you can go do utility-increasing things later. But we should still have option-set independence, which provably holds simply because the agent acts like an expected utility maximizer with a weird belief.
I might be missing something—the main place I worry about missing something is whether Armstrong’s setup calculates the adjustment term in an option-dependent way. The adjustment term is E[U_N(a) | not pressed] − E[U_S(a) | pressed], and if there are multiple timesteps then computing it requires reasoning about what the agent will do later (unless you take the actions to be policies). I think the option-set independence should still ‘bubble up’ from the nodes right before the terminal ones to the current decision?
(Just to make sure we’re on the same page: I’m talking about the utility function in equation 11 of https://intelligence.org/files/Corrigibility.pdf or equations 4–5 of https://cdn.aaai.org/ocs/ws/ws0119/10183-45890-1-PB.pdf , which both define, roughly, U(a) = U_N(a) if the button is not pressed, else U_S(a) + correction(a).)
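To make the indifference concrete, here is a minimal sketch of the construction with the toy numbers above: U_N = (1, 0, 2), U_S = (−1, 1, −1) over actions A/B/C, plus some made-up per-action press probabilities. The press probabilities and the deterministic-outcome simplification are my assumptions for illustration; in the papers the correction term is a conditional expectation.

```python
# Toy illustration of the utility-indifference correction term.
# U_N and U_S are the example values from the discussion; P_press values
# are hypothetical, chosen only to show the cancellation is probability-independent.

U_N = {"A": 1.0, "B": 0.0, "C": 2.0}      # "normal" utility if button not pressed
U_S = {"A": -1.0, "B": 1.0, "C": -1.0}    # "shutdown" utility if button pressed
P_press = {"A": 0.1, "B": 0.5, "C": 0.9}  # hypothetical press probabilities

def correction(a):
    # f(a) = E[U_N(a) | not pressed] - E[U_S(a) | pressed].
    # Outcomes here are deterministic given press/no-press, so the conditional
    # expectations reduce to table lookups.
    return U_N[a] - U_S[a]

def combined_utility(a):
    # E[U(a)] = P(not pressed) * U_N(a) + P(pressed) * (U_S(a) + f(a))
    p = P_press[a]
    return (1 - p) * U_N[a] + p * (U_S[a] + correction(a))

for a in "ABC":
    # The correction cancels the press branch: combined utility equals U_N(a),
    # regardless of the press probability.
    assert abs(combined_utility(a) - U_N[a]) < 1e-12

best = max("ABC", key=combined_utility)
print(best)  # -> C
```

The agent maximizing the combined utility picks exactly the action a normal-utility agent would pick if it were certain the button won’t be pressed (here, C), which is the “weird belief” behavior described above; and since that equivalence holds per-action, adding or removing options doesn’t change the ranking of the remaining ones.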