An example would be:
Option A: avoid shutdown and get utility 1.
Option B: get shut down now.
Option C: avoid shutdown and get utility 2.
If the agent’s option set is {A, B}, then the agent is indifferent between A and B. But if the agent’s option set is {A, B, C}, then the agent is indifferent between B and C, and hence prefers B to A.
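Here's a toy sketch of that calculation (my reading of the correction term, using the example's utilities; the key assumption is that the "not pressed" expectation is the value of the best non-press option, which is exactly what makes it option-set dependent):

```python
# Toy sketch of the utility-indifference correction on this example.
# Assumption: choosing B presses the button; the correction theta is
# E[U_N | not pressed] - E[U_S | pressed], read as "best available
# normal-branch utility minus the shutdown utility of the press action".
U_N = {"A": 1, "B": 0, "C": 2}    # "normal" utility
U_S = {"A": -1, "B": 1, "C": -1}  # "shutdown" utility

def corrected_utility(action, options):
    pressed = (action == "B")  # only B leads to shutdown here
    if not pressed:
        return U_N[action]
    theta = max(U_N[a] for a in options if a != "B") - U_S[action]
    return U_S[action] + theta

print({a: corrected_utility(a, ["A", "B"]) for a in ["A", "B"]})
# -> {'A': 1, 'B': 1}: indifferent between A and B
print({a: corrected_utility(a, ["A", "B", "C"]) for a in ["A", "B", "C"]})
# -> {'A': 1, 'B': 2, 'C': 2}: indifferent between B and C, B preferred to A
```

So under this reading, adding C to the option set raises the corrected value of B from 1 to 2, which is the preference reversal described above.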
Does that help? I don’t quite understand your point about “actions chosen before the shutdown are just like those of one maximizing the still-on utility,” so I might be missing something.
My understanding: the agent's utility function is built from two utility functions. The "normal" one assigns (A, B, C) utilities (1, 0, 2), while the "shutdown" one assigns (e.g.) (−1, 1, −1). The way the indifference works, before the button is pressed the agent acts as if it were a normal-utility agent that believed the button would never be pressed. Of course, this means the agent chooses C. I don't think that's what you meant, though, because in real life pushing the button has some cost and is only worth it if it averts your shutdown so you can go do utility-increasing things later. But we should still have option-set independence; it holds provably, simply because the agent acts like an expected utility maximizer with a weird belief.
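A minimal sketch of that "weird belief" reading (the function name and the P(pressed) = 0 assumption are mine): an agent that is certain the button won't be pressed just maximizes the normal utility over whatever options it has, so its choice is the ordinary argmax of U_N and is option-set independent in the usual (IIA) sense.

```python
# Sketch: before the press, an agent with belief P(pressed) = 0 ignores
# the shutdown utility entirely and maximizes the normal utility U_N.
U_N = {"A": 1, "B": 0, "C": 2}  # the example's normal utilities

def weird_belief_choice(options):
    # with P(pressed) = 0, the shutdown branch has probability zero
    return max(options, key=lambda a: U_N[a])

print(weird_belief_choice(["A", "B"]))       # prints A
print(weird_belief_choice(["A", "B", "C"]))  # prints C
```

On this reading, whether C is chosen depends only on whether C is available, never on which other options accompany it.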
I might be missing something; the main place I worry about that is whether Armstrong's setup calculates the adjustment term in an option-dependent way. The adjustment term is E[U_N(a) | not pressed] − E[U_S(a) | pressed], and if there are multiple timesteps this requires reasoning about what the agent will do later (unless the actions are whole policies). I think option-set independence should still 'bubble up' from the nodes just before the terminal nodes to the current decision?
(Just to make sure we're on the same page: I'm talking about the utility function in equation 11 of https://intelligence.org/files/Corrigibility.pdf or equations 4–5 of https://cdn.aaai.org/ocs/ws/ws0119/10183-45890-1-PB.pdf , which define, roughly, U(a) = U_N(a) if the button is not pressed, else U_S(a) + correction(a).)
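In symbols, my reading of that definition (the papers write it slightly differently, and I'm folding the adjustment term from the previous paragraph into θ):

```latex
U(a) =
\begin{cases}
  U_N(a) & \text{if the button is not pressed} \\
  U_S(a) + \theta(a) & \text{if the button is pressed}
\end{cases}
\qquad
\theta(a) = \mathbb{E}\!\left[U_N(a) \mid \text{not pressed}\right]
          - \mathbb{E}\!\left[U_S(a) \mid \text{pressed}\right]
```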