Elliott Thornley (EJT) comments on Shutdownable Agents through POST-Agency

Elliott Thornley (EJT) 16 Nov 2025 22:27 UTC
2 points
0
With the Work or Steal example specifically, we want the agent to choose Work because that’s best for the user’s expected bank balance. With the drones example, things are different. POST-agents will start building the drone if that has highest EV, taking the probability distribution over trajectory-lengths as fixed. That’s a downside, but it’s minor so long as the POST-agent builds the drone in a cheap, overt way, because then we can just notice and shut the agent down.
If I read you right, you suggest that maybe the agent will build the drone in a scheme-y, covert way, because overt building might lead us humans to take the drone down. But covert building in this situation seems fairly unlikely to me. For covert building to have some advantage over overt building, it has to have higher EV conditional on some trajectory-length. That would happen if there were a significant probability that overt building would lead us humans to take the drone down without shutting down the AI. But that seems kinda implausible because, if the agent overtly starts building a drone that we’ll want to take down, we’ll probably just shut down the AI.
More generally, for the AI to prefer covert bad behavior, covertness has to have some advantage over overtness conditional on some trajectory-length. And—I think—it will only have that advantage if overt bad behavior would lead us humans to fight the AI but not it shut it down. But that seems unlikely. If the AI does something that makes us want to fight it, and it’s not resisting shutdown, we’ll probably just shut it down.
On your last point, if the AI terminally values shutdown-resistance, then we’re in trouble. I think plausibly impediment-avoidance would generalize to shutdown-resistance if we weren’t training the AI to have any attitudes to shutdown elsewhere in the training process. But I think if we’re training the agent to satisfy POST and Neutrality+ then I expect impediment-avoidance not to generalize to shutdown-resistance. One useful example here might be backdoors. If you just finetune your model to behave badly on a trigger like ’2024′, it might also generalize to behaving badly on a trigger like ‘2023’. But if you finetune your model to behave badly given ‘2024’ and behave well given ‘2023’, you can get the bad behavior to stay limited to the ‘2024’ trigger.