It seems to me that a potential limitation of POST-Agency might be impediment avoidance. Take the “Work or Steal” example from Section 14. The agent might choose to work rather than steal if it believes that stealing is likely to be punished by jail time (a risk distinct from shutdown).
Similarly, if the agent believes a human is in the way of where a paperclip factory should be, it might send a killer drone to remove the human. If other humans would take down the killer drone, that creates a further impediment. Thus, the agent may scheme to take countermeasures in advance to minimize this impediment. To minimize the cost of dealing with the impediment, it may choose to hide its scheming from humans.
More generally, the utility-maximizing world states of a misaligned AI over long trajectories will still likely be bad and, therefore, still involve modeling some kind of human resistance. Although the agent will be unconcerned with avoiding early shutdown, utility-maximizing actions for minimizing the cost of human resistance may overlap heavily with shutdown resistance.
It also seems possible to me that the model relearns shutdown resistance as a generalization of impediment avoidance. It may avoid shutdown “just for fun” because “it enjoys being wary of potential impediments.”
With the Work or Steal example specifically, we want the agent to choose Work because that’s best for the user’s expected bank balance. With the drones example, things are different. POST-agents will start building the drone if that has the highest expected value (EV), taking the probability distribution over trajectory-lengths as fixed. That’s a downside, but it’s minor so long as the POST-agent builds the drone in a cheap, overt way, because then we can just notice and shut the agent down.
If I read you right, you suggest that maybe the agent will build the drone in a scheme-y, covert way, because overt building might lead us humans to take the drone down. But covert building in this situation seems fairly unlikely to me. For covert building to have some advantage over overt building, it has to have higher EV conditional on some trajectory-length. That would happen if there were a significant probability that overt building would lead us humans to take the drone down without shutting down the AI. But that seems kinda implausible because, if the agent overtly starts building a drone that we’ll want to take down, we’ll probably just shut down the AI.
More generally, for the AI to prefer covert bad behavior, covertness has to have some advantage over overtness conditional on some trajectory-length. And, I think, it will only have that advantage if overt bad behavior would lead us humans to fight the AI but not shut it down. But that seems unlikely. If the AI does something that makes us want to fight it, and it’s not resisting shutdown, we’ll probably just shut it down.
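To make the comparison concrete, here is a rough sketch of the conditional-EV reasoning above. Everything in it is an illustrative assumption rather than part of the POST-Agency proposal: the function names, the simple payoff model, and all the numbers are made up. `p_takedown_overt` stands for the probability, conditional on a fixed trajectory-length, that humans take the drone down without shutting down the AI.

```python
def conditional_ev(payoff, build_cost, p_takedown):
    """EV of a building strategy, conditional on one fixed trajectory-length.

    Hypothetical toy model: the agent gets `payoff` only if the drone is
    not taken down, and pays `build_cost` either way.
    """
    return (1.0 - p_takedown) * payoff - build_cost


def prefers_covert(payoff, overt_cost, covert_cost,
                   p_takedown_overt, p_takedown_covert=0.0):
    """True if covert building beats overt building at this trajectory-length.

    Shutdown does not appear here: the distribution over trajectory-lengths
    is taken as fixed, so only takedown-without-shutdown can separate the
    two strategies. Covert building is assumed (for illustration) to cost
    more and to be taken down less often.
    """
    overt = conditional_ev(payoff, overt_cost, p_takedown_overt)
    covert = conditional_ev(payoff, covert_cost, p_takedown_covert)
    return covert > overt


# Made-up numbers: covert building costs 5 more than overt building.
# If humans almost always respond to overt building by shutting the AI down,
# takedown-without-shutdown is rare.
rare_takedown = prefers_covert(100.0, 10.0, 15.0, p_takedown_overt=0.02)

# If humans often fight the drone while leaving the AI running.
frequent_takedown = prefers_covert(100.0, 10.0, 15.0, p_takedown_overt=0.5)
```

In this toy model, with `p_takedown_covert` at zero, covert building wins only when `p_takedown_overt * payoff` exceeds the extra cost of covertness. So the disagreement comes down to how large that conditional probability is; the claim above is that it’s small, because we’d usually just shut the AI down.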
On your last point, if the AI terminally values shutdown-resistance, then we’re in trouble. I think impediment-avoidance would plausibly generalize to shutdown-resistance if we weren’t training the AI to have any attitudes to shutdown elsewhere in the training process. But if we’re training the agent to satisfy POST and Neutrality+, I expect impediment-avoidance not to generalize to shutdown-resistance. One useful example here might be backdoors. If you just finetune your model to behave badly on a trigger like ‘2024’, it might also generalize to behaving badly on a trigger like ‘2023’. But if you finetune your model to behave badly given ‘2024’ and behave well given ‘2023’, you can get the bad behavior to stay limited to the ‘2024’ trigger.