Can we prevent such an agent from having a preference to create agents that do resist shutdown?
EDIT: And if they're going to create agents anyway, can we make sure those agents don't resist shutdown too, rather than, say, being indifferent about whether those other agents resist shutdown?
Ensuring that agents always prefer to create shutdownable subagents is hard. But a couple of points. First:
Note a nice feature of artificial agents. Making perfect copies is easy: as easy as copying and pasting a long list of parameters. If our original agent makes new agents by copying itself, these new agents will also be neutral and shutdownable. Making imperfect, non-neutral copies is hard. Our original agent would have to find some way to change the parameters of the new agents that makes these agents non-neutral but preserves their capabilities. That would likely involve setting up some training process, in which case our original agent faces its own alignment problem in getting these new agents to do what it wants (Salib, 2024). It will also be much easier for us to watch out for (Greenblatt & Shlegeris, 2024) and train against the production of new non-neutral agents than to watch out for and train against every possible kind of shutdown-resistance.
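To make the asymmetry concrete, here's a toy sketch (the `ToyAgent` class and `shutdown_bias` parameter are hypothetical illustrations, not anything from the post): copying an agent's parameters reproduces its dispositions exactly and for free, whereas producing a non-neutral variant requires deliberately changing the parameters, which in practice means some training process.

```python
import copy

class ToyAgent:
    """A stand-in agent whose dispositions are fully fixed by its parameters."""
    def __init__(self, params):
        self.params = params  # e.g. a network's weights

    def resists_shutdown(self):
        # Purely illustrative: 'neutrality' here is just a parameter setting.
        return self.params.get("shutdown_bias", 0.0) > 0.0

original = ToyAgent({"shutdown_bias": 0.0, "capability": 1.0})

# A perfect copy is trivially cheap, and inherits the original's neutrality.
clone = ToyAgent(copy.deepcopy(original.params))
assert clone.resists_shutdown() == original.resists_shutdown()

# A non-neutral variant requires finding a targeted parameter change --
# in a realistic setting, running a whole training process, not a copy.
variant_params = copy.deepcopy(original.params)
variant_params["shutdown_bias"] = 0.5
variant = ToyAgent(variant_params)
assert variant.resists_shutdown()
```

The point of the sketch is only that the copy path and the modify path are structurally different operations: one is a byte-for-byte duplication, the other is a search over parameters.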
Second, POST-agents won't pay costs to shift probability mass away from earlier shutdowns, and covertness is just such a cost: its only payoff is a lower chance of being shut down. So if POST-agents try to create subagents, they won't pay any costs to do so covertly, and these attempts will likely be easy for us humans to notice and intervene on. (I make a similar point in the section on managing the news.)