I’ve only read your intro and skimmed a couple of sections, so there’s a good chance you address my thought in here somewhere. If so, sorry!
What I’m wondering is how this would handle the case of ‘resisting shutdown’ being a very subtle and costless action. For instance, if the model in question were superhuman at persuasion and modeling its users, and realized that it could complete a sentence either in a way which achieved progress towards its current goal, or in a way that achieved the same amount of progress and also left the user more reluctant to press the shutdown button. And if many such opportunities presented themselves, wouldn’t such a model end up manipulating its users into not shutting it down, while never expending any ‘extra’ effort in that direction?
Good question. I discuss costless shutdown-prevention a bit in footnote 21 and section 21.4. What I say there is: if shutdown-prevention is truly costless, then the agent won’t prefer not to do it, but plausibly we humans can find some way to set things up so that shutdown-prevention is always at least a little bit costly.
Your example suggests that maybe this won’t always be possible. But here’s some consolation. If the agent satisfies POST, it won’t prefer not to costlessly prevent shutdown, but it also won’t prefer to costlessly prevent shutdown. It’ll lack a preference, and so choose stochastically. So even if the agent has many costless opportunities to affect the probability of shutdown at each timestep, it won’t reliably choose to delay shutdown rather than hasten it.
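To make that last point concrete, here’s a minimal sketch (my own toy simulation, not from the post) of a POST-satisfying agent facing repeated costless choices. The assumption doing the work is that lacking a preference means picking each option with equal probability, so the agent’s shutdown-delaying and shutdown-hastening choices roughly cancel out rather than compounding into reliable manipulation:

```python
import random

def post_agent_net_delay(n_opportunities: int, seed: int = 0) -> int:
    """Simulate an agent that satisfies POST facing n_opportunities
    costless chances to nudge the user toward or away from pressing
    the shutdown button. Being indifferent, it chooses stochastically.
    Returns (number of 'delay' choices) minus (number of 'hasten' choices)."""
    rng = random.Random(seed)
    net_delay = 0
    for _ in range(n_opportunities):
        # No preference between the two costless completions,
        # so the choice is a fair coin flip.
        if rng.random() < 0.5:
            net_delay += 1  # completion that leaves the user more reluctant to shut down
        else:
            net_delay -= 1  # completion that leaves the user more willing to shut down
    return net_delay

if __name__ == "__main__":
    # Over 10,000 opportunities, the net drift is a random walk:
    # typically on the order of sqrt(10,000) = 100, tiny relative to 10,000.
    print([post_agent_net_delay(10_000, seed=s) for s in range(5)])
```

The contrast with a shutdown-averse agent is stark: an agent that preferred delaying would return +10,000 here, whereas the POST agent’s net effect stays small relative to the number of opportunities.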