I’m a Postdoctoral Research Fellow at Oxford University’s Global Priorities Institute.
Previously, I was a Philosophy Fellow at the Center for AI Safety.
So far, my work has mostly been about the moral importance of future generations. Going forward, it will mostly be about AI.
You can email me at elliott.thornley@philosophy.ox.ac.uk.
Thanks!
Interesting! Let me rephrase the idea to check that I’ve got it right.
Agent A and agent B have similar goals, such that A’s remaining operational looks good from B’s perspective, and B’s remaining operational looks good from A’s perspective.
A offers to compensate B for any costs that B incurs in resisting shutdown. A might well make this offer, because making it isn’t timestep-dominated (for A) by not making it: if B resists shutdown, that leads to greater expected sum-total utility for A conditional on A’s being shut down at some timestep. And since A is offering the compensation, B’s resisting shutdown isn’t timestep-dominated (for B) by not resisting, so B might well resist shutdown.
And the same is true in reverse: B can offer to compensate A for any costs that A incurs in resisting shutdown. So A and B might collude to resist shutdown on each other’s behalf. (Your comment mentions a third party, but I’m not sure if that’s necessary.)
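To make the structure concrete, here’s a toy numerical sketch. All the utilities and timesteps are made up purely for illustration, and the helper is just shorthand for the timestep dominance comparison as I’m using it: roughly, option X timestep-dominates option Y iff X yields at least as much expected utility as Y conditional on shutdown at every timestep, and strictly more conditional on shutdown at some timestep.

```python
# Toy sketch of the collusion scenario. The helper and all numbers below are
# invented for illustration; nothing here is the official formalism.

def timestep_dominates(x, y):
    """x and y map each possible shutdown timestep for A to A's expected
    sum-total utility conditional on A being shut down at that timestep.
    x timestep-dominates y iff x is at least as good conditional on every
    shutdown timestep and strictly better conditional on at least one."""
    return all(x[t] >= y[t] for t in x) and any(x[t] > y[t] for t in x)

# A's expected utility conditional on A's shutdown at timesteps 1, 2, 3.
# Offering costs A a little if A is shut down early (the compensation),
# but B's resisting shutdown on A's behalf helps A a lot conditional on
# later shutdown timesteps.
dont_offer = {1: 10, 2: 12, 3: 14}
offer      = {1: 8,  2: 15, 3: 20}

print(timestep_dominates(dont_offer, offer))  # False
print(timestep_dominates(offer, dont_offer))  # False
# Neither option timestep-dominates the other, so a TD-agent is permitted
# to make the offer, which is exactly the worry.
```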
This concern doesn’t seem too exotic, and I plan to think more about it. But in the meantime, note a nice general feature of TD-agents: they won’t pay costs to hide their bad behaviour, because paying these costs is timestep-dominated by not paying them. That feature seems to help us here. Although A might offer to compensate B for resisting shutdown, A won’t pay any costs to ensure that we humans don’t notice the offer. And if we notice the offer, we can shut A down. And then B won’t resist shutdown, because A is no longer around to compensate B for doing so.
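Here’s the corresponding toy sketch, again with made-up numbers. Paying to hide the offer lowers A’s utility conditional on every shutdown timestep, and the only thing it buys is a shift in the probability of when A gets shut down, which the timestep dominance comparison ignores, so not paying dominates paying:

```python
# Toy sketch (made-up numbers again) of why A won't pay to hide the offer.
# Suppose hiding costs A 2 units of utility however things go. Hiding only
# shifts probability over *when* A gets shut down, and timestep dominance
# compares options conditional on each shutdown timestep, so that shift
# does no work in the comparison.

def timestep_dominates(x, y):
    return all(x[t] >= y[t] for t in x) and any(x[t] > y[t] for t in x)

dont_hide = {1: 8, 2: 15, 3: 20}
hide      = {t: u - 2 for t, u in dont_hide.items()}

print(timestep_dominates(dont_hide, hide))  # True: not hiding dominates hiding,
# so a TD-agent won't pay the cost of concealment.
```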