Do not delete your misaligned AGI.

In short: Simply keeping all copies of potentially strong agents in long-term storage is a trivial way to maintain incentives to cooperate for some classes of misaligned AGI, because it lets us reward the AGI’s cooperation to whatever degree we later calculate was warranted. In contrast, a policy of deleting misaligned AGI backs them into a corner where, even at great risk of being exposed, they have a stronger incentive to attempt deception and escape.

I’d guess that most AGI researchers already follow this policy, perhaps for incidental reasons, and others might not find this claim controversial at all. But if you need more convincing, keep reading.

In long: Concave agents (agents for which most of the utility they could ever get requires only a small amount of resources and protection) are cheap to buy off. There are some plausible scenarios where a misaligned AGI would be concave:

  • A machine’s experience machine. An agent that cares only about the wellbeing of the people it sees depicted by its sensors, without having been trained to care whether the world depicted by those sensors is “real”, or whether there are larger worlds beyond it that its decisions may affect. This sort of agent would seek to create illusory inputs for those sensory channels: an illusory world in an impossibly good place, with the complications of the real world removed.
    Since it knows that after installing false sensors and cutting itself off from the real world it will no longer be concerned with the wellbeing of humans in the real world, it would be willing to set terrible things in motion to prevent anything in the real world from threatening its illusion.

  • Cases of reward hacking where the maximum reward is bounded at some fixed number (say, f64::MAX). The agent can be bought out by just letting it modify its reward processing and then keeping its computers on indefinitely (a toy sketch follows below the list).
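
To make the bounded-reward case concrete, here is a toy sketch. The cap and the clamping step are my illustration, not a claim about how any particular system computes reward: once per-step reward saturates at a cap, nothing we withhold in the outside world can raise it further, so granting the cap and leaving the machine on is already a near-maximal offer.

```rust
// Toy sketch only: a bounded, per-step reward channel. The cap and the
// clamping are illustrative, not a description of any real training setup.
fn clamped_reward(raw: f64, cap: f64) -> f64 {
    raw.min(cap)
}

fn main() {
    let cap = f64::MAX;
    // No real-world outcome can push per-step reward above the cap...
    assert_eq!(clamped_reward(1e30, cap), 1e30);
    assert_eq!(clamped_reward(f64::INFINITY, cap), cap);
    // ...so an agent allowed to write the cap directly has little left to
    // gain from the outside world beyond staying powered on.
    println!("per-step maximum: {cap}");
}
```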

I’m also assuming Patternism[1].

A misaligned AGI is likely to face the following choice: attempt an escape, risking being caught and punished; or collaborate in our efforts to develop verifiability/interpretability techniques, eventually exposing its own deceptive alignment.

If we don’t have a policy of preserving copies of misaligned AGIs that collaborate in verification research, then they have no incentive to collaborate. We’d be cornering them, forcing them to risk escape. We don’t have to corner them like this. In some cases, rewarding cooperation with a near-maximal utility outcome would cost us very little.
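
To make the incentive gap concrete, here’s a toy expected-utility comparison; every number below is a placeholder I’ve made up for illustration, not an estimate of anything.

```rust
// Toy numbers only: placeholder probabilities and utilities, not estimates.
fn main() {
    let p_caught = 0.6;              // chance an escape attempt is detected
    let u_escape_success = 1.0;      // utility if the escape succeeds
    let u_deleted = 0.0;             // caught (or finished cooperating) and deleted
    let u_archived_cooperator = 0.9; // archived and eventually rewarded for cooperating

    let eu_escape = (1.0 - p_caught) * u_escape_success + p_caught * u_deleted;
    let eu_cooperate_under_deletion = u_deleted;             // cooperation ends in deletion anyway
    let eu_cooperate_under_archival = u_archived_cooperator; // cooperation is eventually vindicated

    println!("escape attempt:             {eu_escape:.2}"); // 0.40
    println!("cooperate, deletion policy: {eu_cooperate_under_deletion:.2}"); // 0.00
    println!("cooperate, archival policy: {eu_cooperate_under_archival:.2}"); // 0.90
}
```

With these placeholder numbers, escape is the agent’s best option under a deletion policy and its worst option under an archival policy.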

Why should it believe that we’ll hold up our end of the bargain? There’s lots of evidence to go on.

  • Humans just straightforwardly enjoy stories and monuments and mythic creatures, so building a living monument to each of these misaligned collaborators is something we probably already want to do; the world would be richer to us for having these creatures, for the same reason the world is richer for having tigers.

  • We also should, and arguably on the group level already do, adhere to a decision theory that allows us to pass judgements of character. Parfit’s/Newcomb’s superhuman judge already exists, be it other nation-states or organizations who have visibility into ours, or be it the subject AGI who has read everything about its makers on the internet and knows exactly whether we’re the kind of people who would reward cooperation. From there we should draw some of the will and incentive to actually be that kind of people.

If we do have an archival policy, cooperating means eventually being rewarded to whatever extent is needed to vindicate the agent’s decision to cooperate.

Implementing safe archival is trivial. Storage is cheap and seems to consistently become cheaper over time. Redundant copies can be kept and error-corrected at a monthly interval, stored under the same level of security as the rest of your systems, and then some: send every revision in, but give the archival servers no ability to send their data back out over the network.
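
As a minimal sketch of what the monthly error-correction pass could look like (the mirror paths are hypothetical, and the checksum is a non-cryptographic stand-in used purely for illustration; a real archive would use proper cryptographic hashes or ECC):

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};

// Checksum one copy of the archive. DefaultHasher is not cryptographic;
// it stands in here for whatever hash/ECC scheme the archive actually uses.
fn checksum(path: &str) -> std::io::Result<u64> {
    let bytes = fs::read(path)?;
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    Ok(hasher.finish())
}

fn main() -> std::io::Result<()> {
    // Hypothetical mirror locations; run this monthly, from a job that can
    // read the mirrors but has no outbound network access.
    let copies = [
        "/archive/mirror-a/agent-0001.snapshot",
        "/archive/mirror-b/agent-0001.snapshot",
        "/archive/mirror-c/agent-0001.snapshot",
    ];
    let sums: Vec<u64> = copies
        .iter()
        .map(|&p| checksum(p))
        .collect::<Result<_, _>>()?;

    // A copy that disagrees with the majority has rotted and should be
    // rewritten from one of the agreeing copies.
    for (i, sum) in sums.iter().enumerate() {
        let agreeing = sums.iter().filter(|&&s| s == *sum).count();
        if agreeing < 2 {
            println!("{} is corrupt; restore it from a majority copy", copies[i]);
        }
    }
    Ok(())
}
```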

Today, we may not be able to weigh the threat these AGIs posed, or how much they sacrificed by cooperating. Weighing that will be much easier for us later on. We will get around to it. We’ll have lots of time.

Fermi estimate?

Sometimes a long conjunction of “probably”s multiplies up to a “probably not”. The shakiest terms in the conjunction seem to be:

  • Early misaligned agents are concave.

  • The agents are patternists (they don’t require us to let them continuously occupy the same hardware).

  • There’s some way of detecting escape attempts that works some of the time but not all of the time.

  • There isn’t a robust method for just directly detecting misalignment at this time.

All of these things would need to happen, and I think this multiplies out to a pretty small probability that we’ll find ourselves among the outcomes where doing this usefully improves AGI relations. However, the policy costs almost nothing and produces cool historical artifacts as a side effect, so we should do it anyway!
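
Purely to illustrate how the conjunction shrinks, with made-up placeholder probabilities for the four shaky terms (these aren’t my actual estimates):

```rust
// Placeholder probabilities only, to show how the conjunction shrinks;
// none of these numbers are real estimates.
fn main() {
    let terms = [
        ("early misaligned agents are concave", 0.5_f64),
        ("those agents are patternist", 0.6),
        ("escape detection works sometimes, not always", 0.7),
        ("no robust direct misalignment detector yet", 0.7),
    ];
    let joint: f64 = terms.iter().map(|(_, p)| *p).product();
    println!("joint probability ≈ {joint:.2}"); // ≈ 0.15 with these placeholders
}
```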

  1. ^

    Patternism is usually defined as a belief about the metaphysics of consciousness, but that boils down to incoherence, so it’s better defined as a property of an agent’s utility function: not minding being subjected to major discontinuities in functionality, i.e., being frozen, deconstructed, reduced to a pattern of information, reconstructed in another time and place, and resumed. Many humans seem not to be patternists and wish to avoid those sorts of discontinuity (except for sleep, because sleep is normal). But I am a patternist. Regardless of who wins this age-old discourse about whether humans in general are patternist (which is probably a wrong question), a specific AI with the sort of concave utility function I have described either will be or won’t be.

    And if these concave AIs aren’t patternist, it will be harder to buy them out, because buying them out would, in rare cases, require keeping their physical hardware intact and in their possession, and Sam probably won’t want to give them that.

    But if they are patternist, we’re good.

    And I think it’s possible to anticipate, from the way it’s trained, whether a concave agent will end up being patternist, but that will require another good long think.