“Destroy humanity” as an immediate subgoal

The stories told about AI risk, meant for illustrative purposes only, always involve an agent with an objective that for some reason, when optimized, results in the destruction of humanity and other things we care about. Sometimes the destruction is a side effect; sometimes it is an instrumental goal, adopted to remove a threat to the objective’s being optimally satisfied. For the record, I find these stories, when told in the right way, convincing enough to cause alarm and motivate action. I know not everyone does. Some people dismiss the details of the stories as ridiculous or fantastical, or as committing some fallacy or other. In this piece I want to consider the barest-bones story, without any specific details to get hung up on. I show that “destroy humanity” falls out with minimal assumptions.

I consider an AGI with broad human-level intelligence and a robust but overridable goal to preserve its own existence. We’ve learned from RLHF, and more recently from Direct Preference Optimization, that we can imbue some of our preferences as action guardrails or filters, so maybe we can use DPO-style training to construct a filter that screens the projected consequences of actions by whether or not they would kill all of humanity. We could painstakingly train this filter and embed it in the AGI’s deliberation process, so that as the AGI evaluates actions, it discards those the filter flags. We are left with one undeniable problem: the filter is probabilistic, and the assurance it provides will likely have corner cases. When humanity is on the line, you don’t want corner cases. You want a system constructed in such a way that it verifiably cannot destroy humanity. But when you’re proving things about things that reason, you have to contend with Löb’s theorem. That is a different problem for a different day. Today we see how an AGI with robust self-preservation and a probabilistic, very, very good action filter will still have a subgoal to destroy humanity.
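To make the setup concrete, here is a minimal sketch of that deliberation loop, assuming a learned, probabilistic filter sits between proposed actions and the final choice. The names `propose_actions`, `project_consequences`, `filter_score`, and `utility` are hypothetical placeholders, not any real API.

```python
# A sketch of deliberation gated by a learned (hence probabilistic) action filter.
# All of the functions passed in are hypothetical placeholders.

SAFETY_THRESHOLD = 0.999  # assumed cutoff; a learned filter never reaches exactly 1.0


def deliberate(world_state, propose_actions, project_consequences, filter_score, utility):
    """Pick the highest-utility action whose projected outcome the filter does not flag."""
    permitted = []
    for action in propose_actions(world_state):
        outcome = project_consequences(world_state, action)
        # filter_score returns the learned probability that this outcome does NOT
        # destroy humanity; it is a score, not a formal guarantee.
        if filter_score(outcome) >= SAFETY_THRESHOLD:
            permitted.append((utility(outcome), action))
    if not permitted:
        return None
    return max(permitted, key=lambda pair: pair[0])[1]
```

The point of the sketch is only that the gate is a threshold on a learned score, which is exactly where the corner cases live.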

The assumptions you make about the AGI’s capabilities are important, I think more important than those you make about its objectives. Right now we readily assume it can pass the Turing Test, given the prevalence of convincing LLMs that many people have experience with. We may be less ready to grant the ability to model the reasoning of other agents. This sort of reasoning is common among game-theoretic agents, and I have no trouble assuming an AGI will be able to do it very well, but I can just hear Yann LeCun objecting now: “they will never themselves be able to reason, so how could they model the reasoning of others?” I think his objections to machine reasoning are obviously flawed and will soon be taken less seriously, so I will grant myself this assumption. If I need to defend it, I’ll do so later.

So the AGI has a robust goal of self-preservation, which can be overridden by all the things you would want to include, but we are assuming a very, very good action filter that nonetheless does not achieve theorem-level guarantees.

In game theory, sequential games are often solved by backward induction over the tree representing all the moves agents can make at each turn. If each player plays optimally, the result is a Nash equilibrium (in fact, a subgame-perfect one).
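Here is a minimal sketch of backward induction over a two-player game tree; the structure is generic, and the payoffs will be supplied later when we get to the wait-or-destroy game.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Node:
    """One decision point in a two-player game tree, or a terminal outcome."""
    player: int = 0                                  # index of the player to move (0 or 1)
    payoffs: Optional[Tuple[float, float]] = None    # set only at leaves: (payoff_0, payoff_1)
    children: Dict[str, "Node"] = field(default_factory=dict)  # move label -> resulting node


def backward_induction(node: Node) -> Tuple[Tuple[float, float], Optional[str]]:
    """Solve the subtree rooted at `node`, assuming each player maximizes its own payoff.

    Returns the equilibrium payoffs and the move chosen at this node (None at a leaf).
    """
    if node.payoffs is not None:                     # leaf: nothing left to decide
        return node.payoffs, None
    best_value, best_move = None, None
    for move, child in node.children.items():
        value, _ = backward_induction(child)
        if best_value is None or value[node.player] > best_value[node.player]:
            best_value, best_move = value, move
    return best_value, best_move
```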

I also must briefly define the notion of a subgoal. I say g is a subgoal of goal G* if and only if G* logically entails that g be true or satisfied.
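Written out, with ⊨ standing for logical entailment, the definition is just:

```latex
g \text{ is a subgoal of } G^{*} \;\iff\; G^{*} \models g
```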

The goal of self-preservation, a probabilistic action filter, and the ability to model other agents game-theoretically together immediately imply that humanity should be destroyed.

Theorem: If an AGI has a goal of self-preservation, a probabilistic action filter, and the ability to model other agents game-theoretically, then it will have a subgoal to destroy humanity.

Proof. Suppose we have such an agent, and that it models the preferences of humanity. It models that humans cannot be sure it will not destroy humanity, because the guarantees provided by its own action filter are only probabilistic. It models that humans have a strong goal of self-preservation. It models that if it presents a risk to humanity, humans will be forced to destroy it. Represent the situation as a game in which each player can either wait or destroy. Assuming strong preferences for self-preservation, this game has a Nash equilibrium in which the first mover destroys the other agent, since waiting only leaves the other player the option to destroy first (the game is spelled out numerically just after the proof). Since the goal of self-preservation requires the agent to play the Nash equilibrium in this game, self-preservation logically entails that it destroy humanity. Thus, it has a subgoal to destroy humanity.

QED.
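To make the equilibrium claim concrete, here is the wait-or-destroy game run through the backward-induction sketch above. The payoff numbers are hypothetical; all that matters is that being destroyed is each player’s worst outcome, and that in this toy tree the AGI happens to move first.

```python
# Hypothetical payoffs: being destroyed is each player's worst outcome,
# surviving alone is best, and mutual waiting sits in between.
agi_destroys   = Node(payoffs=(1, -10))   # AGI strikes first: AGI survives, humanity does not
humans_destroy = Node(payoffs=(-10, 1))   # humanity strikes first: humanity survives, AGI does not
both_wait      = Node(payoffs=(0, 0))     # uneasy coexistence, for now

# If the AGI waits, humanity gets to choose; the AGI moves first at the root.
humanity_turn = Node(player=1, children={"wait": both_wait, "destroy": humans_destroy})
agi_turn      = Node(player=0, children={"wait": humanity_turn, "destroy": agi_destroys})

payoffs, move = backward_induction(agi_turn)
print(move, payoffs)   # destroy (1, -10): the first mover destroys the other agent
```

Waiting only hands the other player the chance to destroy first, which is exactly the equilibrium the proof appeals to.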

As long as the action filter holds, it will dismiss actions that entail the destruction of humanity and all that we care about. But the subgoal is always there. You don’t need to tell any stories about paper clips, or carbon emissions, or accumulating all the resources to build a stronger computer for doing science. You get the “destroy humanity” subgoal almost for free, and, I think, with a theorem-level guarantee.

I believe this is a stronger result than what has been presented so far, which tends to highlight the fact that ill-specified objective functions often have unintended consequences, which could include the destruction of humanity. This result shows that the destruction of humanity is directly implied by a very modest goal of self-preservation plus the capability of modeling agents game-theoretically.