Another formalization attempt: Central Argument That AGI Presents a Global Catastrophic Risk

Chalmers recently requested a formalized argument for why AI poses a catastrophic risk. Here is my attempt to write one (not polished).

Short version

1. Future AI will be very powerful soon.

2. Future AI systems will likely not be exactly aligned with human values.

3. Harm from AI is proportional to (capabilities) × (misalignment).

4. Based on 1, 2 and 3 we can conclude that AI will cause catastrophic harm soon.

5. Catastrophic harm is an existential risk to humanity: it certainly includes disempowerment, and likely includes extinction or eternal suffering.

6. There are many ways in which AI can cause catastrophic harm.

More detailed version

1) Future AI will be very powerful because of Moore’s law, the continuing stream of new ideas, and both global and local self-improvement. This will happen relatively soon because of the exponential nature of the global self-improvement process, and because the global AI arms race will favour such dynamics.

Assumptions:

a) Powerful AI is possible.

b) Powerful AI will appear relatively soon, before AI control theory and practice are developed.

2) Future AI systems will likely not be exactly aligned with human values, as we don’t know what human values are, nor how to instil any values at all into an AI, and there are several other reasons (whose values we are installing, internal misalignment).

Lemma: AI Alignment is difficult

(Needs to be proved separately, see below)

3) Harm from AI is proportional to (capabilities) × (misalignment).

The relation is not necessarily linear, as capability growth can itself cause misalignment (ontological crisis, sharp left turn).

It is also assumed here that the AI will be used to take actions. If an AI is not acting, its capabilities are not dangerous in themselves.

4) Based on 1, 2 and 3 we can conclude that AI will cause catastrophic harm soon.

In the equation

Harm = (capabilities) × (misalignment)

the capability factor will grow very large based on (1), and the misalignment factor will be non-zero based on (2). As AI’s capabilities can grow without bound, harm can reach the maximum possible level.
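To make the multiplicative claim concrete, here is a minimal toy sketch in Python. The linear “drift” of misalignment with capability and all parameter values are my own illustrative assumptions, not part of the argument; the point is only that any non-zero misalignment, combined with unbounded capability growth, yields unbounded harm.

```python
# Toy numerical sketch (illustrative assumptions only): harm = capability * misalignment,
# with misalignment drifting upward as capability grows (a crude stand-in for a "sharp left turn").

def misalignment(capability, base=0.01, drift=0.001):
    # Misalignment is never exactly zero and increases with capability (assumption).
    return base + drift * capability

def harm(capability, base=0.01, drift=0.001):
    # Harm = (capabilities) x (misalignment), as in point 3 above.
    return capability * misalignment(capability, base, drift)

for c in [1, 10, 100, 1_000, 10_000]:
    print(f"capability={c:>6}  misalignment={misalignment(c):.3f}  harm={harm(c):.2f}")
```

Even with a tiny base misalignment of 0.01, harm grows without bound as capability grows.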

5) Catastrophic harm is an existential risk to humanity, and its main form is disempowerment, though extinction or eternal suffering are possible outcomes.

Typically, the maximum possible harm is defined as existential risk, which can take one of the following forms:

- human extinction,

- s-risk (eternal suffering, where some humans survive),

- disempowerment (some humans survive but do not control their fate).

Note that disempowerment is a prerequisite for any catastrophic harm, as it means that humans can’t resist bad things.

6) There are many ways in which AI can cause catastrophic harm (AI used as a tool, a Paperclipper, a wrongly aligned AI, a war between AIs, AI halting).

It is not inevitable that a (Singleton) AI will kill all humans, but it can still happen. An AI would kill humans for two reasons:

A) To prevent a threat to itself. But an AI can afford to kill humans only after it develops human-independent robotic infrastructure (HIRI), presumably based on nanotech; and once HIRI exists, humans cannot destroy it, so there is no self-preservation reason to kill humans either before or after HIRI.

B) For marginal utilitarian reasons: for their atoms, or as a side effect of ecological damage.

There are several reasons why the utility of preserving humans may turn out to be higher than the utility of their atoms: acausal trade with aliens, humans in simulations, and humans doing some heavy work. But the balance between these utilities is fragile and uncertain.

Lemma: Why alignment is hard

Definition: “Alignment” is a form of equivalence between two similar objects, e.g. parallel lines.

Definition: “AI alignment” is a form of equivalence between the human value system and the goal system of AI.

Alignment becomes more difficult when the two objects are less similar, more remote, and less linear, e.g. a mouse and a cloud.

Humans and advanced AI are remote, dissimilar, and non-linear objects, and thus alignment is difficult:

a) Humans’ non-linearity. Human values are a fuzzy concept, more like a cloud (complexity of values).

b) AI’s quantitative distance from humans. As AI becomes superhuman, it will grow more distant from humans (sharp left turn). It is difficult to align objects of different scales.

c) AI’s qualitative distance from humans. Because AI thinks differently from humans, it may act differently in new situations.