New report: A review of the empirical evidence for existential risk from AI via misaligned power-seeking


Visiting researcher Rose Hadshar recently published a review of some evidence for existential risk from AI, focused on empirical evidence for misalignment and power-seeking. (Previously from this project: a blog post outlining some of the key claims that are often made about AI risk, a series of interviews with AI researchers, and a database of empirical evidence for misalignment and power-seeking.)

In this report, Rose looks into evidence for:

  • Misalignment,[1] where AI systems develop goals which are misaligned with human goals; and

  • Power-seeking,[2] where misaligned AI systems seek power to achieve their goals.

Rose found the current state of this evidence for existential risk from misaligned power-seeking to be concerning but inconclusive:

  • There is empirical evidence of AI systems developing misaligned goals (via specification gaming[3] and via goal misgeneralization[4]), including in deployment (via specification gaming), but it’s not clear to Rose whether these problems will scale far enough to pose an existential risk.

  • Rose considers the conceptual arguments for power-seeking behavior from AI systems to be strong, but notes that she could not find any clear examples of power-seeking AI so far.

Given these considerations, Rose thinks that it’s hard to be very confident either that misaligned power-seeking poses a large existential risk, or that it poses no existential risk. She finds this uncertainty concerning, given the severity of the potential risks in question. Rose also said that it would be good to have more reviews of evidence, including evidence for other claims about AI risks[5] and evidence against AI risks.[6]

  1. “An AI is misaligned whenever it chooses behaviors based on a reward function that is different from the true welfare of relevant humans.” (Hadfield-Menell & Hadfield, 2019)

  2. Following Carlsmith (2022), Rose defines power-seeking as “active efforts by an AI system to gain and maintain power in ways that designers didn’t intend, arising from problems with that system’s objectives.”

  3. “Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome.” (Krakovna et al., 2020)

  4. “Goal misgeneralization is a specific form of robustness failure for learning algorithms in which the learned program competently pursues an undesired goal that leads to good performance in training situations but bad performance in novel test situations.” (Shah et al., 2022a)

  5. Joseph Carlsmith’s report Is Power-Seeking AI an Existential Risk? reviews some evidence for most of the claims that are central to the argument that AI will pose an existential risk.

  6. Last year, Katja wrote Counterarguments to the basic AI x-risk case, which outlines some arguments against existential risk from AI.