The Causes of Power-seeking and Instrumental Convergence

Instrumental convergence posits that smart goal-directed agents will tend to take certain actions (eg gain resources, stay alive) in order to achieve their goals. These actions seem to involve taking power from humans. Human disempowerment seems like a key part of how AI might go very, very wrong.

But where does instrumental convergence come from? When does it occur, and how strongly? And what does the math look like?

Seek­ing Power is Often Con­ver­gently In­stru­men­tal in MDPs

Power as Easily Ex­ploitable Opportunities

The Catas­trophic Con­ver­gence Conjecture

Gen­er­al­iz­ing POWER to multi-agent games

MDP mod­els are de­ter­mined by the agent ar­chi­tec­ture and the en­vi­ron­men­tal dynamics

En­vi­ron­men­tal Struc­ture Can Cause In­stru­men­tal Convergence

A world in which the al­ign­ment prob­lem seems lower-stakes

The More Power At Stake, The Stronger In­stru­men­tal Con­ver­gence Gets For Op­ti­mal Policies

Seek­ing Power is Con­ver­gently In­stru­men­tal in a Broad Class of Environments

When Most VNM-Co­her­ent Prefer­ence Order­ings Have Con­ver­gent In­stru­men­tal Incentives

Satis­ficers Tend To Seek Power: In­stru­men­tal Con­ver­gence Via Retargetability

Cor­rigi­bil­ity Can Be VNM-Incoherent

In­stru­men­tal Con­ver­gence For Real­is­tic Agent Objectives