# Game-theoretic Alignment in terms of Attainable Utility

### Acknowledgements:

This article is a writeup of research conducted through the SERI program under the mentorship of Alex Turner. It extends our research on game-theoretic POWER and Alex’s research on POWER-seeking.

Thank you to Alex for being better at this than I am (hence mentorship, I suppose) and to SERI for the opportunity to conduct this research.

## Motivation: POWER-scarcity

The starting point for this post is the idea of POWER-scarcity: as unaligned agents grow smarter and more capable, they will eventually compete for power (as a convention, “power” is the intuitive notion while “POWER” is the formal concept). Much of the foundational research behind this project is devoted to justifying that claim: Alex’s original work suggests POWER-seeking behavior and in particular catastrophic risks associated with competition for POWER, while our previous project formalizes POWER-scarcity in a game-theoretic framework.

One of the major results of our previous project was a proof that POWER is scarce in the special case of constant sum games. Additionally, we had a partial notion of “POWER isn’t scarce by the definition we care about” for common-payoff games. We interpret these results as limiting cases of a more general relationship between “agent alignment” and POWER-scarcity:

In a common-payoff game, players are “maximally aligned” in the sense that their incentives are identical (in the VNM sense of identical preference orderings between states). We don’t have a clean expression for this in terms of POWER; the simplest relevant consequence is “the action profile that maximizes the common payoff is individually optimal for each player simultaneously”. We present a more natural characterization later in the post.

In a constant-sum game, players are “maximally unaligned” in the sense that there’s no concept of collective preferences: in the utilitarian sense, the group is indifferent between all outcomes of the game. We proved that in a constant-sum game, all Nash equilibrium strategy profiles have constant-sum POWER.

Given these results, we hypothesize that a more general relationship exists between agent alignment and POWER-scarcity. However, it’s unclear how to define “agent alignment” as a property of an arbitrary multiplayer game; the limiting cases only capture narrow intuitions about obviously (un)aligned agents.

This presented a clear set of research goals moving forward:

(1) Define a formal notion of “agent alignment” given an arbitrary (normal-form) multiplayer game.

(2) Relate the formal notion of alignment to POWER-scarcity.

I consider our project to make substantial progress on (1) and to suggest avenues of attack for (2), though not the ones we expected.

## Desiderata for Alignment Metrics

Setting out towards addressing (1), our optimistic roadmap looked something like this:

Describe an “alignment metric” mapping a game to some real number (or vector), loosely describing how aligned the game’s players are.

Find an (in)equality relating POWER-scarcity to the defined alignment metric

Already, our approach presupposes a lot: a real-valued alignment metric is a much more specific notion than merely “agent alignment in a multiplayer game”. However, we have an essentially scalar definition of POWER-scarcity already, so phrasing everything in terms of real-number inequalities would be ideal. Taking this as motivation, we narrow our “formal notion of agent alignment” from (1) into “$\mathbb{R}$-valued alignment metric”.

This narrows our problem statement to the point where we can start specifying criteria:

**Consistency with the limiting cases for agent (un)alignment.** In particular, the alignment metric should be (in some underspecified sense) “maximized” for common-payoff games and “minimized” for constant-sum games.

**Consistency with affine transformations.** In particular, applying any affine transformation to each player’s utility function should “preserve the structure of the game”, which should be represented in the alignment metric. This criterion can be strengthened in the following ways:

*Alignment metric is an affine function of players’ utilities.* Since the composition of affine functions is affine, this condition implies the above.

*Consistency under affine transformations for individual players.* The intuition is that affine transformations are precisely the set of transformations that “preserve preferences” in the VNM sense.

Another relevant distinction to be drawn is between *global* and *local* alignment metrics. Mathematically, we define a global metric to be strictly a function of a multi-player game, while a local metric is a function of both the game and a strategy profile. Intuitively, local metrics can “see” information about the strategies actually being played, while global metrics are forced to address the complexity of the entire game.

Local metrics tend to be a lot simpler than global metrics, since they can ignore much of the difficulty of game theory. However, we can construct a simple class of global metrics by defining some “natural” strategy profile for each game. We call these the *localized* global metrics, equipped with a *localizing* function that, given a game, chooses a strategy profile.
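The relationship between local, global, and localized metrics can be sketched in a few lines of Python. This is only an illustration: the function names (`local_sum_metric`, `max_welfare_localizer`, `localize`) and the Prisoners' Dilemma payoffs are assumptions chosen for the example, and only pure action profiles are considered.

```python
# Illustrative two-player game: action profile -> (u1, u2).
# The payoffs are assumptions made for this sketch.
PRISONERS_DILEMMA = {
    ("C", "C"): (2, 2),
    ("C", "D"): (0, 3),
    ("D", "C"): (3, 0),
    ("D", "D"): (1, 1),
}


def local_sum_metric(game, profile):
    """A local metric: depends on the game AND the profile being played."""
    return sum(game[profile])


def max_welfare_localizer(game):
    """A localizing function: picks one 'natural' profile for the game
    (here, the pure profile with the largest sum of utility)."""
    return max(game, key=lambda profile: sum(game[profile]))


def localize(local_metric, localizer):
    """Build a localized global metric, a function of the game alone."""
    def global_metric(game):
        return local_metric(game, localizer(game))
    return global_metric


global_sum_metric = localize(local_sum_metric, max_welfare_localizer)
```

Swapping in a different localizer (for example, one that selects a maxmin profile) would give a different global metric from the same local one.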

### Examples of Alignment Metrics

To give intuition on what such alignment metrics might look like, we present a few examples of simple alignment metrics for 2-player games, then test them on some simple, commonly-referenced games.

We’ll be using the following games as examples: Matching Pennies and the Prisoners’ Dilemma.

We’ll consider the following alignment metrics:

**Sum of utility:** $M = u_1 + u_2$

Considering the metric on our example games yields the following:

Matching Pennies is a zero-sum game, so the sum of utility will be uniformly zero ($M \equiv 0$).

Prisoners’ Dilemma will have $M = M_{DD} + c$, where $c$ is the number of players who choose to cooperate (as opposed to defect) and $M_{DD}$ is the sum of utility under mutual defection: each defection changes the two players’ rewards by $[+1, -2]$, lowering the sum by 1. The metric (correctly) suggests that alignment is (in some sense) correlated with cooperation in PD.

Within this category, there are still some degrees of freedom. We can consider the local metric of expected sum utility given a strategy profile, or construct a number of localized global metrics by varying our choice of localizing function (example: max sum utility, minimax, …).

Such metrics are constant for the constant-sum case, but vary in the common-payoff case, thus partially satisfying the “limiting cases” criterion. Summation is itself an affine transformation, so this metric fulfills the stronger version of the “affine transformation” criterion.
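As a rough check of these claims, the local version of the metric (expected sum of utility under independent mixed strategies) can be computed directly. The payoff tables below are assumptions: Matching Pennies uses win/loss payoffs of ±1, and the Prisoners' Dilemma is one payoff assignment consistent with a defection changing rewards by [+1, −2]. The function name `expected_sum_of_utility` is hypothetical.

```python
# Assumed payoff tables: action profile -> (u1, u2).
MATCHING_PENNIES = {
    ("H", "H"): (1, -1),
    ("H", "T"): (-1, 1),
    ("T", "H"): (-1, 1),
    ("T", "T"): (1, -1),
}
PRISONERS_DILEMMA = {
    ("C", "C"): (2, 2),
    ("C", "D"): (0, 3),
    ("D", "C"): (3, 0),
    ("D", "D"): (1, 1),
}


def expected_sum_of_utility(game, p1, p2):
    """Expected u1 + u2 when the players mix independently.

    p1 and p2 map each of a player's actions to its probability.
    """
    return sum(
        p1[a1] * p2[a2] * sum(game[(a1, a2)])
        for a1 in p1
        for a2 in p2
    )


# Sum of utility at each pure Prisoners' Dilemma profile, by cooperator count.
pd_sum_by_cooperators = {
    profile.count("C"): sum(payoffs)
    for profile, payoffs in PRISONERS_DILEMMA.items()
}
```

With these assumed payoffs, the sum is zero for every Matching Pennies profile and rises by one per cooperator in the Prisoners' Dilemma.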

**Covariance of utility:** $M = \mathrm{Cov}(u_1, u_2)$

Considering the metric on our example games yields the following:

Matching Pennies is a zero-sum game, so we have $u_2 = -u_1$. Thus, $M = \mathrm{Cov}(u_1, -u_1) = -\mathrm{Var}(u_1)$. Note that $M \le 0$, suggesting that the covariance metric tends to consider constant-sum games as less aligned than affine metrics.

In the Prisoners’ Dilemma, we see that the change in reward generated by a player defecting is [+1, −2]. Thus, if we fix the strategy profile of “player $i$ defects with probability $p_i$” (independently across players), then we find $M = -2\left[p_1(1 - p_1) + p_2(1 - p_2)\right]$. For example, for $p_1 = p_2 = \tfrac{1}{2}$ we have $M = -1$, suggesting that agents are relatively misaligned.

This “feels like” a local metric in the sense that there aren’t clear choices of localizing functions from which to define a localized global metric (in particular, the choice would significantly and unpredictably impact the metric).

This metric is consistent with the “limiting cases” criterion by properties of the covariance function. The relationship to the “affine transformation” criterion is odd: instead of an affine function of players’ utilities, covariance is a (simple) bilinear function. Thus, the metric is an affine function *in each component utility*, but not in the vector of players’ utilities.

Additionally, note that if $X$ is a constant variable, then $\mathrm{Cov}(X, Y) = 0$. Thus, if the strategy profile is deterministic, our metric will be $M = 0$.
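The covariance metric is easy to evaluate numerically for a given mixed strategy profile. The sketch below assumes independent mixing and uses illustrative payoff tables (±1 for Matching Pennies, and a Prisoners' Dilemma consistent with the [+1, −2] defection effect); `covariance_metric` is a hypothetical name.

```python
# Assumed payoff tables: action profile -> (u1, u2).
MATCHING_PENNIES = {
    ("H", "H"): (1, -1),
    ("H", "T"): (-1, 1),
    ("T", "H"): (-1, 1),
    ("T", "T"): (1, -1),
}
PRISONERS_DILEMMA = {
    ("C", "C"): (2, 2),
    ("C", "D"): (0, 3),
    ("D", "C"): (3, 0),
    ("D", "D"): (1, 1),
}


def covariance_metric(game, p1, p2):
    """Cov(u1, u2) when the players mix independently with p1 and p2."""
    # Probability of each pure profile under independent mixing.
    weighted = [((a1, a2), p1[a1] * p2[a2]) for a1 in p1 for a2 in p2]
    mean_1 = sum(prob * game[a][0] for a, prob in weighted)
    mean_2 = sum(prob * game[a][1] for a, prob in weighted)
    return sum(
        prob * (game[a][0] - mean_1) * (game[a][1] - mean_2)
        for a, prob in weighted
    )


uniform_pd = {"C": 0.5, "D": 0.5}
uniform_mp = {"H": 0.5, "T": 0.5}
always_cooperate = {"C": 1.0, "D": 0.0}
```

Under these assumptions, uniform mixing gives a covariance of −1 in both games, while a deterministic profile such as mutual cooperation gives 0, matching the observation that the metric vanishes when nothing is random.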

## Social Welfare and the Coordination-Alignment Inequalities

Another approach to the problem of alignment metrics comes from specifying what we mean by “alignment”. For the purposes of this section, we define “alignment” to be alignment with social welfare, which we define below:

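Consider an arbitrary $n$-player game, where player $i$ has utility $u_i(a)$ given an action profile $a$. Now, choose a *social welfare function* $w: \mathbb{R}^n \to \mathbb{R}$. Harsanyi’s theorem suggests that $w$ is an affine function; we’ll choose $w(\vec{u}) = \sum_i u_i$ for simplicity. Informally, we’ll now take “alignment of player $i$” to mean “alignment of $u_i$ with $w$”.

We start with the following common-sense bounds on $w$, which we call the *Coordination-Alignment Inequalities:*

$$w(\vec{u}(a)) \;\le\; \max_{a'} w(\vec{u}(a')) \;\le\; w\Big(\max_{a'} u_1(a'), \ldots, \max_{a'} u_n(a')\Big)$$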

We call the first inequality the *Coordination Inequality*, and the second inequality the *Alignment Inequality*. We present some basic intuition:

The Coordination inequality represents the difference between attained social welfare (“how well we’re doing right now”) and *maximum attainable* social welfare (“the best we can possibly do”).

The Alignment inequality represents the difference between attainable social welfare (“the best we can possibly do”) and *each player’s max attainable* social welfare (“the best we could possibly do, in a world in which everyone simultaneously gets their way”).
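To make the three quantities concrete, the following sketch computes them for a pure action profile, taking social welfare to be the plain sum of utilities. The payoff tables and the name `coordination_alignment_terms` are assumptions for illustration.

```python
# Assumed payoff tables: action profile -> (u1, u2).
MATCHING_PENNIES = {
    ("H", "H"): (1, -1),
    ("H", "T"): (-1, 1),
    ("T", "H"): (-1, 1),
    ("T", "T"): (1, -1),
}
PRISONERS_DILEMMA = {
    ("C", "C"): (2, 2),
    ("C", "D"): (0, 3),
    ("D", "C"): (3, 0),
    ("D", "D"): (1, 1),
}


def coordination_alignment_terms(game, profile):
    """Return (attained, max_attainable, ideal) social welfare.

    attained:       welfare at the profile actually played
    max_attainable: best welfare over all pure profiles
    ideal:          welfare if every player got their own best payoff at once
    """
    n_players = len(profile)
    attained = sum(game[profile])
    max_attainable = max(sum(payoffs) for payoffs in game.values())
    ideal = sum(
        max(payoffs[i] for payoffs in game.values())
        for i in range(n_players)
    )
    return attained, max_attainable, ideal
```

For mutual defection in the assumed Prisoners' Dilemma this gives `(2, 4, 6)`: a gap of 2 from failing to coordinate and a further gap of 2 from the players wanting incompatible outcomes.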

As it turns out, the limiting cases of alignment have a natural interpretation in terms of the C-A inequalities: they’re just equality cases!

In a common-payoff game, the global max common payoff achieves both the max attainable social welfare and the max individual payoffs for each player. Thus, common-payoff games are an equality case of the Alignment inequality.

In a *constant-welfare game* (where $w(\vec{u}(a))$ is constant in $a$), max social welfare is trivially achieved, so constant-welfare games are an equality case of the Coordination inequality.

There are some caveats to this interpretation, listed below:

While the “limiting cases” for alignment are equality cases of the C-A inequalities, they’re not a full characterization of the equality cases.

The Coordination inequality is an equality iff the action profile maximizes social welfare. The set of games for which all action profiles maximize welfare is precisely the constant-welfare games.

The Alignment inequality is an equality iff there exists a unique Pareto efficient payoff profile. This payoff profile must be optimal for each player, otherwise some preferred profile would also be Pareto efficient. This class of games is (superficially) much broader than the common-payoff games, but both have unique Pareto efficient Nash Equilibria which can be thought of as “max attainable utility”.

### Constructing the C-A Alignment Metric

Motivated by our framing of limiting cases with the C-A inequalities, we can construct a simple alignment metric using the alignment inequality. In particular, we define *misalignment* as the positive difference in the terms of the alignment inequality, then *alignment* as negative misalignment. Doing the algebra and letting $\alpha$ denote the alignment metric, we find the following:

$$\alpha = \max_{a} \sum_i u_i(a) \;-\; \sum_i \max_{a} u_i(a)$$

A few quick observations:

Note that $\alpha \le 0$, with equality cases identical to those of the Alignment inequality. Intuitively, $\alpha$ measures how much worse the real game is than the “ideal world” in which each player simultaneously achieves their max attainable utility.

We see that the positive term of $\alpha$ is just max attainable social welfare. This makes sense intuitively: we’d expect a group of aligned agents to achieve high social welfare, while we’d expect misaligned agents to fare worse.

The definition of $\alpha$ is sensitive to “small changes” in AU landscape; adding an “implausible” but hugely beneficial scenario (e.g. winning the lottery) can drastically change $\alpha$. We consider this a property of global alignment metrics in general: since we have no strategy profile by which to judge actions as “plausible”, the metric has no way of ignoring implausible scenarios.

We now perform the same analysis as with our example alignment metrics:

We see that $\alpha$ is consistent with limiting cases of alignment in the sense that $\alpha \le 0$, with the bound corresponding to the proper limiting case (equality for common-payoff games). Additionally, we see that $\alpha$ is consistent with affine transformations of $\vec{u}$. In fact, for finite games $\alpha$ is a *piecewise* affine function in $\vec{u}$, since the max terms provide a finite number of nondifferentiable points.

Considering the metric on our example games yields the following:

Matching Pennies has $\alpha = -2$ (with win/loss payoffs of ±1). Note that Matching Pennies is a zero-sum game, so $\alpha$ is “minimal” in the sense that all the lost social welfare comes from alignment issues as opposed to coordination ones.

Prisoners’ Dilemma has $\alpha = -2$ as well (coincidence? Yes; it’s a consequence of arbitrary choices of reward sizes in the game definitions). The difference can be thought of as the difference in reward between mutual cooperation and “magically, both players get to unilaterally defect on the other”.
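These values can be reproduced with a short sketch that takes the metric to be max attainable social welfare (sum of utilities) minus the sum of each player's max attainable utility, evaluated over pure action profiles. The payoff tables are assumptions (±1 for Matching Pennies, a Prisoners' Dilemma consistent with the [+1, −2] defection effect, and a made-up common-payoff coordination game), and `alignment_metric` is a hypothetical name.

```python
# Assumed payoff tables: action profile -> (u1, u2).
MATCHING_PENNIES = {
    ("H", "H"): (1, -1),
    ("H", "T"): (-1, 1),
    ("T", "H"): (-1, 1),
    ("T", "T"): (1, -1),
}
PRISONERS_DILEMMA = {
    ("C", "C"): (2, 2),
    ("C", "D"): (0, 3),
    ("D", "C"): (3, 0),
    ("D", "D"): (1, 1),
}
# A common-payoff game: both players always receive the same payoff.
COORDINATION = {
    ("A", "A"): (2, 2),
    ("A", "B"): (0, 0),
    ("B", "A"): (0, 0),
    ("B", "B"): (1, 1),
}


def alignment_metric(game):
    """Max attainable welfare minus the welfare of the 'ideal world'
    in which every player simultaneously gets their own best payoff."""
    n_players = len(next(iter(game)))
    max_welfare = max(sum(payoffs) for payoffs in game.values())
    ideal_welfare = sum(
        max(payoffs[i] for payoffs in game.values())
        for i in range(n_players)
    )
    return max_welfare - ideal_welfare
```

Under these assumptions both example games score −2, while the common-payoff game scores 0, the equality case of the Alignment inequality.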

As a final disclaimer, we don’t claim that $\alpha$ is the “one true alignment metric” and that our research question is solved. We think that the C-A inequalities are probably significant for the eventual POWER-scarcity application and that $\alpha$ illustrates this point nicely. We don’t mean to downplay our own research, but more investigation would be needed to pin down “The” alignment metric and relate it directly to POWER-scarcity.

## Connections to Broader Game Theory

There are a number of connections between the theory surrounding the C-A inequalities and game theory at large. We explore one such connection, bridging the divide between (Harsanyi) utilitarianism and ideas from Bargaining theory.

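To begin, we choose the natural strategy profile of maxmin, which we denote as $\underline{a}$, writing $\underline{u}_i$ for the utility it guarantees player $i$. Now, define the *surplus* of player $i$ to be

$$s_i(a) = u_i(a) - \underline{u}_i$$

A few quick observations, assuming $w$ is linear for convenience:

We trivially have $s_i(a) \le \max_{a'} u_i(a') - \underline{u}_i$, and thus $w(\vec{s}(a)) \le w\big(\max_{a'} u_1(a'), \ldots, \max_{a'} u_n(a')\big) - w(\underline{\vec{u}})$. By the Coordination Inequality, we have the stronger statement of $w(\vec{s}(a)) \le \max_{a'} w(\vec{u}(a')) - w(\underline{\vec{u}})$.

There isn’t a fundamental lower bound on $s_i$: players can (in theory) lose as badly as they want.

By definition of maxmin, if player $i$ plays a best response, then $s_i \ge 0$. Intuitively, we motivate using maxmin as the natural strategy profile by allowing the guarantee of nonnegative surplus.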

Analysis of surplus can be viewed through the lens of bargaining theory. The maxmin strategy profile is a natural choice of threat point, since it’s the optimal guaranteed outcome given no cooperation. Thus, players are “bargaining for surplus”, with threat point of each player receiving zero surplus.

Given the bargaining framework, we can consider the Nash bargaining solution maximizing the product of surplus. We see that the product being positive is equivalent to each player attaining positive surplus, which is equivalent to the bargaining solution being strictly better than the threat point for each player.
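The bargaining picture can be illustrated on a small example. The sketch below makes several simplifying assumptions that are not in the text: it considers only pure strategies (so the maxmin value is a pure-strategy security level), it searches for the Nash bargaining outcome only among pure action profiles, and it uses an assumed Prisoners' Dilemma payoff table. All function names are hypothetical.

```python
# Assumed payoff table: action profile -> (u1, u2).
PRISONERS_DILEMMA = {
    ("C", "C"): (2, 2),
    ("C", "D"): (0, 3),
    ("D", "C"): (3, 0),
    ("D", "D"): (1, 1),
}


def maxmin_value(game, i):
    """Pure-strategy security level of player i (0 or 1): the best payoff
    they can guarantee regardless of the other player's action."""
    own_actions = sorted({profile[i] for profile in game})
    other_actions = sorted({profile[1 - i] for profile in game})

    def payoff(mine, theirs):
        profile = (mine, theirs) if i == 0 else (theirs, mine)
        return game[profile][i]

    return max(
        min(payoff(mine, theirs) for theirs in other_actions)
        for mine in own_actions
    )


def surplus(game, profile):
    """Each player's utility minus their maxmin (threat point) value."""
    return tuple(game[profile][i] - maxmin_value(game, i) for i in range(2))


def nash_bargaining_profile(game):
    """Pure profile maximizing the product of surpluses, restricted to
    profiles where no player falls below their threat point."""
    feasible = [a for a in game if min(surplus(game, a)) >= 0]
    return max(feasible, key=lambda a: surplus(game, a)[0] * surplus(game, a)[1])
```

With these payoffs the threat point is mutual defection (each player guarantees 1), mutual cooperation gives each player a surplus of 1, and it is the only pure profile with a positive surplus product.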

Beyond this observation, we don’t know of direct connections between bargaining for surplus and maximizing social welfare. One promising area for further research stems from the observation that the Nash bargaining outcome is invariant under affine transformations of the component utilities. I suspect that the parallel between this invariance and Harsanyi utilitarianism adding an “arbitrary” affine transformation indicates a common principle that could shed further light on the C-A inequalities.

## Future research

While we’re excited about the framing of the C-A inequality, we consider it a landmark in mostly unexplored territory. For instance, we still can’t answer the following basic questions:

“what’s the exact alignment metric?”

“what’s the connection between individual incentives and social welfare?”

“where’s the line between cooperative notions of social welfare and competitive notions of bargaining equilibria?”

“how the heck does this all connect back to POWER-scarcity?”

These questions deserve answers, and we plan to continue exploring the landscape of game-theoretic alignment in search of them. On that note, feedback is welcome!

(Moderation note: added to the Alignment Forum from LessWrong.)

That moment when the AI takes a treacherous turn

because it wasn’t aligned up to affine transformations.