In my experience, constant-sum games are considered to provide “maximally unaligned” incentives, and common-payoff games are considered to provide “maximally aligned” incentives. How do we quantitatively interpolate between these two extremes? That is, given an arbitrary payoff table representing a two-player normal-form game (like Prisoner’s Dilemma), what extra information do we need in order to produce a real number quantifying agent alignment?

If this question is ill-posed, why is it ill-posed? And if it’s not, we should probably understand how to quantify such a basic aspect of multi-agent interactions, if we want to reason about complicated multi-agent situations whose outcomes determine the value of humanity’s future. (I started considering this question with Jacob Stavrianos over the last few months, while supervising his SERI project.)

Thoughts:

Assume the alignment function has range [0,1] or [-1,1].

Constant-sum games should have minimal alignment value, and common-payoff games should have maximal alignment value.

The function probably has to consider a strategy profile (since different parts of a normal-form game can have different incentives; see e.g. equilibrium selection).

The function should probably be a function of player A’s alignment *with* player B; for example, in a prisoner’s dilemma, player A might always cooperate and player B might always defect. Then it seems reasonable to consider whether A is *aligned with* B (in some sense), while B is not aligned with A (they pursue their own payoff without regard for A’s payoff). So the function need not be symmetric over players.

The function should be invariant to applying a separate positive affine transformation to each player’s payoffs; it shouldn’t matter whether you add 3 to player 1’s payoffs, or multiply the payoffs by a half.

~~The function may or may not rely only on the players’ orderings over outcome lotteries, ignoring the cardinal payoff values. I haven’t thought much about this point, but it seems important.~~ EDIT: I no longer think this point is important, but rather confused.

If I were interested in thinking about this more right now, I would:

Do some thought experiments to pin down the intuitive concept. Consider simple games where my “alignment” concept returns a clear verdict, and use these to derive functional constraints (like symmetry in players, or the range of the function, or the extreme cases).

See if I can get enough functional constraints to pin down a reasonable family of candidate solutions, or at least pin down the type signature.

Consider any finite two-player game in normal form (each player can have any finite number of strategies; we can also easily generalize to certain classes of infinite games). Let $S_A$ be the set of pure strategies of player A and $S_B$ the set of pure strategies of player B. Let $u_A : S_A \times S_B \to \mathbb{R}$ be the utility function of player A. Let $(\alpha,\beta) \in \Delta S_A \times \Delta S_B$ be a particular (mixed) outcome. Then the alignment of player B with player A in this outcome is defined to be:

$$a_{B/A}(\alpha,\beta) := \frac{\mathbb{E}_{\alpha\times\beta}[u_A] - \min_{\beta'\in S_B}\mathbb{E}_{\alpha\times\beta'}[u_A]}{\max_{\beta'\in S_B}\mathbb{E}_{\alpha\times\beta'}[u_A] - \min_{\beta'\in S_B}\mathbb{E}_{\alpha\times\beta'}[u_A]} \in [0,1]$$
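As a minimal sketch of how this definition could be evaluated numerically, the snippet below computes the coefficient for a payoff table given as nested lists. The function names, the Prisoner’s Dilemma payoff numbers, and the choice to return `None` when the denominator is zero are all assumptions made for illustration, not part of the definition above.

```python
from typing import Optional, Sequence

Matrix = Sequence[Sequence[float]]


def payoff_per_pure_response(u_a: Matrix, alpha: Sequence[float]) -> list:
    """A's expected payoff against alpha for each pure strategy of B."""
    return [sum(alpha[i] * u_a[i][j] for i in range(len(alpha)))
            for j in range(len(u_a[0]))]


def alignment_b_with_a(u_a: Matrix, alpha: Sequence[float],
                       beta: Sequence[float]) -> Optional[float]:
    """Alignment of B with A at the mixed outcome (alpha, beta)."""
    per_response = payoff_per_pure_response(u_a, alpha)
    worst, best = min(per_response), max(per_response)
    if best == worst:
        return None  # zero denominator: treated as undefined in this sketch
    # Expected payoff to A under alpha x beta (linear in beta).
    actual = sum(beta[j] * per_response[j] for j in range(len(beta)))
    return (actual - worst) / (best - worst)


# Hypothetical Prisoner's Dilemma payoffs for A
# (rows: A plays C, D; columns: B plays C, D).
U_A = [[3, 0], [5, 1]]
always_cooperate = [1.0, 0.0]
print(alignment_b_with_a(U_A, always_cooperate, [1.0, 0.0]))  # B cooperates
print(alignment_b_with_a(U_A, always_cooperate, [0.0, 1.0]))  # B defects
print(alignment_b_with_a(U_A, always_cooperate, [0.5, 0.5]))  # B mixes evenly
```

Because the expectation is linear in B’s strategy, taking the min and max over B’s pure strategies is enough to find the extremes.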

Ofc so far it doesn’t depend on $u_B$ at all. However, we can make it depend on $u_B$ if we use $u_B$ to impose assumptions on $(\alpha,\beta)$, such as:

$\beta$ is a $u_B$-best response to $\alpha$, or

$(\alpha,\beta)$ is a Nash equilibrium (or other solution concept)

Caveat: If we go with the Nash equilibrium option, $a_{B/A}$ can become “systematically” ill-defined (consider e.g. the Nash equilibrium of matching pennies). To avoid this, we can switch to the extensive-form game where B chooses their strategy after seeing A’s strategy.

In a sense, your proposal quantifies the extent to which B *selects a best response* on behalf of A, given some mixed outcome. I like this. I also think that “it doesn’t necessarily depend on $u_B$” is a feature, not a bug.

EDIT: To handle ~~common-~~constant-payoff games, we might want to define the alignment to equal 1 if the denominator is 0. In that case, the response of B can’t affect A’s expected utility, and so it’s not possible for B to act *against* A’s interests. So we might as well say that B is (trivially) aligned, given such a mixed outcome?

In common-payoff games the denominator is *not* zero, in general. For example, suppose that $S_A = S_B = \{a,b\}$, $u_A(a,a) = u_A(b,b) = 1$, $u_A(a,b) = u_A(b,a) = 0$, $u_B \equiv u_A$, $\alpha = \beta = \delta_a$. Then $a_{B/A}(\alpha,\beta) = 1$, as expected: current payoff is 1, if B played b it would be 0.

You’re right. Per Jonah Moss’s comment, I happened to be thinking of games where payoff is constant across players and outcomes, which is a very narrow kind of common-payoff (and constant-sum) game.
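Spelling out the arithmetic for the coordination example above may help. This is only a sketch that substitutes the stated payoffs into the definition of the alignment coefficient, with A playing a and B playing a:

```latex
\begin{aligned}
\mathbb{E}_{\delta_a\times\delta_a}[u_A] &= u_A(a,a) = 1,\\
\min_{\beta'\in S_B}\mathbb{E}_{\delta_a\times\beta'}[u_A] &= u_A(a,b) = 0,\\
\max_{\beta'\in S_B}\mathbb{E}_{\delta_a\times\beta'}[u_A] &= u_A(a,a) = 1,\\
a_{B/A}(\delta_a,\delta_a) &= \frac{1-0}{1-0} = 1,
\qquad
a_{B/A}(\delta_a,\delta_b) = \frac{0-0}{1-0} = 0.
\end{aligned}
```

The denominator is 1 here, so the coefficient is well-defined; it only vanishes when no response by B can change A’s expected payoff.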

I don’t think in this case $a_{B/A}$ should be defined to be 1. It seems perfectly justified to leave it undefined, since in such a game B can be equally well conceptualized as maximally aligned or as maximally anti-aligned. It *is* true that if, out of some set of objects, you consider the subset of those that have $a_{B/A} = 1$, then it’s natural to include the undefined cases too. But if, out of some set of objects, you consider the subset of those that have $a_{B/A} = 0$, then it’s *also* natural to include the undefined cases. This is similar to how $(0,0) \in \mathbb{R}^2$ is simultaneously in the closure of $\{xy = 1\}$ and in the closure of $\{xy = -1\}$, so $\frac{0}{0}$ can be considered to be either 1 or −1 (or any other number) depending on context.

This also suggests that “selfless” perfect B/A alignment is possible in zero-sum games, with the “maximal misalignment” only occurring if we assume B plays a best response. I think this is conceptually correct, and not something I had realized pre-theoretically.

✅ Pending unforeseen complications, I consider this answer to solve the open problem. It essentially formalizes B’s impact alignment. There might still be other interesting notions of alignment, but I think this is at least *an* important notion in the normal-form setting (and perhaps beyond).

I agree that this is measuring something of interest, but it doesn’t feel to me as if it solves the problem I thought you said you had.

This describes how well aligned an *individual action* by B is with A’s interests. (The action in question is B’s choice of (mixed) strategy β, when A has chosen (mixed) strategy α.) The number is 0 when B chooses the worst-for-A option available, 1 when B chooses the best-for-A option available, and in between scales in proportion to A’s expected utility.

But your original question was, on the face of it, looking for something that describes the effect on alignment of a *game* rather than one particular outcome, or perhaps the alignment of particular *agents* playing a particular *game*.

I think Vanessa’s proposal is the right answer to the question it’s answering, but the question it’s answering seems rather different from the one you seemed to be asking. It feels like a type error: *outcomes* can be “good”, “bad”, “favourable”, “unfavourable”, etc., but it’s things like *agents* and *incentives* that can be “aligned” or “unaligned”.

When we talk about some agent (e.g., a hypothetical superintelligent AI) being “aligned” to some extent with our values, it seems to me we don’t just mean whether or not, in a particular case, it acts in ways that suit us. What we want is that *in general*, over a wide range of possible situations, it will tend to act in ways that suit us. That seems like something this definition couldn’t give us, unless you take the “game” to be *the entirety of everything it does*, so that a “strategy” for the AI is simply *its entire program*, and then asking for this coefficient-of-alignment to be large is precisely the same thing as asking for the expected behaviour of the AI, across its whole existence, to produce high utility for us. Which, indeed, is what we want, but this formalism doesn’t seem to me to add anything we didn’t already have by saying “we want the AI’s behaviour to have high expected utility for us”.

It feels to me as if there’s more to be done in order to cash out e.g. your suggestion that constant-sum games are ill-aligned and common-payoff games are well-aligned. Maybe it’s enough to say that for these games, whatever strategy A picks, B’s payoff-maximizing strategy yields Kosoy coefficient 0 in the former case and 1 in the latter. That is, B’s incentives point in a direction that produces (un)favourable outcomes for A. The Kosoy coefficient quantifies the (un)favourableness of the outcomes; we want something on top of that to express the (mis)alignment of the incentives.

(To be clear, of course it may be that what you were intending to ask for is exactly what Vanessa provided, and you have every right to be interested in whatever questions you’re interested in. I’m just trying to explain why the question Vanessa answered doesn’t feel to me like the key question if you’re asking about how well aligned one agent is with another in a particular context.)
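The suggestion that B’s payoff-maximizing strategy yields coefficient 0 in constant-sum games and 1 in common-payoff games can be checked on a small case. The sketch below is a hypothetical illustration only: it uses matching pennies as the zero-sum game, reuses A’s table as B’s table for the common-payoff game, and breaks ties in B’s payoff by taking the first maximizer (all assumptions, not something specified in the thread).

```python
def payoffs_per_pure_response(u, alpha):
    """Expected payoff under table u[i][j] for each pure response j of B against alpha."""
    return [sum(alpha[i] * u[i][j] for i in range(len(alpha)))
            for j in range(len(u[0]))]


def coefficient_at_best_response(u_a, u_b, alpha):
    """Alignment of B with A when B plays a payoff-maximizing pure response to alpha."""
    a_vals = payoffs_per_pure_response(u_a, alpha)
    b_vals = payoffs_per_pure_response(u_b, alpha)
    worst, best = min(a_vals), max(a_vals)
    if best == worst:
        return None  # B cannot affect A's payoff: coefficient left undefined
    j = max(range(len(b_vals)), key=lambda k: b_vals[k])  # first maximizer on ties
    return (a_vals[j] - worst) / (best - worst)


PENNIES_A = [[1, -1], [-1, 1]]
PENNIES_B = [[-1, 1], [1, -1]]  # zero-sum: B's payoff is the negative of A's
alpha = [0.75, 0.25]

print(coefficient_at_best_response(PENNIES_A, PENNIES_B, alpha))  # zero-sum game
print(coefficient_at_best_response(PENNIES_A, PENNIES_A, alpha))  # common-payoff game
print(coefficient_at_best_response(PENNIES_A, PENNIES_B, [0.5, 0.5]))  # undefined case
```

The last call uses A’s equilibrium mix in matching pennies, where every response by B gives A the same expected payoff, so the coefficient is undefined there.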

So, something like “fraction of preferred states shared”? Describe the preferred states for P1 as the cells in the payoff matrix that are best for P1 for each P2 action (and the preferred states for P2 in a similar manner). The fraction of P1’s preferred states that are also preferred for P2 is a measure of the alignment of P1 to P2. The fraction of states shared between the players out of the total number of preferred states is a measure of the total alignment of the game.

For a 2x2 game each player will have 2 preferred states (corresponding to the 2 possible actions of the opponent). If 1 of them is the same cell, that means each player is 50% aligned to the other (1 of 2 shared) and the game in total is 33% aligned (1 of 3). This also generalizes easily to the NxN case and to >2 players.

And if there are K cells with the same payoff to choose from for some opponent action, we can give 1/K to each of them instead of 1.

(it would be much easier to explain with a picture and/or table, but I’m pretty new here and wasn’t able to find how to do them here yet)
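In place of a picture, here is a rough sketch of the “preferred states” rule in code. The function names and the example payoffs are hypothetical, and two details are assumptions where the description leaves room: the 1/K tie weights are applied only to the player-to-player measure, while the whole-game measure simply counts cells.

```python
from fractions import Fraction


def preferred_weights(game, player):
    """Map each preferred cell of `player` (0 = row, 1 = column) to its weight.

    A cell is preferred if it is best for the player given one action of the
    opponent; K tied cells each receive weight 1/K.
    """
    n_rows, n_cols = len(game), len(game[0])
    if player == 0:
        # Row player compares rows within each column chosen by the opponent.
        lines = [[(i, j) for i in range(n_rows)] for j in range(n_cols)]
    else:
        # Column player compares columns within each row chosen by the opponent.
        lines = [[(i, j) for j in range(n_cols)] for i in range(n_rows)]
    weights = {}
    for cells in lines:
        best = max(game[i][j][player] for i, j in cells)
        tied = [(i, j) for i, j in cells if game[i][j][player] == best]
        for cell in tied:
            weights[cell] = Fraction(1, len(tied))
    return weights


def alignment(game, player, other):
    """Weighted fraction of `player`'s preferred cells that `other` also prefers."""
    mine = preferred_weights(game, player)
    theirs = preferred_weights(game, other)
    shared = sum(w for cell, w in mine.items() if cell in theirs)
    return shared / sum(mine.values())


def game_alignment(game):
    """Shared preferred cells over all preferred cells (tie weights ignored)."""
    p1 = set(preferred_weights(game, 0))
    p2 = set(preferred_weights(game, 1))
    return Fraction(len(p1 & p2), len(p1 | p2))


# Each cell is (payoff to row player, payoff to column player); illustrative numbers.
GAME = [[(1, 1), (0, 0)],
        [(0, 0), (0.8, -1)]]
print(alignment(GAME, 0, 1), alignment(GAME, 1, 0), game_alignment(GAME))
```

For this table one of each player’s two preferred cells is shared, which gives the “1 of 2” and “1 of 3” figures described above.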

Does agency matter? There are 21 x 21 x 4 possible payoff matrixes for a 2x2 game if we use Ordinal payoffs. For the vast majority of them (all but about 7 x 7 x 4 of them), one or both players can make a decision without knowing or caring what the other player’s payoffs are, and get the best possible result. Of the remaining 182 arrangements, 55 have exactly one box where both players get their #1 payoff (and, therefore, will easily select that as the equilibrium).

All the interesting choices happen in the other 128ish arrangements, 6/7 of which have the pattern of the preferred (1st and 1st, or 1st and 2nd) options being on a diagonal. The most interesting one (for the player picking the row, and getting the first payoff) is:

1 / (2, 3, or 4) ; 4 / (any)

2 / (any) ; 3 / (any)

The optimal strategy for any interesting layout will be a mixed strategy, with the % split dependent on the relative Cardinal payoffs (which are generally not calculable since they include Reputation and other non-quantifiable effects).

Therefore, you would want to weight the quality of any particular result by the chance of that result being achieved (which also works for the degenerate cases where one box gets 100% of the results, or two perfectly equivalent boxes share that)

I like this answer, and I’m going to take more time to chew on it.

**1** / **1** || 0 / 0

0 / **0** || **0.8** / -1

I have put the preferred state for each player in bold. I think by your rule this works out to 50% aligned. However, the Nash equilibrium is both players choosing the 1/1 result, which seems perfectly aligned (intuitively).

1 / 0.5 || 0 / 0

0 / 0 || 0.5 / 1

In this game, all preferred states are shared, yet there is a Nash equilibrium where each player plays the move that can get them 1 point 2/3 of the time, and the other move 1/3 of the time. I think it would be incorrect to call this 100% aligned.

(These examples were not obvious to me, and tracking them down helped me appreciate the question more. Thank you.)
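The mixed equilibrium in the second game can be recovered from the usual indifference conditions. The following is a small sketch under the assumption that a fully mixed equilibrium exists; the function name and the row/column indexing are choices made here for illustration.

```python
from fractions import Fraction as F


def mixed_equilibrium_2x2(u_row, u_col):
    """Fully mixed equilibrium of a 2x2 game via the indifference conditions.

    Returns (p, q): p is the probability the row player puts on row 0 and q the
    probability the column player puts on column 0. Assumes an interior
    equilibrium exists (nonzero denominators, results strictly between 0 and 1).
    """
    # p makes the column player indifferent between the two columns.
    p = (u_col[1][1] - u_col[1][0]) / (
        u_col[0][0] - u_col[0][1] - u_col[1][0] + u_col[1][1])
    # q makes the row player indifferent between the two rows.
    q = (u_row[1][1] - u_row[0][1]) / (
        u_row[0][0] - u_row[0][1] - u_row[1][0] + u_row[1][1])
    return p, q


# Payoffs from the second table: cell "x / y" gives x to the row player, y to the column player.
U_ROW = [[F(1), F(0)], [F(0), F(1, 2)]]
U_COL = [[F(1, 2), F(0)], [F(0), F(1)]]

p, q = mixed_equilibrium_2x2(U_ROW, U_COL)
row_value = q * U_ROW[0][0] + (1 - q) * U_ROW[0][1]   # row player's expected payoff
col_value = p * U_COL[0][0] + (1 - p) * U_COL[1][0]   # column player's expected payoff
miscoordination = p * (1 - q) + (1 - p) * q           # probability of landing on a 0 / 0 cell
print(p, q, row_value, col_value, miscoordination)
```

Under these payoffs the row player puts 2/3 on the top row and the column player puts 2/3 on the right column, so each favours the move that could give them 1 point, and the players end up on a 0 / 0 cell more than half the time.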

Thanks for the careful analysis. I must confess that my metric does not consider stochastic strategies, and in general works better if the players’ actions are taken sequentially, not simultaneously (which is quite different from the classic description).

The reasoning is that, under maximal alignment, for each action of P1 there exists exactly one action of P2 (and vice versa) that is a Nash equilibrium. In this case the game stops in a stable state after a single pair of actions. And a maximally unaligned game will have no Nash equilibrium at all, meaning the players’ actions and reactions will just move around the matrix in a closed loop.

Overall, my solution as is doesn’t seem suited to the classical formulation of the game :) but thanks for considering it!

I think this is backward. The game’s payout matrix *determines* the alignment. Fixed-sum games imply (in the mathematical sense) unaligned players, and common-payoff games ARE the definition of alignment.

When you start looking at meta-games (where resource payoffs differ from utility payoffs, based on agent goals), then “alignment” starts to make sense as a distinct measurement: it’s how much the players’ utility functions transform the payoffs (in the sub-games of a series, and in the overall game) from fixed-sum to common-payoff.

I don’t follow. How can fixed-sum games mathematically imply unaligned players, without a formal metric of alignment between the players?

Also, the payout matrix need not determine the alignment, since each player could have a different policy from strategy profiles to responses, which in principle doesn’t have to select a best response. For example, imagine playing stag hunt with someone who responds ‘hare’ to stag/stag; this isn’t a best response for them, but it minimizes your payoff. However, another partner could respond ‘stag’ to stag/stag, which (I think) makes them “less unaligned” with you than the partner who responds ‘hare’ to stag/stag.

Payout correlation IS the metric of alignment. A player who isn’t trying to maximize their (utility) payout is actually not playing the game you’ve defined. You’re simply incorrect (or describing a different payout matrix than you state) that a player doesn’t “have to select a best response”.

I think “X and Y are playing a game of stag hunt” has multiple meanings.

The meaning generally assumed in game theory when considering just a single game is that the outcomes in the game matrix are utilities. In that case, I completely agree with Dagon: if on some occasion you prefer to pick “hare” even though you know I will pick “stag”, then we are not actually playing the stag hunt game. (Because part of what it *means* to be playing stag hunt rather than some other game is that we both consider (stag,stag) the best outcome.)

But there are some other situations that might be described by saying that X and Y are playing stag hunt.

Maybe we are playing an iterated stag hunt. Then (by definition) what I care about is still some sort of aggregation of per-round outcomes, and (by definition) each round’s outcome still has (stag,stag) best for me, etc. -- but now I need to strategize over the whole course of the game, and e.g. maybe I think that on a particular occasion choosing “hare” when you chose “stag” will make you understand that you’re being punished for a previous choice of “hare” and make you more likely to choose “stag” in future.

Or maybe we’re playing an *iterated* iterated stag hunt. Now maybe I choose “hare” when you chose “stag”, knowing that it will make things worse for me over subsequent rounds, but hoping that *other people* looking at our interactions will learn the rule Don’t Fuck With Gareth and never, ever choose anything other than “stag” when playing with me.

Or maybe we’re playing a game in which the stag hunt matrix describes some sort of payouts that are not exactly utilities. E.g., we’re in a psychology experiment and the experimenter has shown us a 2x2 table telling us how many *dollars* we will get in various cases, but maybe I’m a billionaire and literally don’t care whether I get $1 or $10 and figure I might as well try to maximize *your* payout, or maybe you’re a perfect altruist and (in the absence of any knowledge about our financial situations) you just want to maximize the total take, or maybe I’m actually evil and want you to do as badly as possible.

In the iterated cases, it seems to me that the payout matrix still determines alignment given the iteration context: how many games, with what opponents, with what aggregation of per-round utilities to yield overall utility (in prospect or in retrospect; the former may involve temporal discounting too). If I don’t consider a long string of (stag,stag) games optimal then, again, we are not really playing (iterated) stag hunt.

In the payouts-aren’t-really-utilities case, I think it *does* make sense to ask about the players’ alignment, in terms of how they translate payouts into utilities. But … it feels to me as if this is now basically separate from the actual game itself: the thing we might want to map to a measure of alignedness is something like the function from (both players’ payouts) to (both players’ utilities). The choice of game may then affect how far unaligned players imply unaligned actions, though. (In a game with (cooperate,defect) options where “cooperate” is always much better for the player choosing it than “defect”, the payouts->utilities function would need to be badly anti-aligned, with players actively preferring to harm one another, in order to get uncooperative actions; in a prisoners’ dilemma, it suffices that it not be *strongly* aligned; each player can slightly prefer the other to do better but still choose defection.)

Thanks for the thoughtful response.

It seems to me like you’re assuming that players must respond rationally, or else they’re playing a different game, in some sense. But why? The stag hunt game is defined by a certain set of payoff inequalities holding in the game. Both players can consider (stag,stag) the best outcome, but that doesn’t mean they *have to play* stag against (stag, stag). That requires further rationality assumptions (which I don’t think are necessary in this case).

If I’m playing against someone who always defects against cooperate/cooperate, versus against someone who always cooperates against cooperate/cooperate, am I “not playing iterated PD” in one of those cases?

I’m not 100% sure I am understanding your terminology. What does it mean to “play stag against (stag,stag)” or to “defect against cooperate/cooperate”?

If your opponent is not in any sense a utility-maximizer then I don’t think it makes sense to talk about your opponent’s utilities, which means that it doesn’t make sense to have a payout matrix denominated in utility, which means that we are not in the situation of my second paragraph above (“The meaning generally assumed in game theory...”).

We might be in the situation of my last-but-two paragraph (“Or maybe we’re playing a game in which...”): the payouts might be something other than utilities. Dollars, perhaps, or just numbers written on a piece of paper. In that case, all the things I said about that situation apply here. In particular, I agree that it’s then reasonable to ask “how aligned is B with A’s interests?”, but I think this question is largely decoupled from the specific game and is more about the mapping from (A’s payout, B’s payout) to (A’s utility, B’s utility).

I guess there are cases where that isn’t enough, where A’s and/or B’s utility is not a function of the payouts alone. Maybe A just likes saying the word “defect”. Maybe B likes to be seen as the sort of person who cooperates. Etc. But at this point it feels to me as if we’ve left behind most of the simplicity and elegance that we might have hoped to bring by adopting the “two-player game in normal form” formalism in the first place, and if you’re prepared to consider scenarios where A just likes choosing the top-left cell in a 2x2 array then you also need to consider ones like the ones I described earlier in this paragraph, where in fact it’s *not* just the 2x2 payout matrix that matters but potentially any arbitrary details about what words are used when playing the game, or who is watching, or anything else. So if you’re trying to get to the essence of alignment by considering simple 2x2 games, I think it would be best to leave that sort of thing out of it, and in that case my feeling is that your options are (a) to treat the payouts as actual utilities (in which case, once again, I agree with Dagon and think all the alignment information is in the payout matrix), or (b) to treat them as mere utility-function-fodder, but to assume that they’re *all* the fodder the utility functions get (in which case, as above, I think *none* of the alignment information is in the payout matrix and it’s all in the payouts-to-utilities mapping), or (c) to consider some sort of iterated-game setup (in which case, I think you need to nail down *what* sort of iterated-game setup before asking how to get a measure of alignment out of it).

Let $\pi_i(\sigma) = \sigma'_i$ be player i’s response function to strategy profile $\sigma$. Given some strategy profile (like stag/stag), player i selects a response. I mean “response” in terms of “best response”; I don’t necessarily mean that there’s an iterated game. This captures all the relevant “outside details” for how decisions are made.

I don’t think I understand where this viewpoint is coming from. I’m not equating payoffs with VNM-utility, and I don’t think game theory usually does either; for example, the maxmin payoff solution concept does not involve VNM-rational expected utility maximization. I just identify payoffs with “how good is this outcome for the player”, without also demanding that $\pi_i$ always select a best response. Maybe it’s Boltzmann rational, or maybe it just always selects certain actions (regardless of their expected payouts).
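To make the “Boltzmann rational” option concrete, here is one common way such a response could be written: actions are chosen with probability proportional to the exponential of their payoff divided by a temperature. This is a sketch only; the function name, the temperature parameter, and the stag hunt payoff numbers are assumptions for illustration.

```python
import math


def boltzmann_response(expected_payoffs, temperature):
    """Distribution over own actions, with P(action) proportional to exp(payoff / T)."""
    top = max(expected_payoffs)  # subtract the max for numerical stability
    weights = [math.exp((v - top) / temperature) for v in expected_payoffs]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical stag hunt payoffs for the responder when the other player hunts stag.
PAYOFFS_VS_STAG = [4.0, 3.0]  # [respond stag, respond hare]

print(boltzmann_response(PAYOFFS_VS_STAG, 0.01))    # low temperature: close to a best response
print(boltzmann_response(PAYOFFS_VS_STAG, 1.0))     # moderate temperature
print(boltzmann_response(PAYOFFS_VS_STAG, 1000.0))  # high temperature: close to uniform
```

The payoff numbers stay the same in all three calls; only how reliably the player acts on them changes, which is the distinction between having payoffs and selecting a best response.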

There exist two payoff functions. I think I want to know how impact-aligned one player is with another: how do the player’s actual actions affect the other player (in terms of their numerical payoff values). I think (c) is closest to what I’m considering, but in terms of response functions—not actual iterated games.

Sorry, I’m guessing this probably still isn’t clear, but this is the reply I have time to type right now and I figured I’d send it rather than nothing.

Sorry, I think I wasn’t clear about what I don’t understand. What is a “strategy profile (like stag/stag)”? So far as I can tell, the usual meaning of “strategy profile” is the same as that of “strategy”, and a strategy in a one-shot game of stag hunt looks like “stag” or “hare”, or maybe “70% stag, 30% hare”; I don’t understand what “stag/stag” means here.

----

It is absolutely standard in game theory to equate payoffs with utilities. That doesn’t mean that you have to do the same, of course, but I’m sure that’s why Dagon said what he did and it’s why when I was enumerating possible interpretations that was the first one I mentioned.

(The next several paragraphs are just giving some evidence for this; I had a look on my shelves and described what I found. Most detail is given for the one book that’s specifically about formalized 2-player game theory.)

“Two-Person Game Theory” by Rapoport, which happens to be the only book dedicated to this topic I have on my shelves, says this at the start of chapter 2 (titled “Utilities”):

Unfortunately, Rapoport is using the word “payoffs” to mean two different things here. I think it’s entirely clear from context, though, that his actual meaning is: you may begin by specifying monetary payoffs, but what we care about for game theory is payoffs as utilities. Here’s more from a little later in the chapter:

A bit later:

and:

As I say, that’s the only book of formal game theory on my shelves. Schelling’s *Strategy of Conflict* has a little to say about such games, but not much and not in much detail, but it looks to me as if he assumes payoffs are utilities. The following sentence is informative, though it presupposes rather than stating: “But what configuration of value systems for the two participants—of the “payoffs”, in the language of game theory—makes a deterrent threat credible?” (This is from the chapter entitled “International Strategy”; in my copy it’s on page 13.)

Rapoport’s “Strategy and Conscience” isn’t a book of formal game theory, but it does discuss the topic, and it explicitly says: payoffs are utilities.

One chapter in Schelling’s “Choice and Consequence” is concerned with this sort of game theory; he says that the numbers you put in the matrix are *either* arbitrary things whose relative ordering is the only thing that matters, *or* numbers that behave like utilities in the sense that the players are trying to maximize their expectations.

The Wikipedia article on game theory says: “The payoffs of the game are generally taken to represent the utility of individual players.” (This is in the section about the use of game theory in economics and business. It does also mention applications in evolutionary biology, where the payoffs are fitnesses, which seem to me very closely analogous to utilities, in that what the evolutionary process stochastically maximizes is something like expected fitness.)

Again, I don’t claim that you *have* to equate payoffs with utilities; you can apply the formalism of game theory in any way you please! But I don’t think there’s any question that this is the *usual* way in which payoffs in a game matrix are understood.

----

It feels odd to me to focus on response functions, since as a matter of fact you never actually know the other player’s strategy. (Aside from special cases where your opponent is sufficiently deterministic and sufficiently simple that you can “read their source code” and make reliable predictions from it. There’s a bit of an LW tradition of thinking in those terms, but I think that with the possible exception of reasoning along the lines of “X is an *exact copy* of me and will therefore make the same decisions as I do” it’s basically never going to be relevant to real decision-making agents because the usual case is that the other player is about as complicated as you are, and you don’t have enough brainpower to understand your own brain completely.)

If you are *not* considering payouts to be utilities, then you need to note that knowing the other player’s payouts (which is a crucial part of playing this sort of game) doesn’t tell you anything until you also know how those payouts correspond to utilities, or to whatever else the other player might use to guide their decision-making.

(If you *aren’t* considering that they’re utilities but *are* assuming that higher is better, then for many purposes that’s enough. But, again, only if you suppose that the other player does actually act as someone would act who prefers higher payouts to lower ones.)

My feeling is that you will get most insight by adopting (what I claim to be) the standard perspective where payoffs are utilities; then, if you want to try to measure alignment, the payoff matrix is the input for your calculation. Obviously this won’t work if one or both players behave in a way not describable by any utility function, but my suspicion is that in such cases you shouldn’t necessarily expect there to be any sort of meaningful measure of how aligned the players are.

Quote: Or maybe we’re playing a game in which the stag hunt matrix describes some sort of payouts that are not exactly utilities. E.g., we’re in a psychology experiment and the experimenter has shown us a 2x2 table telling us how many *dollars* we will get in various cases, but maybe I’m a billionaire and literally don’t care whether I get $1 or $10 and figure I might as well try to maximize *your* payout, or maybe you’re a perfect altruist and (in the absence of any knowledge about our financial situations) you just want to maximize the total take, or maybe I’m actually evil and want you to do as badly as possible.

So, if the other player is “always cooperate” or “always defect” or any other method of determining results that doesn’t correspond to the payouts in the matrix shown to you, then you aren’t playing “prisoner’s dilemma” because the utilities to player B are *not* dependent on what you do. In all these games, you should pick your strategy based on how you expect your counterparty to act, which might or might not include the “in game” incentives as influencers of their behavior.

Here is the definition of a normal-form game:

You are playing prisoner’s dilemma when certain payoff inequalities are satisfied in the normal-form representation. That’s it. There is no canonical assumption that players are expected utility maximizers, or expected payoff maximizers.

Noting that I don’t follow what you mean by this: do you mean to say that player B’s response cannot be a constant function of strategy profiles (ie the response function cannot be constant everywhere)?

Um… the definition of the normal form game you cited explicitly says that the payoffs are in the form of cardinal or ordinal utilities. Which is distinct from in-game payouts.

Also, too, it sounds like you agree that the strategy your counterparty uses can make a normal form game not count as a “stag hunt” or “prisoner’s dilemma” or “dating game”

No. In that article, the only spot where ‘utility’ appears is *identifying* utility with the player’s payoffs/payouts. (EDIT: but perhaps I don’t get what you mean by ‘in-game payouts’?)

To reiterate: I’m not talking about VNM-utility, derived by taking a preference ordering-over-lotteries and backing out a coherent utility function. I’m talking about the players having payoff functions which cardinally represent the value of different outcomes. We can call the value-units “squiggles”, or “utilons”, or “payouts”; the OP’s question remains.

No, I don’t agree with that.

Do you have a citation? You seem to believe that this is common knowledge among game theorists, but I don’t think I’ve ever encountered that.

Jacob and I have already considered payout correlation, and I agree that it has some desirable properties. However,

it’s symmetric across players,

it’s invariant to player rationality

which matters, since alignment seems to not just be a function of incentives, but of what-actually-happens and how that affects different players

it equally weights each outcome in the normal-form game, ignoring relevant local dynamics. For example, what if part of the game table is zero-sum, and part is common-payoff? Correlation then can be controlled by zero-sum outcomes which are strictly dominated for all players. For example:

1 / 1 || 2 / 2

-.5 / .5 || 1 / 1

and so I don’t think it’s a slam-dunk solution. At the very least, it would require significant support.
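To see how much a single dominated cell moves payout correlation in the example table above, one can compute the Pearson correlation of the two players’ payoffs over the cells, with and without the -.5 / .5 cell. This is a minimal sketch; weighting every cell equally and the function name are assumptions made for illustration.

```python
import math


def payoff_correlation(cells):
    """Pearson correlation between the two players' payoffs, weighting cells equally.

    Raises ZeroDivisionError if either player's payoff is constant across cells.
    """
    a = [x for x, _ in cells]
    b = [y for _, y in cells]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in cells)
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Cells of the example table as (first player's payoff, second player's payoff).
TABLE = [(1, 1), (2, 2), (-0.5, 0.5), (1, 1)]

with_dominated = payoff_correlation(TABLE)
without_dominated = payoff_correlation([c for c in TABLE if c != (-0.5, 0.5)])
print(round(with_dominated, 4), round(without_dominated, 4))
```

The remaining three cells are common-payoff, so the correlation over them is 1 up to rounding; including the one dominated cell pulls it down to roughly 0.93.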

Why? I suppose it’s common to assume (a kind of local) rationality for each player, but I’m not interested in assuming that here. It may be easier to analyze the best-response case as a first start, though.

It’s a definitional thing. The *definition* of utility is “the thing people maximize.” If you set up your 2x2 game to have *utilities* in the payout matrix, then by definition both actors will attempt to pick the box with the biggest number. If you set up your 2x2 game with direct *payouts* from the game that don’t include psychic (eg “I just like picking the first option given”) or reputational effects, then any concept of alignment is one of:

assume the players are trying for the biggest number, how much will they be attempting to land on the same box?

alignment is completely outside of the game, and is one of the features of the function that converts game payouts to global utility

You seem to be muddling those two, and wondering “how much will people attempt to land on the same box, taking into account all factors, but only defining the boxes in terms of game payouts.” The answer there is “you can’t.” Because people (and computer programs) have wonky screwed up utility functions (eg (spoiler alert) https://en.wikipedia.org/wiki/Man_of_the_Year_(2006_film))

Only applicable if you’re assuming the players are VNM-rational over outcome lotteries, which I’m not. Forget expected utility maximization.

It seems to me that people are making the question more complicated than it has to be, by projecting their assumptions about what a “game” is. We have payoff numbers describing how “good” each outcome is to each player. We have the strategy spaces, and the possible outcomes of the game. And here’s one approach: fix two response functions in this game, which are functions from strategy profiles to the player’s response strategy. With respect to the payoffs, how “aligned” are these response functions with each other?

This doesn’t make restrictive rationality assumptions. It doesn’t require getting into strange utility assumptions. Most importantly, it’s a clearly-defined question whose answer is both important and not conceptually obvious to me.

(And now that I think of it, I suppose that depending on your response functions, even in zero-sum games, you could have “A aligned with B”, or “B aligned with A”, but not both.)

Then what’s the definition / interpretation of “payoff”, i.e. the numbers you put in the matrix? If they’re not utilities, are they preferences? How can they be preferences if agents can “choose” not to follow them? Where do the numbers come from?

Note that Vanessa’s answer doesn’t need to depend on uB, which I think is its main strength and the reason it makes intuitive sense. (And I like the answer much less when uB is used to impose constraints.)

I think I’ve been unclear in my own terminology, in part because I’m uncertain about what other people have meant by ‘utility’ (what you’d recover from perfect IRL / Savage’s theorem, or cardinal representation of preferences over outcomes?) My stance is that they’re utilities but that I’m not assuming the players are playing best responses in order to maximize expected utility.

Am I allowed to have preferences without knowing how to maximize those preferences, or while being irrational at times? Boltzmann-rational agents have preferences, don’t they? These debates have surprised me; I didn’t think that others tied together “has preferences” and “acts rationally with respect to those preferences.”

There’s a difference between “the agent sometimes makes mistakes in getting what it wants” and “the agent does the literal opposite of what it wants”; in the latter case you have to wonder what the word “wants” even means any more.

My understanding is that you want to include cases like “it’s a fixed-sum game, but agent B decides to be maximally aligned / cooperative and do whatever maximizes A’s utility”, and in that case I start to question what exactly B’s utility function meant in the first place.

I’m told that Minimal Rationality addresses this sort of position, where you allow the agent to make mistakes, but don’t allow it to be e.g. literally pessimal since at that point you have lost the meaning of the word “preference”.

(I kind of also want to take the more radical position where when talking about abstract agents the only meaning of preferences is “revealed preferences”, and then in the special case of humans we also see this totally different thing of “stated preferences” that operates at some totally different layer of abstraction and where talking about “making mistakes in achieving your preferences” makes sense in a way that it does not for revealed preferences. But I don’t think you need to take this position to object to the way it sounds like you’re using the term here.)

Tabooing “aligned”, what property are you trying to map on a scale of “constant sum” to “common payoff”?

Good question. I don’t have a crisp answer (part of why this is an open question), but I’ll try a few responses:

To what degree do player 1’s actions further the interests of player 2 within this normal form game, and vice versa?

This version requires specific response functions.

To what degree do the interests of players 1 and 2 coincide within a normal form game?

This feels more like correlation of the payout functions, represented as vectors.

So, given this payoff matrix (where P1 picks a row and gets the first payout, P2 picks column and gets 2nd payout):

5 / 0 ; 5 / 100

0 / 100 ; 0 / 1

Would you say P1′s action furthers the interest of player 2?

Would P2′s action further the interest of player 1?

Where would you rank this game on the 0 − 1 scale?

Hm. At first glance this feels like a “1” game to me, if they both use the “take the strictly dominant action” solution concept. The alignment changes if they make decisions differently, but under the standard rationality assumptions, it feels like a perfectly aligned game.
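A minimal sketch of how that outcome can be derived, assuming pure strategies only and using a hypothetical helper `iterated_elimination`. Reading the table literally, only P1 has a strictly dominant row; P2's better column depends on the row, so the sketch uses iterated elimination of strictly dominated strategies rather than a single round of dominance.

```python
def iterated_elimination(payoffs):
    """payoffs[i][j] = (P1 payoff, P2 payoff); P1 picks row i, P2 picks column j.

    Repeatedly deletes pure strategies that are strictly dominated by another
    surviving pure strategy, and returns the surviving (rows, cols).
    """
    rows = list(range(len(payoffs)))
    cols = list(range(len(payoffs[0])))
    changed = True
    while changed:
        changed = False
        for r in list(rows):
            # Row r is dominated if some other row is strictly better in every column.
            if any(all(payoffs[o][c][0] > payoffs[r][c][0] for c in cols)
                   for o in rows if o != r):
                rows.remove(r)
                changed = True
        for c in list(cols):
            # Column c is dominated if some other column is strictly better in every row.
            if any(all(payoffs[r][o][1] > payoffs[r][c][1] for r in rows)
                   for o in cols if o != c):
                cols.remove(c)
                changed = True
    return rows, cols

game = [[(5, 0), (5, 100)],
        [(0, 100), (0, 1)]]
rows, cols = iterated_elimination(game)
outcome = game[rows[0]][cols[0]]
print(rows, cols, outcome)
```

Under these assumptions the surviving profile is the top-right cell, where each player receives the largest payoff available to them anywhere in the table.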

Correlation between outcomes, not within them. If both players prefer to be in the same box, they are aligned. As we add indifference and opposing choices, they become unaligned. In your example, both people have the exact same ordering of outcomes. In a classic PD, there is some mix. Totally unaligned (constant value) example:

0 / 2 || 2 / 0

2 / 0 || 0 / 2

The usual Pearson correlation in particular is also insensitive to positive affine transformations of either player’s utility, so it seems to be about the right thing; it doesn’t just try to check whether the incomparable utility values are equal.
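The invariance claim can be checked in one line. A sketch, writing $X$ and $Y$ for the two players' payoffs over outcomes (under whatever weighting of outcomes is assumed) and using hypothetical rescaling constants $\alpha, \gamma > 0$ and shifts $\beta, \delta$:

$$
\rho(\alpha X + \beta,\ \gamma Y + \delta)
= \frac{\operatorname{Cov}(\alpha X + \beta,\ \gamma Y + \delta)}{\sigma_{\alpha X + \beta}\,\sigma_{\gamma Y + \delta}}
= \frac{\alpha\gamma\,\operatorname{Cov}(X, Y)}{\alpha\,\sigma_X \cdot \gamma\,\sigma_Y}
= \rho(X, Y).
$$

Positivity of $\alpha$ and $\gamma$ matters: a negative scale factor would flip the sign of the correlation.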

Another point you could fix using intuition would be complete disinterest. It makes sense to put it at 0 on the [-1, 1] interval.

Assuming rational utility maximizers, a board that results in a disinterested agent would be:

1 / 0 || 1 / 1

0 / 0 || 0 / 1

Then each agent cannot influence the rewards of the other, so it makes sense to say that they are not aligned.

More generally, if arbitrary changes to one player’s payoffs have no effect on the behaviour of the other player, then the other player is disinterested.

Correlation between player payouts? In a zero sum game it is −1, when payouts are perfectly aligned it is +1, if payouts are independent it is 0.

I agree that this is a good start, but I find it unsatisfactory.

I’ll take a shot at this. Let $A$ and $B$ be the sets of actions of Alice and Bob. Let $o_n: B \to \{1, \dots, n\}$ (where ‘n’ means ‘nice’) be a function that orders $B$ by how good the choices are for Alice, assuming that Alice gets to choose second. Similarly, let $o_s: B \to \{1, \dots, n\}$ (where ‘s’ means ‘selfish’) be the function that orders $B$ by how good the choices are for Bob, assuming that Alice gets to choose second. Choose some function $\psi$ measuring similarity between two orderings of a finite set (it should range over $[-1, 1]$); the alignment of Bob with Alice is then $\psi(o_n, o_s)$.

Example: in the prisoner’s dilemma, $B = \{c, d\}$, and $o_n$ orders $c > d$ whereas $o_s$ orders $d > c$. Hence $\psi(o_n, o_s)$ should be $-1$, i.e., Bob is maximally unaligned with Alice. Note that this makes it different from Mykhailo’s answer which gives alignment 0.5, i.e., medium aligned rather than maximally unaligned.

This seems like an improvement over correlation since it’s not symmetrical. In the game where Alice and Bob both get to choose numbers $x, y \in \{1, 2\}$ and Alice’s utility function outputs $y + x$ whereas Bob’s outputs $y - x$, Bob would be perfectly aligned with Alice (his $o_n$ and $o_s$ both order $2 > 1$) but Alice perfectly unaligned with Bob (her $o_n$ orders $1 > 2$ but her $o_s$ orders $2 > 1$).
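The two examples above can be reproduced with a short sketch. It assumes Kendall's tau as the similarity function (the proposal leaves the choice open), assumes the second mover best-responds with ties broken by list order, and uses hypothetical names (`alignment`, `kendall_tau`) and hypothetical prisoner's dilemma payoffs.

```python
from itertools import combinations

def sign(x):
    return (x > 0) - (x < 0)

def kendall_tau(score_a, score_b):
    """Kendall tau-a between two score dicts over the same keys (tied pairs count 0)."""
    pairs = list(combinations(score_a, 2))
    total = sum(sign(score_a[x] - score_a[y]) * sign(score_b[x] - score_b[y])
                for x, y in pairs)
    return total / len(pairs)

def alignment(first_actions, second_actions, u_first, u_second):
    """Alignment of the first mover with the second mover.

    u_first(f, s) and u_second(f, s) are the payoffs when the first mover
    plays f and the second mover plays s.
    """
    nice, selfish = {}, {}
    for f in first_actions:
        # Assumption: the second mover best-responds; ties go to the earlier action.
        reply = max(second_actions, key=lambda s: u_second(f, s))
        nice[f] = u_second(f, reply)    # how good f is for the second mover
        selfish[f] = u_first(f, reply)  # how good f is for the first mover
    return kendall_tau(nice, selfish)

# Prisoner's dilemma, keyed (Bob's action, Alice's action) -> (Bob payoff, Alice payoff).
pd = {("c", "c"): (2, 2), ("c", "d"): (0, 3), ("d", "c"): (3, 0), ("d", "d"): (1, 1)}
bob_with_alice_pd = alignment(["c", "d"], ["c", "d"],
                              lambda b, a: pd[(b, a)][0], lambda b, a: pd[(b, a)][1])

# Number game: Alice picks x and gets y + x; Bob picks y and gets y - x.
bob_with_alice = alignment([1, 2], [1, 2], lambda y, x: y - x, lambda y, x: y + x)
alice_with_bob = alignment([1, 2], [1, 2], lambda x, y: y + x, lambda x, y: y - x)

print(bob_with_alice_pd, bob_with_alice, alice_with_bob)
```

With only two actions, Kendall's tau can only return −1, 0, or 1, so a finer-grained similarity function only starts to matter in larger action sets.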

I believe this metric meets criteria 1,3,4 you listed. It could be changed to be sensitive to players’ decision theories by changing $o_s$ (for alignment from Bob to Alice) to be the order output by Bob’s decision theory, but I think that would be a mistake. Suppose I build an AI that is more powerful than myself, and the game is such that we can both decide to steal some of the other’s stuff. If the AI does this, it leads to −10 utils for me and +2 for it (otherwise 0 / 0); if I do it, it leads to −100 utils for me because the AI kills me in response (otherwise 0 / 0). This game is trivial: the AI will take my stuff and I’ll do nothing. Also, the AI is maximally unaligned with me. Now suppose I become as powerful as the AI and my ‘take AI’s stuff’ becomes −10 for AI, +2 for me. This makes the game a prisoner’s dilemma. If we both run UDT or FDT, we would now cooperate. If $o_s$ is the ordering of the AI’s decision theory, this would mean the AI is now aligned with me, which is odd since the only thing that changed is me getting more powerful. With the original proposal, the AI is still maximally unaligned with me. More abstractly, game theory assumes your actions have influence on the other player’s rewards (else the game is trivial), so if you cooperate for game-theoretical reasons, this doesn’t seem to capture what we mean by alignment.

Alright, here comes a pretty detailed proposal! The idea is to find out if the sum of expected utility for both players is “small” or “large” using the appropriate normalizers.

First, let’s define some quantities. (I’m not overly familiar with game theory, and my notation and terminology are probably non-standard. Please correct me if that’s the case!)

A. The payoff matrix for player 1.

B. The payoff matrix for player 2.

s,r the mixed strategies for players 1 and 2. These are probability vectors, i.e., vectors of non-negative numbers summing to 1.

Then the expected payoff for player 1 is the bilinear form $s^T A r = \sum_{i,j} s_i a_{ij} r_j$ and the expected payoff for player 2 is $s^T B r = \sum_{i,j} s_i b_{ij} r_j$. The sum of payoffs is $s^T (A+B) r$.

But we’re not done defining stuff yet. I interpret alignment to be about welfare, or how large the sum of utilities is when compared to the best-case scenario and the worst-case scenario. To make an alignment coefficient out of this idea, we will need

$u(A,B)$. The upper bound to the sum of payoffs in the counterfactual situation where the payoff to player 1 is not affected by the actions of player 2, and vice versa. Then $u(A,B) = \max_{u,v} u^T A v + \max_{u,v} u^T B v$. Now we find that $u(A,B) = \max A + \max B$.

Now define the alignment coefficient of the strategies (s,r) in the game defined by the payoff matrices (A,B) as

$$a = \frac{s^T (A+B) r - l(A,B)}{u(A,B) - l(A,B)}.$$

The intuition is that alignment quantifies how the expected payoff sum $s^T (A+B) r$ compares to the best possible payoff sum $u(A,B)$ attainable when the payoffs are independent. If they are equal, we have perfect alignment ($a = 1$). On the other hand, if $s^T (A+B) r = l(A,B)$, the expected payoff sum is as bad as it could possibly be, and we have minimal alignment ($a = 0$).

The only problem is that $u(A,B) = l(A,B)$ makes the denominator equal to 0; but in this case, $u(A,B) = s^T (A+B) r$ as well, which I believe means that defining $a = 1$ is correct. (It’s also true that $l(A,B) = s^T (A+B) r$, but I don’t think this matters too much. The players get the best possible outcome no matter how they play, which deserves $a = 1$.) This is an extreme edge case, as it only holds for the special payoff matrices $A$ ($B$) that contain the same element $a$ ($b$) in every cell.

Let’s look at some properties:

A pure coordination game has at least one maximal alignment equilibrium, i.e., a(s,r)=1 for some s,r. All of these are necessarily Nash equilibria.

A zero-sum game (that isn’t game-theoretically equivalent to the 0 matrix) has $a = 0$ for every pair of strategies $(s,r)$. This is because $s^T A r + s^T B r = l(A,B) = 0$ for every $s, r$. The total payoff is always the worst possible.

The alignment coefficient is linear in a specific sense, i.e., $a(A,B) = a(aJ + dA,\ bJ + dB)$, where $J$ is the matrix consisting of only 1s.
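The last property can be checked directly. A sketch, assuming $d > 0$ and assuming $l(A,B) = \min_{i,j} (A+B)_{ij}$, which is the lower bound that the worked Prisoner’s dilemma example appears to use (its definition is not spelled out above). Write $A' = aJ + dA$ and $B' = bJ + dB$, and note that $s^T J r = 1$ because $s$ and $r$ are probability vectors:

$$
\begin{aligned}
s^T (A' + B')\, r &= (a + b)\, s^T J r + d\, s^T (A + B)\, r = (a + b) + d\, s^T (A + B)\, r, \\
u(A', B') &= (a + d \max A) + (b + d \max B) = (a + b) + d\, u(A, B), \\
l(A', B') &= (a + b) + d\, l(A, B).
\end{aligned}
$$

The shift $(a + b)$ cancels in both the numerator and the denominator of the coefficient, and the common factor $d$ then divides out, leaving the coefficient unchanged. Note that both payoff matrices are scaled by the same $d$ here; the argument does not cover separate scale factors for each player.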

Now let’s take a look at a variant of the Prisoner’s dilemma with joint payoff matrix

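$$P = \begin{bmatrix} (2,2) & (0,3) \\ (3,0) & (1,1) \end{bmatrix}$$

Then

$$A = \begin{bmatrix} 2 & 0 \\ 3 & 1 \end{bmatrix}, \quad B = A^T, \quad A + B = A + A^T = \begin{bmatrix} 4 & 3 \\ 3 & 2 \end{bmatrix}.$$

The alignment coefficient at $(s,r)$ is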

$$\frac{s^T (A + A^T)\, r - 2}{6 - 2} = \frac{1}{4}\Big(4 s_1 r_1 + 3 s_1 (1 - r_1) + 3 r_1 (1 - s_1) + 2 (1 - s_1)(1 - r_1) - 2\Big) = \frac{s_1 + r_1}{4}$$

Assuming pure strategies, we find the following matrix of alignment, where $a_{ij}$ is the alignment when player 1 plays $i$ with certainty and player 2 plays $j$ with certainty.

$$a = \begin{bmatrix} 1/2 & 1/4 \\ 1/4 & 0 \end{bmatrix}$$

Since $s = r = (0,1)$ is the only Nash equilibrium, the “alignment at rationality” is 0. By taking convex combinations, the range of alignment coefficients is $[0, 1/2]$.
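The worked example can be reproduced numerically. This is a minimal sketch with a hypothetical function name; it assumes the lower bound $l(A,B)$ is the smallest entry of $A + B$, which matches the value 2 used in the example but is an assumption, since the text above does not spell out the definition.

```python
def alignment_coefficient(A, B, s, r):
    """Alignment of mixed strategies s (player 1) and r (player 2) for payoff matrices A, B."""
    rows, cols = range(len(A)), range(len(A[0]))
    total = [[A[i][j] + B[i][j] for j in cols] for i in rows]
    expected_sum = sum(s[i] * total[i][j] * r[j] for i in rows for j in cols)
    upper = max(max(row) for row in A) + max(max(row) for row in B)  # u(A, B)
    lower = min(min(row) for row in total)  # assumed definition of l(A, B)
    if upper == lower:
        return 1.0  # edge case: every cell gives the same payoffs
    return (expected_sum - lower) / (upper - lower)

# Prisoner's dilemma variant from the text.
A = [[2, 0], [3, 1]]
B = [[2, 3], [0, 1]]  # transpose of A
pure = [(1, 0), (0, 1)]
table = [[alignment_coefficient(A, B, s, r) for r in pure] for s in pure]

# Matching pennies as a zero-sum check.
Z = [[1, -1], [-1, 1]]
negZ = [[-1, 1], [1, -1]]
zero_sum = alignment_coefficient(Z, negZ, (0.5, 0.5), (0.3, 0.7))

print(table, zero_sum)
```

Swapping in other payoff matrices or mixed strategies is a quick way to probe how the coefficient behaves away from the pure-strategy corners.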

Some further comments:

Any general alignment coefficient probably has to be a function of (s,r), as we need to allow them to vary when doing game theory.

Specialized coefficients would only report the alignment at Nash equilibria, maybe the maximal Nash equilibrium.

One may report the maximal alignment without caring about equilibrium points, but then the strategies do not have to be in equilibrium, which I am uneasy with. The maximal alignment for the Prisoner’s dilemma is 1/2, but does this matter? Not if we want to quantify the tendency for rational actors to maximize their total utility, at least.

Using e.g. the correlation between the payoffs is not a good idea, as it implicitly assumes the uniform distribution on s,r. And why would you do that?

I like how this proposal makes explicit the player strategies, and how they are incorporated into the calculation. I also think that the edge case where the agents’ actions have no effect on the result

I think that this proposal making alignment symmetric might be undesirable. Taking the prisoner’s dilemma as an example, if s = always cooperate and r = always defect, then I would say s is perfectly aligned with r, and r is not at all aligned with s.

The result of 0 alignment for the Nash equilibrium of PD seems correct.

I think this should be the alignment matrix for pure-strategy, single-shot PD:

$$a = \begin{bmatrix} (1,1) & (1,0) \\ (0,1) & (0,0) \end{bmatrix}$$

Here the first of each ordered pair represents A’s alignment with B (assuming we use the [0,1] interval).

I think in this case the alignments are simple, because A can choose to either maximize or to minimize B’s utility.

I believe the upper right-hand corner of a shouldn’t be 1; even if both players are acting in each other’s best interest, they are not acting in their own best interest. And alignment is about having both at the same time. The configuration of Prisoner’s dilemma makes it impossible to make both players maximally satisfied at the same time, so I believe it cannot have maximal alignment for any strategy.

Anyhow, your concept of alignment might involve altruism only, which is fair enough. In that case, Vanessa Kosoy has a similar proposal to mine, but not working with sums, which probably does exactly what you are looking for.

Getting alignment in the upper right-hand corner in the Prisoner’s dilemma matrix to be 1 may be possible if we redefine $u(A,B)$ to $u(A,B) = \max_{u,v} u^T (A+B) v$, the best attainable payoff sum. But then zero-sum games will have maximal instead of minimal alignment! (This is one reason why I defined $u(A,B) = \max_{u,v} u^T A v + \max_{u,v} u^T B v$.)

(Btw, the coefficient isn’t symmetric; it’s only symmetric for symmetric games. No alignment coefficient depending on the strategies can be symmetric, as the vectors can have different lengths.)

Quick sketch of an idea (written before deeply digesting others’ proposals):

Intuition: Just like player 1 has a best response (starting from a strategy profile s, improve her own utility as much as possible), she also has an altruistic best response (which maximally improves the other player’s utility).

Example: stag hunt. If we’re at (rabbit, rabbit), then both players are perfectly aligned. Even if player 1 was infinitely altruistic, she can’t unilaterally cause a better outcome for player 2.

Definition: given a strategy profile $s$, an $a$-altruistic better response is any strategy of one player that gives the other player at least $a$ extra utility for each point of utility that this player sacrifices.

Definition: player 1 is $a$-aligned with player 2 if player 1 doesn’t have an $x$-altruistic better response for any $x > a$.

0-aligned: non-spiteful player. They’ll give “free” utility to other players if possible, but they won’t sacrifice any amount of their own utility for the sake of others.

c-aligned for c∈(0,1): slightly altruistic. Your happiness matters a little bit to them, but not as much as their own.

1-aligned: positive-sum maximizer. They’ll yield their own utility as long as the total sum of utility increases.

c-aligned for c∈(1,∞): subservient player: They’ll optimize your utility with higher priority than their own.

∞-aligned: slave. They maximize others’ utility, completely disregarding their own.

Obvious extension from players to strategy profiles: How altruistic would a player need to be before they would switch strategies?

On re-reading this I messed up something with the direction of the signs. Don’t have time to fix it now, but the idea is hopefully clear.
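One way to make the “how altruistic would a player need to be before they would switch” idea concrete is sketched below. It fixes one direction for the signs, which is an assumption on my part: the player is modeled as maximizing their own utility plus a weight $c$ times the other player's utility, and the function returns the smallest $c$ at which some unilateral deviation that helps the other player becomes weakly attractive. The function name and both payoff tables are hypothetical.

```python
import math

def altruism_threshold(actions, u_self, u_other, own, other):
    """Smallest weight c on the other player's utility at which a unilateral
    deviation from `own` is weakly attractive for a player maximizing
    u_self + c * u_other. Returns math.inf if no deviation helps the other player.

    u_self(me, you) and u_other(me, you) take the deviating player's action first.
    """
    best = math.inf
    for dev in actions:
        gain_other = u_other(dev, other) - u_other(own, other)
        loss_self = u_self(own, other) - u_self(dev, other)
        if gain_other > 0:
            # A free (or self-improving) favour counts as threshold 0.
            best = min(best, max(loss_self, 0) / gain_other)
    return best

# Illustrative stag hunt, keyed (player 1, player 2) -> (payoff 1, payoff 2).
sh = {("stag", "stag"): (4, 4), ("stag", "rabbit"): (0, 3),
      ("rabbit", "stag"): (3, 0), ("rabbit", "rabbit"): (3, 3)}
stag_hunt = altruism_threshold(["stag", "rabbit"],
                               lambda me, you: sh[(me, you)][0],
                               lambda me, you: sh[(me, you)][1],
                               "rabbit", "rabbit")

# Illustrative prisoner's dilemma at mutual defection.
pd = {("c", "c"): (2, 2), ("c", "d"): (0, 3), ("d", "c"): (3, 0), ("d", "d"): (1, 1)}
pd_defect = altruism_threshold(["c", "d"],
                               lambda me, you: pd[(me, you)][0],
                               lambda me, you: pd[(me, you)][1],
                               "d", "d")

print(stag_hunt, pd_defect)
```

An infinite threshold means that even an arbitrarily altruistic player would not switch, which matches the stag hunt example at (rabbit, rabbit); at mutual defection in this prisoner's dilemma, a player who weights the other's utility at one half or more would switch to cooperating.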