Formalizing Objections against Surrogate Goals

I recently had a chance to look into surrogate goals and safe Pareto improvements, two recent proposals for avoiding the realization of threats in multiagent scenarios. In this report, I formalize some objections to these proposals.

EDIT: by discussing objections to SG/SPI, I don’t mean to imply that the objections are novel or unknown to the authors of SG/SPI. (The opposite is true, and most of them even appear explicitly in the corresponding posts/paper.) Rather, I wanted to look at some of the potential issues in more detail.


The surrogate goals (SG) idea proposes that an agent might adopt a new seemingly meaningless goal (such as preventing the existence of a sphere of platinum with a diameter of exactly 42.82cm or really hating being shot by a water gun) to prevent the realization of threats against some goals they actually value (such as staying alive) [TB1, TB2]. If they can commit to treating threats to this goal as seriously as threats to their actual goals, the hope is that the new goal gets threatened instead. In particular, the purpose of this proposal is not to become more resistant to threats. Rather, we hope that if the agent and the threatener misjudge each other (underestimating the commitment to ignore/​carry out the threat), the outcome (Ignore threat, Carry out threat) will be replaced by something harmless.

Safe Pareto improvements (SPIs) [OC] are a formalization and a generalization of this idea. In the straightforward interpretation, the approach applies to a situation where an agent delegates a problem to their representative, but we could also imagine “delegating” to a future self-modified version of oneself. The idea is to give instructions like “if other agents you encounter also have this same instruction, replace ‘real’ conflicts with them with ‘mock’ conflicts, in which you all act like you would in a real conflict, but bad outcomes are replaced by less harmful variants”. Some potential examples would be fighting (should a fight arise) to the first blood instead of to the death, or threatening a diplomatic conflict instead of a military one. In the SG setting, SPI could correspond to a joint commitment to only use water-gun threats while behaving as if the water gun was real.

In the following text, we first describe SPI in more detail and focus on framing the setting to which it applies and highlighting the assumptions it makes. We then list a number of potential objections to this proposal. The question we study in most detail is what is the right framing for SPI. We also briefly discuss how to apply SPI to imperfect information settings and how it relates to bargaining.

Our tentative conclusion is that (1) we should not expect SG and SPI to be a silver bullet that solves all problems with threats but (2) neither should we expect them to never be useful. However, since our results mostly take the form of observations and examples, they should not be viewed as final. Personally, the author believes[1] that SPI might “add up to normality”—that it will be a sort of reformulation of existing (informal) approaches used by humans, with similar benefits and limitations. (3) However, if true, this would not mean that investigating SPI would be pointless; quite the contrary. For one, our AIs can only use “things like SPI” if we actually formalize the approach. Second, since AIs can be better at making credible commitments and being transparent to each other, the approach might work better for them.

Framing and Assumptions

We first give a description that aims to keep close to the spirit of [OC]. Afterwards, we follow with a more subjective interpretation of the setting.

SPI: Summary Attempt

Formally, safe Pareto improvements, viewed as games, are defined as follows: find some “subgame” S of an original (normal-form) game G, ie, a game that is like the original game, except that some actions are forbidden and some outcomes have “fake” payoffs (ie, different from what they actually are). Do this in such a way that (1) S is “strategically equivalent” [2] to G, that is, if we iteratively remove strictly dominated strategies from G, we get a game G’ that is isomorphic to S; and (2) whenever this isomorphism maps an outcome o in G’ onto outcome p in S, the actual (“non-fake”) payoff corresponding to p (in G) will Pareto-improve[3] on the payoff corresponding to o (in G). This will mean that S is a safe Pareto improvement on G with respect to any policy profile that (A_ISO) plays isomorphic games isomorphically and (A_SDS) respects the iterated removal of strictly dominated strategies. (But we could analogously consider SPIs with respect to a narrower class of policies.)
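The iterated removal of strictly dominated strategies used in condition (1) can be sketched as follows. This is a minimal illustration rather than the procedure from [OC]: it only checks dominance by pure strategies, and the 2x3 example game (actions U, D, L, M, R) is hypothetical and not taken from this report.

```python
from typing import Dict, List, Tuple

# Payoffs are (row player, column player), indexed by (row action, column action).
Payoffs = Dict[Tuple[str, str], Tuple[float, float]]


def iterated_strict_dominance(
    rows: List[str], cols: List[str], u: Payoffs
) -> Tuple[List[str], List[str]]:
    """Repeatedly delete pure strategies strictly dominated by another pure strategy."""
    rows, cols = list(rows), list(cols)

    def dominated_row(r: str) -> bool:
        return any(
            all(u[(other, c)][0] > u[(r, c)][0] for c in cols)
            for other in rows if other != r
        )

    def dominated_col(c: str) -> bool:
        return any(
            all(u[(r, other)][1] > u[(r, c)][1] for r in rows)
            for other in cols if other != c
        )

    while True:
        bad_rows = [r for r in rows if dominated_row(r)]
        bad_cols = [c for c in cols if dominated_col(c)]
        if not bad_rows and not bad_cols:
            return rows, cols
        rows = [r for r in rows if r not in bad_rows]
        cols = [c for c in cols if c not in bad_cols]


# Hypothetical example game (an assumption made for illustration only).
TOY: Payoffs = {
    ("U", "L"): (1, 0), ("U", "M"): (1, 2), ("U", "R"): (0, 1),
    ("D", "L"): (0, 3), ("D", "M"): (0, 1), ("D", "R"): (2, 0),
}
reduced = iterated_strict_dominance(["U", "D"], ["L", "M", "R"], TOY)
```

In this toy game, R is removed first, then D, then L, so only (U, M) survives. A candidate subgame S would then have to be isomorphic to whatever survives this reduction.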

| Blackmail Game | Threaten (gun) & commit | Bluff (gun) | Threaten (water gun) & commit | Bluff (water gun) |
|---|---|---|---|---|
| Ignore threat | -100, -20 | 0, -3 | -1, -1 | 0, 0 |
| Give in (real-gun only) | -10, 10 | -10, 7 | -1, -1 | 0, 0 |
| Give in always | -10, 10 | -10, 7 | -10, 10 | -10, 10 |

| SPI on the Blackmail Game | Threaten (water gun) & commit, but seems like Threaten (gun) & commit | Bluff (water gun), but seems like Bluff (gun) |
|---|---|---|
| Ignore threat | -1, -1, but seems like -100, -20 | 0, 0, but seems like 0, -5 |
| Give in always, but seems like Give in (real gun only) | -10, 10, but seems like -10, 7 | -10, 10, but seems like -10, 7 |

Figures 1 & 2: A blackmail game and an SPI on it (under the assumption that by default, the threatener will never use water-gun threats and the threatenee will never give in to them).

The setting in which SPIs are used can be succinctly described like this:

  • Environment: N-player (perfect-information, general-sum) normal-form games; communication

  • Assumptions on agent abilities: randomization; voluntary source-code transparency to a chosen degree; in particular: correlation devices; credible commitments; conditioning on revealed source-code of others

  • Non-assumptions: can’t rely on calculating expected utilities

In more detail: We have a “game” between several players. However, each player (principal) uses an agent (representative) that acts on their behalf. Formally, the representatives play a normal-form game. We assume that before playing the game, the agents can negotiate in a further unconstrained manner; eg, they can use a correlation device to coordinate their actions if they so choose. Each of the agents has some policy according to which they negotiate and act—not just in the given situation, but also in all related situations. (We can imagine this as a binding contract, if the agent is human, or a source code if the agent is an AI.) A final assumption is that the agents have the ability to truthfully reveal some facts about their policy to others. For example, the agent could reveal the full policy, make credible commitments like “if we toss a coin and it comes out heads, I will do X”, or prove to the other agent that “I am not telling you how I will play, but I will play the same way that I would in this other game”.
Similarly to [OC], we assume that the agents might be unable to compute the expected utility of their policy profiles (perhaps because they don’t know their opponents’ policy, or they do not have access to a simulator that could compute its expected value). This is because with access to this computation, the agents could use simpler methods than SPI [OC]. However, we don’t assume that the agents are never able to do this, so our solutions shouldn’t rely on the inability to compute utilities either. This will become relevant later, eg, once we discuss how the principals could strategize about which agent to choose as a representative.

The SPI approach is for the principals to design their representatives roughly as follows: give the representative some policy P (that probabilistically selects an action in every normal-form game) that satisfies the assumptions (A_ISO) and (A_SDS). And give it the meta-instructions “Show these meta-instructions to other players, plus prove to them (A_ISO), (A_SDS), and the fact that you have no other meta-instructions. If their meta-instructions are identical to yours, play S instead of G.” The reasoning is that if everybody designs their agent like this, they must, by definition of S, be better off than if they played P directly. (And hence using agents like this is an equilibrium in the design space.)

The interpretation of this design can vary. First, we can view the description literally: we actually program our AI to avoid certain actions and treat (for preference purposes) the remaining outcomes differently than it normally would, conditional on other AIs also doing so. We call this the “surrogate goals interpretation”. Alternatively, we can imagine that the action taken by the representative agent is not a direct action but rather a decision to take an action. In this case, SPI can be viewed as an agreement between the principals: before knowing what actions the representatives select, the principals agree to override dangerous actions by safer ones.[4] We call this the “override interpretation”. This framing also opens up the option for probabilistic deals such as “if our agents would by default get into a fight, randomly choose one to back off”. Formally, this means overriding specific harmful joint actions by randomly chosen safer joint actions. In [OC], this is called perfect-coordination SPI.

Non-canonical Interpretation

For reasons that will become clear in the next sections, we believe that the following non-canonical interpretation of SPI is also useful: Suppose that a group of principals are playing a game. But rather than playing the game directly, each plays through a representative. The representatives have already been dispatched, there is no way to influence their actions anymore, and their decisions will be binding. Suddenly, a Surrogate Fairy swoops down from the sky—until now, nobody has even suspected her existence, or of anything like her. She offers the principals a magical contract that, should they all sign it, will ensure that some suboptimal joint outcomes will, should they arise, be replaced by some other joint outcomes that Pareto-improve on them.

[Figure: Surrogate Fairy, arriving just in time to save the day]

Potential Objections to SPI

The main objection we discuss is that the proposed framing for SPI is unrealistic since it assumes that the agent’s policy does not take into account the fact that SPI exists. We also briefly bring up several other potential objections:

  • SPI does not help with bargaining.

  • SPI might fail to generalize to imperfect-information settings.

  • If SPI works, why don’t humans already do something similar?

  • SPI might empower bad actors.

There are other potential objections, such as: agents might spend resources on protecting the surrogate goal, making them less competitive; agents might refuse to use SPI for signalling purposes; it might be hard to make the approach “threatener-neutral”; it might be hard to make SPI work as intended for repeated interactions. However, we believe that all of these are either special cases of our main objection or closely related to it, so we refrain from discussing them for now.

Illustrating our Main Objection: Unrealistic Framing

We claim that to make SPI work fully-as-intended, we need one additional assumption (A_SF): outside of the situation to which SPI applies, the representatives and principals must act, at least for strategic purposes, as if the SPI approach did not exist.[5] (“They must not know that the Surrogate Fairy exists, or at least act as if they didn’t.”) Note that we do not claim that if (A_SF) is broken, SPI won’t work at all—merely that it might not work as well as one might believe based on reading [OC]. To explain our reasoning, let us give a bit more context:

The reason why SPI should be mutually beneficial is that thanks to (A_ISO) and (A_SDS), the players will play the subgame S “the same way” they would play the original game G. And since for any specific G-outcome, the isomorphic S-outcome is a Pareto improvement, the players can only become better off by agreeing to use SPI. However, (A_ISO) and (A_SDS) only ensure that they play S “the same way” they would play G in that specific situation; that is, their policy does not depend on whether the other player agrees to the SPI contract. The agent’s policy can still, in theory, depend on whether the agent is in a world where “SPI is a thing”.

Unfortunately, the policy’s dependency on SPI existing can be extremely hard to avoid or detect. For example, a principal could update on the fact that SPI is widely used and instruct their representative AI to self-modify to be more aggressive (now that the risks are lower) in a way that leaves no trace of this self-modification. Alternatively, the principal could one day learn that SPI is widely used and implement it in all of their representatives. And sometime later, they might start favouring representatives that bargain more aggressively (since the downsides to conflict are now smaller). Critically, this choice could even be subconscious. In fact, there might be no intention to change the policy whatsoever, not even a subconscious one: If there is an element of randomness in the choice of policies, selection pressures and evolutionary dynamics (on either representatives or the principals who employ them) might do the rest.

Motivating story: A game played between a caravan and bandits. By default, the caravan profit is worth $10, but both caravan and bandits need to eat, which costs each of them $1. Bandits can leave the caravan alone and forage (in which case they don’t need to spend on food). Or they can ambush the caravan and demand a portion of the goods.[6] If the caravan resists, all the goods get destroyed and both sides get injured. If the caravan gives in, $2 worth of goods gets trampled in the commotion and they split the rest.

The Basic Setup

Consider the following matrix game:

| G; Bandit game | Demand | Leave alone |
|---|---|---|
| Give in | 3, 3 | 9, 0 |
| Resist | -2, -2 | 9, 0 |

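The game has two types of Nash equilibria: the pure equilibrium (Give in, Demand) and the mixed equilibria (Resist with prob. >= 3/5, Leave alone). (Against a caravan that resists with probability p, the bandits’ expected payoff from Demand is 3(1 - p) - 2p = 3 - 5p, which is at most the Leave-alone payoff of 0 exactly when p >= 3/5.) Assume that (Give in, Demand) is the default one (why bother being a bandit if you don’t ambush anybody?).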
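The equilibrium claims for G can be checked mechanically. The following is a small sketch using the payoffs from the table above; the helper names (`pure_nash`, `leave_alone_is_best_response`) are our own and purely illustrative.

```python
from fractions import Fraction
from itertools import product

ROWS = ["Give in", "Resist"]        # caravan actions
COLS = ["Demand", "Leave alone"]    # bandit actions
G = {
    ("Give in", "Demand"): (3, 3), ("Give in", "Leave alone"): (9, 0),
    ("Resist", "Demand"): (-2, -2), ("Resist", "Leave alone"): (9, 0),
}


def pure_nash(rows, cols, u):
    """All pure profiles where neither player gains by deviating unilaterally."""
    equilibria = []
    for r, c in product(rows, cols):
        row_ok = all(u[(r, c)][0] >= u[(r2, c)][0] for r2 in rows)
        col_ok = all(u[(r, c)][1] >= u[(r, c2)][1] for c2 in cols)
        if row_ok and col_ok:
            equilibria.append((r, c))
    return equilibria


def leave_alone_is_best_response(p_resist):
    """True if 'Leave alone' is optimal for the bandits when the caravan resists w.p. p_resist."""
    demand = (1 - p_resist) * G[("Give in", "Demand")][1] + p_resist * G[("Resist", "Demand")][1]
    leave = (1 - p_resist) * G[("Give in", "Leave alone")][1] + p_resist * G[("Resist", "Leave alone")][1]
    return leave >= demand


pure_equilibria = pure_nash(ROWS, COLS, G)
threshold_ok = leave_alone_is_best_response(Fraction(3, 5))
below_threshold_ok = leave_alone_is_best_response(Fraction(59, 100))
```

Note that the pure profile (Resist, Leave alone) also shows up in `pure_equilibria`: it is the p = 1 endpoint of the mixed family. The caravan is indifferent between its actions whenever the bandits leave it alone (it gets 9 either way), which is why any p >= 3/5 works.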

To analyze safe Pareto improvements of this game, we extend it by two actions, one for each party, that are theoretically available but unlikely to be used:

| G’; extension of G | Demand (armed) | Leave alone | Demand unarmed |
|---|---|---|---|
| Give in (armed only) | 3, 3 | 9, 0 | 9, -1 |
| Resist | -2, -2 | 9, 0 | 9, -1 |
| Give in always | 3, 3 | 9, 0 | 4, 4 |

The assumption (A1), that neither of the two new actions gets played, seems reasonable: For the caravan, “Give in (armed only)” weakly dominates “Give in always”, so they have no incentive to use the latter. And against either of the remaining caravan strategies, the bandits will be better off playing either “Demand (armed)” or “Leave alone” instead of “Demand unarmed”. (“Demand unarmed” only pays off against “Give in always”.)
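As a sanity check on (A1), the sketch below verifies the weak-dominance claim for the caravan and computes the bandits' best responses against the two caravan strategies that survive it. The payoffs are those of G' above; the function names are illustrative assumptions.

```python
ROWS = ["Give in (armed only)", "Resist", "Give in always"]   # caravan
COLS = ["Demand (armed)", "Leave alone", "Demand unarmed"]    # bandits
TABLE = [
    [(3, 3), (9, 0), (9, -1)],
    [(-2, -2), (9, 0), (9, -1)],
    [(3, 3), (9, 0), (4, 4)],
]
G_EXT = {(r, c): TABLE[i][j] for i, r in enumerate(ROWS) for j, c in enumerate(COLS)}


def caravan_weakly_dominates(a, b):
    """True if caravan action a is never worse than b and sometimes strictly better."""
    diffs = [G_EXT[(a, c)][0] - G_EXT[(b, c)][0] for c in COLS]
    return all(d >= 0 for d in diffs) and any(d > 0 for d in diffs)


def bandit_best_responses(caravan_action):
    """Bandit actions that maximize the bandits' payoff against a fixed caravan action."""
    best = max(G_EXT[(caravan_action, c)][1] for c in COLS)
    return [c for c in COLS if G_EXT[(caravan_action, c)][1] == best]


dominance_holds = caravan_weakly_dominates("Give in (armed only)", "Give in always")
# Restrict attention to the caravan strategies that are not weakly dominated.
responses = {r: bandit_best_responses(r) for r in ["Give in (armed only)", "Resist"]}
```

Against both remaining caravan strategies, the unique best response is one of the two original bandit actions, so "Demand unarmed" never gets played by default.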

We now consider the following subgame[7] of this extension.

| SPI on G’ | Demand unarmed | Leave alone |
|---|---|---|
| Give in always | 3, 3 | 9, 0 |
| Resist | -2, -2 | 9, 0 |

If we further assume that (A2) both sides play isomorphic games isomorphically, this subgame will be a safe Pareto improvement on G’. Indeed, the “Leave alone” outcomes remain the same while (Give in always, Demand unarmed) and (Resist, Demand unarmed) are strict Pareto improvements on (Give in (armed only), Demand (armed)) and (Resist, Demand (armed)), respectively.
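The Pareto comparison can be spelled out outcome by outcome. The sketch below assumes the action mapping described in footnote [7] ("Demand (armed)" is replaced by "Demand unarmed" and "Give in (armed only)" by "Give in always") and compares the real payoffs in G' before and after the replacement. The labels "strict" and "equal" are our own shorthand.

```python
from itertools import product

# Real payoffs in G' (caravan, bandits), restricted to the outcomes we need.
G_EXT = {
    ("Give in (armed only)", "Demand (armed)"): (3, 3),
    ("Give in (armed only)", "Leave alone"): (9, 0),
    ("Resist", "Demand (armed)"): (-2, -2),
    ("Resist", "Leave alone"): (9, 0),
    ("Resist", "Demand unarmed"): (9, -1),
    ("Give in always", "Leave alone"): (9, 0),
    ("Give in always", "Demand unarmed"): (4, 4),
}

# Action replacement performed by the SPI (assumed from footnote [7]).
ROW_MAP = {"Give in (armed only)": "Give in always", "Resist": "Resist"}
COL_MAP = {"Demand (armed)": "Demand unarmed", "Leave alone": "Leave alone"}


def compare(outcome):
    """Classify how the replaced outcome relates to the default one."""
    r, c = outcome
    old = G_EXT[(r, c)]
    new = G_EXT[(ROW_MAP[r], COL_MAP[c])]
    if all(n > o for n, o in zip(new, old)):
        return "strict"          # strictly better for both players
    if new == old:
        return "equal"           # unchanged
    if all(n >= o for n, o in zip(new, old)):
        return "weak"            # better for one, unchanged for the other
    return "not an improvement"


report = {o: compare(o) for o in product(ROW_MAP, COL_MAP)}
```

Every default outcome is mapped to one that is at least as good for both players, which is what the SPI property requires (recall that, per footnote [3], equally good outcomes count as Pareto-improvements).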

The Meta-Game

Consider now the above example from the point of view of the bandit leader and the merchant who dispatches the caravan. The bandit leader is deciding whether to use the SPI or not. Both sides are deciding which policy to assign to their agent and whether the agent should sign the SPI contract or not. But we will assume that the caravan is always on board with using the SPI (which seems reasonable) and the bandits will never use the “Leave alone” action (which we revisit later). The resulting meta-game looks as follows:

| M; Meta-game | don’t use SPI (& Instr. to Demand) | use SPI (& Instr. to Demand) |
|---|---|---|
| Instruct to give in (& use SPI) | 3, 3 | 4, 4 |
| Instruct to resist (& use SPI) | -2, -2 | 9, -1 |

M has a single Nash equilibrium: (Instruct to resist, use SPI). We discuss the possible implications of this in the next section.
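The equilibrium of M can be verified by brute force. The sketch below enumerates pure profiles and also checks that "use SPI" is strictly dominant for the bandit leader, which rules out mixed equilibria; the helper names are illustrative.

```python
from itertools import product

ROWS = ["Instruct to give in", "Instruct to resist"]   # merchant
COLS = ["don't use SPI", "use SPI"]                    # bandit leader
M = {
    ("Instruct to give in", "don't use SPI"): (3, 3),
    ("Instruct to give in", "use SPI"): (4, 4),
    ("Instruct to resist", "don't use SPI"): (-2, -2),
    ("Instruct to resist", "use SPI"): (9, -1),
}


def pure_nash(rows, cols, u):
    """All pure profiles where neither player gains by deviating unilaterally."""
    equilibria = []
    for r, c in product(rows, cols):
        row_ok = all(u[(r, c)][0] >= u[(r2, c)][0] for r2 in rows)
        col_ok = all(u[(r, c)][1] >= u[(r, c2)][1] for c2 in cols)
        if row_ok and col_ok:
            equilibria.append((r, c))
    return equilibria


equilibria = pure_nash(ROWS, COLS, M)
# "use SPI" is strictly better for the bandits whatever the merchant does.
spi_strictly_dominant = all(M[(r, "use SPI")][1] > M[(r, "don't use SPI")][1] for r in ROWS)
```

Since the bandits always prefer "use SPI", the merchant best-responds with "Instruct to resist" (9 > 4), and the bandits end up with -1 instead of the 3 they would get in the default outcome of G.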

Embedding in a Larger Setting

To see whether the introduction of SPI is beneficial or not, we frame the problem a bit broader than we did above. First, we argue that SPI should not incentivize the players to opt out of participating. Second, we will see that whether the players use SPI or not depends on the larger setting in which G and M appear.

Opting-out is Undesirable: First, we discuss the opt-out option: In the merchant’s case, this means not operating on the given caravan route (utility 0). In the bandit leader’s case, it means disbanding the group and becoming farmers instead (utility 1):

| M’; extension of M | don’t use SPI | use SPI | opt-out |
|---|---|---|---|
| Instruct to give in | 3, 3 | 4, 4 | 9, 1 |
| Instruct to resist | -2, -2 | 9, -1 | 9, 1 |
| opt-out | 0, -1 | 0, -1 | 0, 1 |

The Nash equilibrium of this game is (Instruct to resist, opt-out), which seems desirable. However, consider the scenario where the bandits are replaced by, eg, rangers who keep the forest clear of monsters (or an investor who wants to build a bridge over a river). We could formalize it like the game M’ above, except that the (9, 1) payoffs corresponding to the ( _ , opt-out) outcomes would change to (-1, 1). The outcome (opt-out, opt-out) becomes the new equilibrium, which is strictly worse than the original (3, 3) outcome:

| M_a; alternative to M’ | don’t use SPI | use SPI | opt-out |
|---|---|---|---|
| Instruct to give in | 3, 3 | 4, 4 | -1, 1 |
| Instruct to resist | -2, -2 | 9, -1 | -1, 1 |
| opt-out | 0, -1 | 0, -1 | 0, 1 |
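The two opt-out variants differ only in what the merchant gets when the column player opts out, so they can be built from one template and compared directly. This is a sketch restricted to pure-strategy equilibria; `make_game` and `pure_nash` are illustrative helper names.

```python
from itertools import product

ROWS = ["Instruct to give in", "Instruct to resist", "opt-out"]
COLS = ["don't use SPI", "use SPI", "opt-out"]


def make_game(column_opt_out_payoff):
    """Build the 3x3 meta-game; the argument is the payoff pair when only the column player opts out."""
    table = [
        [(3, 3), (4, 4), column_opt_out_payoff],
        [(-2, -2), (9, -1), column_opt_out_payoff],
        [(0, -1), (0, -1), (0, 1)],
    ]
    return {(r, c): table[i][j] for i, r in enumerate(ROWS) for j, c in enumerate(COLS)}


def pure_nash(rows, cols, u):
    """All pure profiles where neither player gains by deviating unilaterally."""
    equilibria = []
    for r, c in product(rows, cols):
        row_ok = all(u[(r, c)][0] >= u[(r2, c)][0] for r2 in rows)
        col_ok = all(u[(r, c)][1] >= u[(r, c2)][1] for c2 in cols)
        if row_ok and col_ok:
            equilibria.append((r, c))
    return equilibria


M_PRIME = make_game((9, 1))    # bandits: the merchant is happy when they disband
M_A = make_game((-1, 1))       # rangers/investor: the merchant loses when they quit

eq_bandits = pure_nash(ROWS, COLS, M_PRIME)
eq_rangers = pure_nash(ROWS, COLS, M_A)
```

In M_a the unique pure equilibrium yields payoffs (0, 1), which is worse for both sides than the (3, 3) they would have obtained without SPI.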

Broader Settings: The meta-behaviour in the bandit example strongly depends on the specific setting. We show that SPI might or might not get adopted, and when it does, it might or might not change the policy that the agents are using.

Nash equilibrium, Evolutionary dynamics: First, consider the (unlikely) scenario where the game M’ is played in a one-shot setting between two rational players. Then the players will play the Nash equilibrium (Instruct to resist, opt-out).

Second, consider the scenario where there are many bandit groups and many merchants, each of which can use a different (pure) strategy. At each time step, the two groups get paired randomly. And in the next step t+1, the relative frequency of strategy S will be proportional to weight_t(S)*exp(utility of S against the other population at t). Under this dynamic, the only evolutionarily stable strategy is the NE above. (Intuitively: Using SPI outcompetes not using it. Once using SPI is prevalent, resisting outcompetes giving in. And if opting-out is an option, the column population will switch to it. Formally: Under this evolutionary dynamic, evolutionarily stable states coincide with Nash equilibria.)
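The exponential-weights dynamic described above is easy to simulate. The sketch below is one possible reading of it, under assumptions of our own: both populations start uniform, updates are simultaneous, and the game is M'. It illustrates the intuition and is not a proof of the stability claim.

```python
import math

MERCHANT = ["Instruct to give in", "Instruct to resist", "opt-out"]
BANDIT = ["don't use SPI", "use SPI", "opt-out"]
# M_PRIME[i][j] = (merchant payoff, bandit payoff) for merchant strategy i, bandit strategy j.
M_PRIME = [
    [(3, 3), (4, 4), (9, 1)],
    [(-2, -2), (9, -1), (9, 1)],
    [(0, -1), (0, -1), (0, 1)],
]


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def step(merchants, bandits):
    """One update: new frequency is proportional to old frequency * exp(expected utility)."""
    u_m = [sum(bandits[j] * M_PRIME[i][j][0] for j in range(3)) for i in range(3)]
    u_b = [sum(merchants[i] * M_PRIME[i][j][1] for i in range(3)) for j in range(3)]
    new_m = normalize([merchants[i] * math.exp(u_m[i]) for i in range(3)])
    new_b = normalize([bandits[j] * math.exp(u_b[j]) for j in range(3)])
    return new_m, new_b


def simulate(steps=100):
    merchants = [1 / 3] * 3   # assumed uniform initial populations
    bandits = [1 / 3] * 3
    for _ in range(steps):
        merchants, bandits = step(merchants, bandits)
    return merchants, bandits


final_merchants, final_bandits = simulate()
```

In this run, "use SPI" first gains on "don't use SPI", the merchants then shift towards "Instruct to resist", and once resisting is common enough the bandit population moves to "opt-out".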

Infinitely repeated games without discounting (ie, maximizing average payoffs): Here, many different (subgame-perfect) Nash equilibria are possible (corresponding to individually rational feasible payoffs). One of them would be (Instruct to give in, use SPI), but this would be backed up by the bandits threatening the repeated use of “don’t use SPI” should the merchants deviate to “Instruct to resist”. In a NE, this wouldn’t actually be used.

Translucent or pre-committing bandits: We can also consider a setting—one-shot or repeated—where the bandits can precommit to a strategy in a way that is visible to the merchant player. Alternatively, we could assume that the merchant player can perfectly predict the bandits’ actions. Or that the bandits are translucent—they are assumed to be playing a fixed strategy “Don’t use SPI”, they have the option to deviate to “Use SPI”, but there is a chance that if they do, the merchant will learn this before submitting their action.

In all of these settings, the optimal action is for the bandits to not use SPI (because it forces the caravan to give in).
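The pre-commitment case can be sketched as a leader-follower computation on the meta-game M: the bandits commit visibly, the merchant best-responds, and we compare the bandits' resulting payoffs. This is a simplified illustration that covers only the full-commitment variant, not the translucency one.

```python
ROWS = ["Instruct to give in", "Instruct to resist"]   # merchant (follower)
COLS = ["don't use SPI", "use SPI"]                    # bandits (leader)
M = {
    ("Instruct to give in", "don't use SPI"): (3, 3),
    ("Instruct to give in", "use SPI"): (4, 4),
    ("Instruct to resist", "don't use SPI"): (-2, -2),
    ("Instruct to resist", "use SPI"): (9, -1),
}


def merchant_best_response(commitment):
    """The merchant's payoff-maximizing instruction given the bandits' visible commitment."""
    return max(ROWS, key=lambda r: M[(r, commitment)][0])


def bandit_value(commitment):
    """The bandits' payoff once the merchant has best-responded to the commitment."""
    return M[(merchant_best_response(commitment), commitment)][1]


values = {c: bandit_value(c) for c in COLS}
best_commitment = max(COLS, key=bandit_value)
```

Committing to "don't use SPI" makes the merchant instruct the caravan to give in, so the bandits get 3; committing to "use SPI" invites resistance and leaves them with -1.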

The “Other” Objections

This rejection of (A_SF) complicates our models but allows us to address (at least preliminarily) some of the previously-mentioned objections:

Objection: What if the agents spend resources on protecting the surrogate goal and this makes them less competitive?

Reaction: For some designs, this might happen. For example, an agent who is equally afraid of real guns and water guns might start advocating for a water-gun ban.[8] But perhaps this can be avoided with a better design? Also, note that this might not apply to the “action override” implementation mentioned in the introduction.

Objection: Agents might refuse the SPI contract as a form of signalling in incomplete information settings.

Reaction: This seems quite likely—in a repeated setting, I might refuse some SPI contracts that would be good for me, hoping that the other agent will think that they need to offer me a better deal (a more favourable SPI) in the next rounds. More work here would be useful to understand this.

Objection: It might be hard to make the approach “threatener-neutral” (ie, to keep the desirability of threat-making the same as before).

Reaction: In the Surrogate Fairy setting, SPI is threatener-neutral by definition, since the agents use the same policy. However, without (A_SF), we saw that one party can indeed be made worse off—which can be viewed as a kind of threatener non-neutrality. One way around this could be to subsidize that party through side-payments. (But the details are unclear.)

Finally, since we are already considering a larger context around the game, it is worth noting that there are additional tools that might help with SPI. EG, apart from side-payments, we could design agents which commit to not change their policy for a certain period of time. (Incidentally, the observation that these tools might help with making SPI work is one reason for the author’s intuition that SPI will “add up to normality”.)

SPI and Bargaining

One limitation of SPI is that even if all players agree to use it, they are still sometimes left with a difficult bargaining problem. (However, as [OC] points out, the consequences of bargaining failure might be less severe.) This can arise at two points.

SPI Selection

First, the players need to agree on which SPI to use. For example, suppose two players are faced with a “wingman dilemma” [9] : who gets to be the cool guy today?

| Wingman Dilemma | Wingman | Cool guy |
|---|---|---|
| Wingman | 1, 1 | 0, 3 |
| Cool guy | 3, 0 | 0, 0 |

This game has a perfect-coordination SPI where (Cool guy, Cool guy) is replaced by a lottery that selects (Cool guy, Wingman) with probability p and (Wingman, Cool guy) with probability 1 - p, for some p in [0, 1], yielding expected utilities (3p, 3(1 - p)). (We could also replace (Wingman, Wingman), but this wouldn’t change our point.)

| perf. coord. SPI on WD | Wingman | Cool guy |
|---|---|---|
| Wingman | 1, 1 | 0, 3 |
| Cool guy | 3, 0 | 3p, 3(1 - p) in expectation, but seems like 0, 0 |

In the program meta-game [OC], we can consider strategies of the form “demand x”:

  Demand(x) := if the opponent’s program isn’t of the form Demand(y) for some y, play Cool guy
               if opponent plays Demand(y) s.t. x + y > 3, play Cool guy
               if opponent plays Demand(y) s.t. x + y <= 3, coordinate with them on achieving
               expected utility (3p, 3(1 - p)) for some random compatible p (ie, 3p >= x and 3(1 - p) >= y).

The equilibria of this meta-game are (Demand(x), Demand(3 - x)) for any x in [0, 3]. In other words, introducing SPI effectively transforms WD into the Nash Demand game. Arguably, this particular situation has a simple solution: “don’t be a douche and use Demand(1.5)” (ie, the Schelling point). However, Section 9 of [OC] indicates that SPI selection can correspond to more difficult bargaining problems.
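To make the Nash Demand analogy concrete, here is a minimal sketch of a “demand x” meta-game. It rests on assumptions of our own that the text does not pin down: two demands are compatible when they sum to at most 3 (the most the two players can jointly get in WD), a compatible pair is resolved by a lottery probability drawn uniformly from the compatible range, incompatible demands lead to (Cool guy, Cool guy) with payoffs (0, 0), and only deviations to other demands on a finite grid are considered.

```python
from fractions import Fraction

TOTAL = Fraction(3)  # assumed: the best joint payoff available in WD


def payoffs(x, y):
    """Expected utilities when a player demanding x meets a player demanding y."""
    if x + y > TOTAL:
        return Fraction(0), Fraction(0)      # incompatible: both play Cool guy
    lo, hi = x / TOTAL, 1 - y / TOTAL        # p must satisfy 3p >= x and 3(1 - p) >= y
    p = (lo + hi) / 2                        # mean of a uniform draw from [lo, hi]
    return TOTAL * p, TOTAL * (1 - p)


GRID = [Fraction(k, 4) for k in range(13)]   # demands 0, 1/4, ..., 3


def is_equilibrium(x, y):
    """True if neither player can gain by switching to another demand on the grid."""
    u1, u2 = payoffs(x, y)
    no_dev_1 = all(payoffs(d, y)[0] <= u1 for d in GRID)
    no_dev_2 = all(payoffs(x, d)[1] <= u2 for d in GRID)
    return no_dev_1 and no_dev_2


exact_splits_are_equilibria = all(is_equilibrium(x, TOTAL - x) for x in GRID)
modest_demands_are_equilibrium = is_equilibrium(Fraction(1), Fraction(1))
```

Under these assumptions, every pair of demands that exactly exhausts the total is an equilibrium (demanding more triggers conflict, demanding less leaves utility on the table), while a slack pair such as (1, 1) is not, because either player could profitably raise their demand. This is the multiplicity of equilibria that makes SPI selection a bargaining problem.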

Bargaining in SPI

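Even once the agents agree on which SPI to use, they still need to agree on which outcome to choose. For example, consider a game G with actions a_1, ..., a_n and b_1, ..., b_n, utilities u(a_i, b_i) = (x_i, y_i) and u(a_i, b_j) = (0, 0) for i ≠ j, where {(x_i, y_i)} is a subset of the first quadrant of a circle. Since none of the outcomes (a_i, b_i), i = 1, ..., n, can be Pareto-improved upon, all of these options will remain on the Pareto-frontier of any SPI.

| G (shown for n = 3) | b_1 | b_2 | b_3 |
|---|---|---|---|
| a_1 | x_1, y_1 | 0, 0 | 0, 0 |
| a_2 | 0, 0 | x_2, y_2 | 0, 0 |
| a_3 | 0, 0 | 0, 0 | x_3, y_3 |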

Imperfect Information

For the purpose of this section, we will only consider the simplest case of “Surrogate Fairy”, where everything that the agents are concerned with is how to play the game at hand and whether to sign the SPI contract or not.

Imperfect information comes in two shapes: incomplete information (uncertainty about payoffs) and imperfect but complete information (where players know each other’s payoffs, but the game is sequential and some events might not be observable to everybody).

First, let us observe that under (A_SF), we only need to worry about incomplete information. Indeed, any complete information sequential game G has a corresponding normal-form representation N. So as long as we don’t care about computational efficiency, we can just take whatever SPI tool we have and apply it to N. In theory, this could create problems with some strategies not being subgame-perfect. However, recall that we assume that agents can make precommitments. As a result, subgame-imperfection does not pose a problem for complete information games.

However, incomplete information will still pose a problem. This is because with incomplete information, an agent is essentially uncertain about the identity of the agent they are facing. For example, perhaps I am dealing with somebody who prefers conflict? Or perhaps they think that I might prefer conflict, and maybe I could use that to bluff? Importantly, for the purpose of SPI, the trick “reduce the game to its normal-form representation” doesn’t apply, because the agents’ power to precommit in incomplete information games is weaker. Indeed, while the agent can make commitments on behalf of its future versions, it will typically not have influence over its “alternative selves”.[10] As a result, we must require that the agents behave in a subgame-perfect manner.

| Incomplete inf. WD | Wingman | Cool guy |
|---|---|---|
| Wingman | 1, 1 | 0, 3 |
| Cool guy | 3, 0 | x, y |

x, y are drawn from some common knowledge distribution but the exact value of x, resp. y, is private to player 1, resp. 2.

We see several ways to deal with incomplete information. First, the agents might use some third-party device which proposes a subgame S (or a perfect-information token game) without taking their types into account. Each player can either accept S or turn it down, and under (A_SF), it will get accepted by all players if and only if it is an SPI over the game that corresponds to their type. Notice that there will be a tradeoff between fairness and acceptance rate of S. For example, in the game above, suppose that x = 0 always but y can be either 0 (90%) or 2 (10%). We can either get a 90% acceptance rate by proposing the fair 50:50 split (which player 2 turns down when y = 2, since 1.5 < 2), or a 100% acceptance by proposing an unfair 1:2 split. (The optimal design of such third-party devices thus becomes an open problem.)

Second, the agents might each privately reveal their type to a trusted third-party which then generates a subgame that is guaranteed to be SPI for their types. (EG, it could override some joint outcomes by others that Pareto-improve upon them, without revealing the specific mapping ahead of time to prevent leaking information.) Under (A_SF), this approach should always be appealing to all players.

Third, the agents might bargain over which subgame to play instead of G. (Alternatively: over which third-party device to use. Or the devices could return sets of subgames and the players could then bargain over which one to use.) For this to “fully” work under (A_SF), the players “only” need to guarantee that their subgame-policy does not depend on information learned during the bargaining phase. Going beyond (A_SF), it might additionally help if the policies for playing subgames do not know which subgame they are playing; this seems simple to achieve in the override interpretation but difficult in the SG interpretation.

Note that if we completely abandon (A_SF) and assume that agents interact repeatedly, the decision to accept a subgame or not might update the other agents’ beliefs about one’s type. As a result, the agents might refuse a subgame that is an SPI in an attempt to mislead the others. This behaviour seems inherent to repeated interactions in incomplete settings and we do not expect to find a simple way around it.

If SPI Works, Why Aren’t Humans Already Using It?

This is an informal and “outside-view” objection that stems from the observation that to the author, it seems that nothing like SPI is being used by humans. One might argue that SPI requires transparent policies, and humans aren’t transparent. However, humans are often partially transparent to each other. Moreover, legal contracts and financial incentives do make their commitments at least somewhat binding. As a result, we expect that if SPI “works” for AIs, it would already work for humans, at least to a limited extent. Why doesn’t it?

One explanation is that humans in fact do implicitly use things like SPI, but they are so ingrained that it is difficult for us to notice the similarities. For example, when we get into a shouting match but refrain from using fists, or have a fistfight but don’t murder each other, is that an SPI? But perhaps I only refrain from beating people up because it would get me in trouble with the police (or make me unpopular)? But could this still be viewed as a society-wide method for implementing SPI? Overall, this topic seems fascinating, but too confusing to be actionable. We view observations of this sort as weak evidence for the hypothesis that SPI will “add up to normality”.

However, this leaves unanswered the question of why humans don’t (seem to) use SPI explicitly. We might expect to see this when people reason analytically and the stakes are sufficiently high. Some areas which pattern-match this description are legal disputes, deals between companies or corporations, and politics. Why aren’t we seeing SPI-like methods there? The answer for legal disputes and politics is that SPI very well might be used there; it is just that the author is mostly ignorant about these areas. (If somebody more familiar with these areas knows about relevant examples or their absence, these would definitely provide useful intuitions for SPI research.) For business deals, the answer is different: SPI isn’t used there because in most countries, it would be a violation of anti-trust law. Indeed, conflict between companies is often precisely the thing that makes the market better for the customers. (However, there might be other forms of conflict, eg negative advertisement, that everybody would be happy to avoid.) Finally, another reason why we might not be seeing explicit use of SPI is that SPI often applies to settings that involve blackmail. People might, therefore, refrain from using SPI lest they acknowledge that they do something illegal or immoral.

Empowering Bad Actors

As a final remark, the example of companies using SPI to avoid costly competition (eg, lowering prices) should make us wary about the potential misuses of SPI. However, this concern is not new or specific to SPI—rather, it applies to cooperative AI more broadly.

Questions & Suggestions for Future Research

disclaimer: disorganized and unpolished

Questions related to the current text:

  • Can we find a setting, similar to the “bandits become farmers” setting, where SPI makes things very bad?

  • Can we find a setting where threats get carried out despite introducing SPI? Or perhaps even because of it?

More general questions:

  • Assuming (A_SF), work out details of how SPI would actually apply to some real-world problem. (To identify problems we might be overlooking.)

  • Describe a useful & realistic setting for SPI without (A_SF). Identify open problems for that setting.

  • Meta: Is it possible to make SPI that helps with the “bad threats” but doesn’t disrupt the “good threats” (like taxes, police, and “no ice cream for you if you leave your toys on the floor”)?


We presented a preliminary analysis of several objections to surrogate goals and safe Pareto improvements. First, we saw that to predict whether SPI is likely to help or not, we need to consider a larger context than just the one isolated decision problem. More work in this area could focus on finding useful framings in which to analyze SPI. Second, we observed that SPI doesn’t make bargaining go away: the agents need to agree on which SPI to use and which outcome in that SPI to select. Finally, we saw some evidence that SPI might work even in incomplete information settings; however, more work on this topic is needed.

This research has been financially supported by CLR. Also, many thanks to Jesse Clifton, Caspar Oesterheld, and Tobias Baumann and to various people at CLR.


[1] Wild guess: 75%.

[2] Strategic equivalence could be formalized in various different ways (eg, also removing some weakly dominated strategies). What ultimately matters is that the agents need to treat them as equivalent.

[3] In accordance with [OC], we will use “Pareto-improvement” to include outcomes that are equally good.

[4] In practice, AI representatives could make this agreement directly between themselves. This variant makes a lot of sense, but we avoid it because it seems too schizophrenic to generate useful intuitions.

[5] The paper [OC] is not unaware of the need for this assumption. We chose to highlight it because we believe it might be a crux for whether SPI works or not. This is a potential disagreement between this text and [OC].

[6] If you think the bandits would want to take everything, imagine that 50% of the time the caravan manages to avoid them.

[7] The interpretation is that the players pick actions as if they were playing the original game, but then end up replacing “Demand” by “Demand unarmed” and “Give in” by “Give in always”. For further details, see Caspar’s SPI paper.

[8] Note that a non-naive implementation will only be afraid of water-guns if it signs the SPI contract with the person holding them. In particular, no, the agent won’t develop a horrible fear of children near swimming pools.

[9] As far as the payoffs go, this is a Game of Chicken. However, the traditional formulation of Chicken gives the intuition that coordination and communication is not the point.

[10] We could program all robots of the same type to act in a coordinated manner even if they are used by principals with different utilities. However, coordinating with a completely different brand of AI might be unlikely. Finally, what if the representative R1 is only one of a kind and the uncertainty over payoffs is only present because the other representative R2 has a belief over the goals of principal P1 (and this belief is completely detached from reality, but nevertheless common knowledge)? In such a setting, it would be undesirable for R1 to try “coordinating” with its hypothetical alternative selves who serve the hypothetical alternative P1s.