Reflective oracles and superrationality

This grew out of an exchange with Jessica Taylor during MIRI’s recent visit to the FHI. Still getting my feel for the fixed point approach; let me know of any errors.

People at MIRI have recently proved that agents can use reflective oracles to reason about other agents (including other agents with oracles) and about themselves, and consistently reach Nash equilibria. But can we do better than that?

To recap, a reflective oracle is a machine O such that:

  • P(A()=1)>p implies O(A,p)=1

  • P(A()=0)>1-p implies O(A,p)=0

This works even if A() includes a call to the oracle within its code. Now, all the algorithms used here will be clearly terminating, so we’ll have the converse implications as well (e.g. O(A,p)=0 implies P(A()=0) ≥ 1-p). And given any δ, we can, with order log(1/δ) questions, establish the probability of A() to within δ. Thus we will write O(A()==a)=p to mean that O(A()==a, (n-1)δ/2)=1 and O(A()==a, (n+1)δ/2)=0, where (n-1)δ/2 < p < (n+1)δ/2.
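As a toy sketch of this probability-estimation step: below, the reflective oracle is replaced by a simulated comparator for a machine whose output probability we secretly know (0.625, a value chosen purely for illustration), and binary search brackets P(A()=1) to within δ using order log(1/δ) queries.

```python
# Toy sketch: the real reflective oracle answers questions about (possibly
# self-referential) machines; here we only simulate its comparison
# behaviour for a machine with a known output probability.

def make_oracle(true_p):
    """Simulated oracle: O(A, p) answers 1 iff P(A()=1) > p."""
    return lambda p: 1 if true_p > p else 0

def estimate(O, delta):
    """Bracket P(A()=1) to within delta by binary search on oracle calls."""
    lo, hi = 0.0, 1.0
    queries = 0
    while hi - lo > delta:
        mid = (lo + hi) / 2
        if O(mid):          # P(A()=1) > mid
            lo = mid
        else:               # P(A()=1) <= mid
            hi = mid
        queries += 1
    return (lo + hi) / 2, queries

p_hat, n = estimate(make_oracle(0.625), 0.01)
assert abs(p_hat - 0.625) < 0.01
assert n == 7               # ceil(log2(1/0.01)) queries
```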

Note also that calls to O can be used to generate randomness (to within δ), so outputting specific mixed strategies is possible.

If p1 and p2 are two probability distributions/​strategies over possible agent outputs, define them to be “δ-Pareto” if they are within δ of Pareto-optimal strategies. We can differentiate the utilities at (p1,p2) for small changes in strategy, by infinitesimally increasing the weight of some pure strategies o1 and o2 (note that each strategy actually has one fewer independent degree of freedom than the number of pure strategies, since probabilities must sum to 1). We’ll say the derivative D(p1,p2)(o1,o2) is δ-Pareto if we are sure that it is an improvement for both players for all possible (p1’,p2’) within the respective δ-ranges of (p1,p2).
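Here is a minimal Python sketch of this directional-derivative test, for an assumed prisoner’s dilemma payoff matrix (the figure with the actual payoffs is not shown, so the numbers here are illustrative), with `directional_derivative` a hypothetical helper name:

```python
import numpy as np

# Assumed payoffs: rows are my pure strategies, columns the opponent's,
# ordered (C, D).
U1 = np.array([[2.0, 0.0],
               [3.0, 1.0]])   # my payoffs
U2 = U1.T                     # opponent's payoffs (symmetric game)

def directional_derivative(p1, p2, o1, o2):
    """Rate of change of both expected utilities when player 1 shifts
    weight towards pure strategy o1 and player 2 towards o2."""
    e1 = np.eye(2)[o1] - p1   # simplex-preserving direction for player 1
    e2 = np.eye(2)[o2] - p2
    du1 = e1 @ U1 @ p2 + p1 @ U1 @ e2
    du2 = e1 @ U2 @ p2 + p1 @ U2 @ e2
    return du1, du2

# At mutual defection, jointly shifting towards (C, C) (indices 0, 0)
# improves both players' utilities:
du1, du2 = directional_derivative(np.array([0.0, 1.0]),
                                  np.array([0.0, 1.0]), 0, 0)
assert du1 > 0 and du2 > 0
```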


We’ll make the following two assumptions:

  • The players do not make use of each other’s internal structures, just of the possible outputs and the calls to O().

  • The players do not have access to a joint source of randomness.

So now fix 1 >> ε >> δ > 0, assume the problem is the Prisoner’s dilemma as given in the previous figure, let Q() be the other player, and consider the following algorithm:

define LP():
  For all outputs o1 of LP(), compute O(LP()==o1).
  Call the probability distribution generated p1.
  For all outputs o2 of Q(), compute O(Q()==o2).
  Call the probability distribution generated p2.
  If exists o1 and o2 such that D(p1,p2)(o1,o2) is δ-Pareto:
  - With probability 1-ε output p1.
  - With probability ε output o1.
  Else output p1.
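To see the fixed-point argument in action, here is a toy self-play simulation, with the reflective oracle replaced by plain iteration of the ε-perturbation map, using an assumed symmetric payoff matrix (the figure is not shown, so the numbers are illustrative). Starting at mutual defection, the joint (C,C) improvement keeps firing until the strategies are δ-close to full cooperation:

```python
import numpy as np

# Assumed payoffs, order (C, D): cooperation (2,2), defection (1,1),
# defecting against a cooperator gives 3.
U1 = np.array([[2.0, 0.0],
               [3.0, 1.0]])   # player 1's payoffs
U2 = U1.T                     # symmetric game
eps, delta = 0.05, 1e-3

def joint_improvement_towards_C(p1, p2):
    """Is shifting both players' weight towards C a strict improvement?"""
    e1 = np.array([1.0, 0.0]) - p1
    e2 = np.array([1.0, 0.0]) - p2
    du1 = e1 @ U1 @ p2 + p1 @ U1 @ e2
    du2 = e1 @ U2 @ p2 + p1 @ U2 @ e2
    return du1 > delta and du2 > delta

p1 = np.array([0.0, 1.0])               # start at mutual defection
p2 = np.array([0.0, 1.0])
for _ in range(1000):
    if not joint_improvement_towards_C(p1, p2):
        break                           # fixed point: no δ-Pareto direction
    p1 = (1 - eps) * p1 + eps * np.array([1.0, 0.0])
    p2 = (1 - eps) * p2 + eps * np.array([1.0, 0.0])

assert p1[0] > 0.95 and p2[0] > 0.95    # settles near full cooperation
```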

Now, what does this algorithm do? To answer that, we need to consider its fixed points. Since ε >> δ, the only possible fixed points are those where the “If exists…” branch is not triggered: whenever it is, the output is shifted by ε towards o1, which O cannot consistently report as p1 (even the degenerate case p1=o1 is not possible, as then D(p1,-)(o1,-)=0, since the probability of o1 cannot be further increased).

Thus (p1,p2) must be strategies such that there do not exist o1, o2 making D(p1,p2)(o1,o2) δ-Pareto. In the prisoner’s dilemma, there is always a possibility of Pareto improvement by increasing mutual cooperation (o1,o2)=(C,C), so p1 and p2 must themselves be δ-Pareto. Thus LP() will always reach δ-Pareto outcomes with Q(). The name LP stands for “locally Pareto” (we’ll revisit the “locally” later).

Though LP() achieves Pareto outcomes, this is not always ideal. If Q()==LP(), they will achieve some Pareto outcome, but it could be any of them. If Q() is the cooperation rock (the agent that always cooperates), then p1 could be any mix of defect and cooperate, as all those outcomes are Pareto. More worryingly, if Q() is the defection rock, LP() must cooperate (to within δ), as that is the only Pareto outcome.

To deal with this, consider neLP() (non-exploitable LP()). Define O(p1,p2) as the module computing the two probability distributions. The value of (p1,p2) is the expected utility of these according to the agent. If we say the value is δ-surely less than some other number, that means that the value of (p1’,p2’) is strictly less than that number for all possible p1’ and p2’ within δ of p1 and p2, respectively.

define neLP():
  O(p1,p2)
  Let min be the minimax value, guaranteed by the strategy p1'.
  If the value of (p1,p2) is δ-surely less than min:
  - Output p1'.
  If exists o1 and o2 such that D(p1,p2)(o1,o2) is δ-Pareto:
  - With probability 1-ε output p1.
  - With probability ε output o1.
  Else output p1.

If multiple o1 exist, it chooses randomly among them.

For the prisoner’s dilemma, p1’=D and min=1. When playing against defection rock, p2 must be within δ of pure defection. The “If the value of…” clause prevents neLP() from cooperating with a probability larger than order δ. Therefore, neLP() will compute a (p1,p2) that, most of the time, will cause it to defect (“Output p1’”), and around δ/​ε of the time, to go through the “If exists” loop, and cooperate with probability ε, resulting in cooperation of order δ.
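The quoted values min=1 and p1’=D can be checked directly, again for an assumed payoff matrix (the original figure is not shown; cooperation yields (2,2), mutual defection (1,1), and defecting against a cooperator 3):

```python
import numpy as np

# Rows and columns ordered (C, D); U1 holds the row player's payoffs.
U1 = np.array([[2.0, 0.0],
               [3.0, 1.0]])

def security_value(mixed):
    """Worst-case expected payoff of a mixed strategy over the rows."""
    return float(min(mixed @ U1))

# Pure defection guarantees 1; pure cooperation guarantees 0; and no mix
# guarantees more than 1, so min = 1 with minimax strategy p1' = D.
assert security_value(np.array([0.0, 1.0])) == 1.0
assert security_value(np.array([1.0, 0.0])) == 0.0
assert max(security_value(np.array([1 - q, q]))
           for q in np.linspace(0, 1, 101)) == 1.0
```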

What happens when neLP() plays itself? The two players must compute either the same values for the probabilities in p1 and p2, or values that are δ/2 apart. The two “probability zones” computed by the two players must thus touch, at a corner if nothing else. The first player will think the outcomes are δ-surely less than min if its “zone” has value strictly less than 1; conversely for the second player. Thus the touching point would need coordinates (a,b) with a<1 and b<1, but such a point does not exist in the prisoner’s dilemma outcome space.

So at least one player must reach the “If exists…” clause. But, as before, for the prisoner’s dilemma, strategies that trigger that clause are not stable. So the players must reach a δ-Pareto outcome. By the earlier conditions, this must be one whose value is not δ-surely less than the (D,D) value for either player. Consequently, it must be on the red boundary of the following figure:

The red boundary is what neLP() can achieve against copies of itself. The combined red and blue boundary is what neLP() can achieve against LP() and cooperation rock.

Can we do better? We might be tempted to increase “min”. If we increased min to 2, say, then surely the result would be Pareto in the “greater than 2” region? Would we have successfully moved from “non-exploitable” to “exploiter”?

However, though this works against LP() and cooperation rock, it runs into trouble when playing against itself: “you both defect” becomes a consistent oracle answer, since the mutual-defection value is δ-surely less than 2 for both players. A better solution is to define the “allowable region” as the green one here:

Thus the “If the value of…” line is replaced with “If the value of (p1,p2) is δ-surely not in the green zone” and the argument goes through as before. If such an agent faces a copy of itself, the combined allowable region is the kite delimited by the darker green lines, and then the outcome will be along the Pareto light green lines.

The “green” agent will even be able to cooperate with the “yellow” agent, the one whose allowable region is the yellow triangle. Since their allowable regions overlap, the outcome will be the Pareto segment at the top of the overlap.

However, two “yellow” agents will not be able to cooperate, and will mutually defect. By becoming too greedy, and insisting on a higher share of the prize, they’ve made mutual cooperation impossible. This seems to be a general trend: to make yourself better against some agents, you make yourself worse against others.

But the prisoner’s dilemma is an unusual kind of problem: the space of all possible expected outcomes can be reached by individual players choosing independent mixed strategies.

To see that this need not be the case, consider the variant of the PD where the rewards for defection are much larger:

The background grey triangle is the space of all possible expected outcomes. However, by choosing independently, the agents can only reach the purple area. All previous agents that reached a Pareto outcome (LP(), neLP(), Green-neLP() and Yellow-neLP(), apart from Yellow-neLP() against itself) will still reach a “Pareto” outcome here, where Pareto means that it cannot be mutually improved upon without a joint source of randomness.

Now this PD variant is bad enough, but there are some problems where the “Pareto” boundary is very far from the ideal one. Consider this variant of the coordination problem:

Unless one player gets almost all their utility, the outcome is likely to be very suboptimal.

But things can be even worse than that. Consider the stag hunt problem:

Here there is also an area that can’t be reached, but since it’s Pareto-dominated by the area that can be reached, that isn’t the problem. The problem is that at the (Hare,Hare) equilibrium point at (3,3), there are no local Pareto improvements. Any increase in the probability of Stag, by either or both players, cannot increase the joint utilities.

The LP(), neLP() agents can therefore converge to (H,H) as an equilibrium.

Indeed, it’s only when one player has a probability of Stag greater than 3/4 (or both have a probability greater than 3/8) that it becomes possible for there to be local Pareto improvements in the direction of (S,S). It’s perfectly possible to modify LP() into P() (no longer local) so that it increases S (and thus decreases its utility) if the probability of S is below 3/8.

If this happens, then two copies of P() will reach (S,S) on the stag problem. If P() is willing to increase the probability of S up to 3/4, then it will reach (S,S) against LP(). But that means that P() has laid itself open to ending up at a value of (3/4)·0 + (1/4)·3 = 3/4 if it’s faced with a Hare-rock. Again, to make things better against some agents, one has to make things worse against others.
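The 3/4 and 3/8 thresholds can be checked numerically, for an assumed stag hunt payoff matrix consistent with the values quoted in the text ((H,H)=(3,3), playing S against H gives 0, and (S,S)=(4,4) is an assumption, since the original figure is not shown):

```python
import numpy as np

# Order (S, H); U holds the row player's payoffs.
U = np.array([[4.0, 0.0],
              [3.0, 3.0]])

def payoff(p, q):
    """My expected payoff when I play S with prob p, opponent with prob q."""
    return np.array([p, 1 - p]) @ U @ np.array([q, 1 - q])

def unilateral_gain(q, h=1e-6):
    """Derivative of my payoff in my own S-probability, at p = q."""
    return (payoff(q + h, q) - payoff(q, q)) / h

def joint_gain(p, h=1e-6):
    """Derivative of the summed payoffs when both raise S together."""
    return (2 * payoff(p + h, p + h) - 2 * payoff(p, p)) / h

# Unilateral increases of S only help once the opponent plays S > 3/4;
# joint increases only help once both already play S > 3/8.
assert unilateral_gain(0.80) > 0 > unilateral_gain(0.70)
assert joint_gain(0.40) > 0 > joint_gain(0.35)
```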

What if the agents have some joint source of randomness, say a process R() that takes certain values, which they will see before their final decisions? Assume that R() is δ²-complete, in that we can divide the set of outcomes of R() into 1/δ² disjoint sets, each with probability less than δ² + δ⁵ (we use δ⁵ so that, even when summed over all 1/δ² sets, the error remains bounded by δ³).

Let p be a combined probability distribution over the two agents and R(). Given the fineness of R(), we can “ε-move” p towards any point on the true Pareto boundary, with resolution δ. To see this, let A = q(o1,o2) + (1-q)(o1’,o2’) be the target point on the Pareto boundary. Then choose randomly a set R1 of outcomes of R(), of probability εq, and a set R2 of probability ε(1-q). Replace p with p’, where p’=(o1,o2) on R1, p’=(o1’,o2’) on R2, and p’=p otherwise. This obviously moves the portion of p on R1 and R2 all the way to A, and since R1 and R2 were chosen randomly, the whole process ε-moves p towards A.
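Here is a hedged sketch of this ε-move. The δ²-complete process R() is stood in for by a uniform distribution over N outcomes (a hypothetical discretisation), and p by a map from each outcome of R() to a pure joint outcome:

```python
import random

random.seed(0)
N, eps, q = 10_000, 0.1, 0.3
a, b = ("C", "C"), ("C", "D")     # target A = q*a + (1-q)*b on the boundary

p = {r: ("D", "D") for r in range(N)}   # start: always mutual defection

# Randomly choose disjoint sets R1 (probability eps*q) and R2 (eps*(1-q)).
perm = random.sample(range(N), N)
R1 = set(perm[: int(eps * q * N)])
R2 = set(perm[int(eps * q * N): int(eps * N)])

# Replace p by a on R1, by b on R2, and leave it alone elsewhere.
p_moved = {r: a if r in R1 else b if r in R2 else p[r] for r in range(N)}

# An eps chunk of probability mass has moved towards A, split q : (1-q).
frac_a = sum(v == a for v in p_moved.values()) / N
frac_b = sum(v == b for v in p_moved.values()) / N
assert abs(frac_a - eps * q) < 1e-9 and abs(frac_b - eps * (1 - q)) < 1e-9
```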

Then we can define the GP() (Globally-Pareto) agent:

define GP():
  For all outputs o1,o2,r of (GP(),Q(),R()), compute probabilities.
  Call the probability distribution generated p.
  If p is not δ-Pareto:
  - ε-move p towards some point on the Pareto boundary.
  - Output the ε-moved p.
  Else output p.

Of course, “output p” means output the GP() portion of p, and the target point can be chosen randomly among the points on the Pareto boundary.

This must reach a globally Pareto point, and, of course, there are also various neGP() variants.
