Totally! The ecosystem I think you’re referring to is all of the programs which, when playing Chicken with each other, manage to play a correlated strategy somewhere on the Pareto frontier between (1,2) and (2,1).
Games like Chicken are actually what motivated me to think in terms of “collaborating to build mechanisms to reshape incentives.” If both players choose their mixed strategy separately, there’s an equilibrium where they independently mix (1⁄3, 2⁄3) between Straight and Swerve respectively. But sometimes this leads to (Straight, Straight) or (Swerve, Swerve), leaving both players with an expected utility of 2⁄3 and wishing they could coordinate on Something Else Which Is Not That.
If they could coordinate to build a traffic light, they could correlate their actions and only mix between (Straight, Swerve) and (Swerve, Straight). A 50⁄50 mix of these two gives each player an expected utility of 1.5, which seems pretty fair in terms of the payoffs achievable in this game.
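As a quick arithmetic check, here’s a minimal sketch of that traffic-light calculation. It only uses the two frontier payoff pairs given above, (2, 1) and (1, 2); the dictionary-of-outcomes representation is just one convenient encoding, not anything from the original discussion:

```python
# Expected utilities under a correlated ("traffic light") distribution over
# joint strategies in Chicken. Only the two Pareto-frontier outcomes from
# the text are needed here.
outcomes = {
    ("Straight", "Swerve"): (2, 1),
    ("Swerve", "Straight"): (1, 2),
}

def correlated_value(mix):
    """Expected payoff per player under a distribution over joint strategies."""
    eu1 = sum(p * outcomes[joint][0] for joint, p in mix.items())
    eu2 = sum(p * outcomes[joint][1] for joint, p in mix.items())
    return eu1, eu2

# The 50/50 mix between (Straight, Swerve) and (Swerve, Straight):
fair_mix = {("Straight", "Swerve"): 0.5, ("Swerve", "Straight"): 0.5}
print(correlated_value(fair_mix))  # (1.5, 1.5)
```

Each player gets 0.5 × 2 + 0.5 × 1 = 1.5, matching the figure above.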
Anything that’s mutually unpredictable and mutually observable can be used to correlate actions by different agents. Agents that can easily communicate can use cryptographic commitments to produce legibly fair correlated random signals.
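One standard way to produce such a signal is a commit-then-reveal coin flip. The sketch below is a simplified illustration (hash commitments via SHA-256, XOR of revealed bits), not a hardened protocol: each party commits to a salted random bit, then reveals; the shared bit is the XOR, so neither party can bias it after seeing the other’s commitment.

```python
import hashlib
import secrets

def commit(bit: int):
    """Commit to a bit by hashing it with a random salt."""
    salt = secrets.token_bytes(16)
    digest = hashlib.sha256(salt + bytes([bit])).digest()
    return digest, salt

def verify(digest: bytes, salt: bytes, bit: int) -> bool:
    """Check a revealed (salt, bit) pair against an earlier commitment."""
    return hashlib.sha256(salt + bytes([bit])).digest() == digest

# Alice and Bob each pick a bit and exchange commitments first...
a_bit, b_bit = secrets.randbelow(2), secrets.randbelow(2)
a_commit, a_salt = commit(a_bit)
b_commit, b_salt = commit(b_bit)

# ...then reveal, each checking the other's reveal against the commitment.
assert verify(a_commit, a_salt, a_bit) and verify(b_commit, b_salt, b_bit)
shared_bit = a_bit ^ b_bit  # fair as long as either party randomized honestly
```

The shared bit can then index into a joint strategy, e.g. 0 → (Straight, Swerve), 1 → (Swerve, Straight).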
My impression is that being able to perform logical handshakes creates program equilibria that can be better than any correlated equilibrium. When the traffic light says the joint strategy should be (Straight, Swerve), the player told to Swerve has an incentive to actually Swerve rather than go Straight, assuming the other player is going to be playing their part of the correlated equilibrium. But the same trick doesn’t work in the Prisoners’ Dilemma: a traffic light announcing (Cooperate, Cooperate) doesn’t give either player an incentive to actually play their part of that joint strategy. Whereas a logical handshake actually does reshape the players’ incentives: they each know that if they deviate from Cooperation, their counterpart will too, and they both prefer (Cooperate, Cooperate) to (Defect, Defect).
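To make the Prisoners’ Dilemma handshake concrete, here’s a deliberately crude sketch: a delegate that Cooperates iff its counterpart’s source is an exact copy of its own. The literal string comparison and the `SOURCE` stand-in are simplifications of the real construction, which uses quining (or proof search) rather than being handed its own source:

```python
def clique_bot(my_source: str, their_source: str) -> str:
    # Cooperate only with exact copies; play the safe action otherwise.
    # Exact equality is the crudest possible "logical handshake".
    return "Cooperate" if their_source == my_source else "Defect"

SOURCE = "clique_bot-v1"  # stand-in for the program's own quined source text
print(clique_bot(SOURCE, SOURCE))        # -> Cooperate
print(clique_bot(SOURCE, "defect-bot"))  # -> Defect
```

Against a copy of itself, deviation is pointless: any change to the source breaks the equality check and triggers mutual Defection, which both players disprefer to (Cooperate, Cooperate).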
I haven’t found any results for the phrase “correlated program equilibrium”, but cousin_it talks about the setup here:
AIs that have access to each other’s code and common random bits can enforce any correlated play by using the quining trick from Re-formalizing PD. If they all agree beforehand that a certain outcome is “good and fair”, the trick allows them to “mutually precommit” to this outcome without at all constraining their ability to aggressively play against those who didn’t precommit. This leaves us with the problem of fairness.
This gives us the best of both worlds: the random bits can get us any distribution over joint strategies we want, and the logical handshake allows enforcement of that distribution so long as it’s better than each player’s BATNA. My impression is that it’s not always obvious what each player’s BATNA is, and in this sequence I recommend techniques like counterfactual mechanism networks to move the BATNA in directions that all players individually prefer and agree are fair.
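Combining the two ingredients, a “correlated program equilibrium” delegate for Chicken might look like the following sketch. The role indexing, the exact-source handshake, and the fallback mix are all illustrative simplifications: on a successful handshake, the common random bit selects between the two fair outcomes; on failure, the delegate reverts to the closed-source (1⁄3, 2⁄3) mixed strategy mentioned below as a stand-in BATNA.

```python
import random

def delegate(role: int, my_source: str, their_source: str, common_bit: int) -> str:
    """Play Chicken given the counterpart's source and a common random bit.

    role is 0 or 1, distinguishing the two players so that identical code
    can still play asymmetric joint strategies.
    """
    if their_source == my_source:
        # Handshake succeeded: enforce the 50/50 correlated mix.
        # bit 0 -> (Straight, Swerve); bit 1 -> (Swerve, Straight).
        return "Straight" if common_bit == role else "Swerve"
    # Handshake failed: fall back to the independent mixed strategy,
    # a (1/3, 2/3) mix between Straight and Swerve.
    return random.choices(["Straight", "Swerve"], weights=[1, 2])[0]
```

Against a copy, the pair of calls `delegate(0, s, s, b)` and `delegate(1, s, s, b)` always lands on one of the two frontier outcomes, whichever way the bit falls.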
But in the context of “delegating your decision to a computer program”, one reasonable starting BATNA might be “what would all delegates do if they couldn’t read each other’s source code?” A reasonable decision theory wouldn’t give in to inappropriate threats, and this removes the incentive for other decision theories to make such threats against us in the first place. In the case of Chicken, the closed-source answer might be something like the mixed strategy we mentioned earlier: a (1⁄3, 2⁄3) mixture between Straight and Swerve.
Any logical negotiation needs to improve on this baseline. This can make it a lot easier for our decision theory to resist threats. For example, in the next post, AliceBot can spin up an instance to negotiate with BobBot, and basically ignore the content of that negotiation. Negotiator AliceBot can credibly say to BobBot “look, regardless of what you threaten in this negotiation, take a look at my code. Implementer AliceBot won’t implement any policy that’s worse than the BATNA defined at that level.” And this extends recursively throughout the network, for example if they perform multiple rounds of negotiation.
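The implementer-side safeguard can be sketched as a simple filter. All names and values here are illustrative (the 2⁄3 fallback value is the closed-source mixed-strategy payoff from above): whatever policy the negotiator instance returns, the implementer only adopts it if it is at least as good as the BATNA.

```python
BATNA_VALUE = 2 / 3  # value of the closed-source mixed-strategy fallback

def implement(negotiated_policy: str, expected_value: float) -> str:
    """Adopt the negotiated policy only if it beats the fallback."""
    if expected_value >= BATNA_VALUE:
        return negotiated_policy
    return "fallback-to-BATNA"

print(implement("fair-correlated-mix", 1.5))      # adopted: 1.5 >= 2/3
print(implement("threat-extracted-policy", 0.1))  # rejected: worse than BATNA
```

Because the filter sits at the implementation level, nothing said during negotiation can push the final policy below the baseline, which is what lets the negotiator safely ignore threats.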