Payorian cooperation is easy with Kripke frames
The context is MIRI’s twist on Axelrod’s Prisoner’s Dilemma tournament. Axelrod’s competitors were programs, facing each other in an iterated Prisoner’s Dilemma. MIRI’s tournament is a one-shot Prisoner’s Dilemma, but the programs get to read their opponent’s code. Or, rather, a description of the behavior of the code in Gödel-Löb provability logic, which turns out to be enough to determine their behavior in the setup.
One fun result, right in the beginning of the paper, is about a program, FairBot, whose behavior is specified by “I’ll cooperate with you if you (provably) cooperate with me”. Despite the appearance of circularity, FairBot cooperates with itself. The proof involves Löb’s theorem, so we call this Löbian cooperation.
Andrew Critch has suggested another way of proving self-cooperation. Instead of Löb’s theorem, we use what he calls “Payor’s lemma”. It suggests a different way of defining a FairBot, something more like “If, when I hypothetically cooperate with you, you would cooperate with me, then I really will cooperate.”
This post is my attempt to explain why I think this approach is more promising, or at least why I like it more.
When thinking through these kinds of reflection problems with modal logic, I use Kripke frames, which are directed graphs of possible worlds. My fantasy is that when we’ve figured out this logical decision theory stuff, the reasoning will involve Kripke frames that are intuitively interpretable as something like a game tree, or paths through a game tree. Like, “if I do this, he’ll do that, because he knows I would do the other thing. But if I do that...”
With Löbian cooperation, I got nowhere with this. But Critch’s post was the first glimmer of the fulfillment of my fantasy. The only way to communicate to you what I see in it, though, is to walk you through a proof of self-cooperation with Kripke frames.
Löbian FairBot
First, reviewing the Löbian FairBot and Löbian cooperation, without Kripke frames.
We consider a program that takes another program as input, and outputs True to cooperate with it or False to defect.
So, $P(X)$ is True when program $P$ cooperates against program $X$, and False when it defects.

For FairBot, we have

$$\mathrm{FB}(X) \leftrightarrow \Box\, X(\mathrm{FB}).$$

That is, FairBot cooperates if and only if its opponent provably cooperates.
For an example of Löbian cooperation, consider FairBot playing against itself.
Replacing the opponent $X$ with FairBot itself, we get

$$\mathrm{FB}(\mathrm{FB}) \leftrightarrow \Box\, \mathrm{FB}(\mathrm{FB}).$$

This is a Henkin sentence, that is, a sentence asserting its own provability: writing $C$ for $\mathrm{FB}(\mathrm{FB})$, it has the shape $C \leftrightarrow \Box C$.
Such sentences are in fact provable. This was established by Löb, in the paper that proved what we now call Löb’s theorem (some history from SEP). It gets more interesting when considering two FairBots with distinct code, which turn out to also cooperate with each other, but let’s move on to another kind of FairBot.
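For reference, here are the standard statements (my paraphrase, not quoted from the paper):

```latex
% Löb's theorem, external and internalized forms:
\vdash \Box P \to P \;\Longrightarrow\; \vdash P
\qquad\qquad
\vdash \Box(\Box P \to P) \to \Box P
% A Henkin sentence asserts its own provability:
\vdash C \leftrightarrow \Box C
% The right-to-left direction gives \vdash \Box C \to C, so the external
% form of Löb's theorem yields \vdash C: FairBot provably cooperates.
```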
Payorian FairBot
Let’s call what we were discussing before a “Löbian FairBot”, and next, we’ll consider a “Payorian FairBot”. I still want to call it a “FairBot”, since I think the following does express a kind of fairness:

$$\mathrm{FB}(X) \leftrightarrow \Box\big(\Box\,\mathrm{FB}(X) \to X(\mathrm{FB})\big).$$

Actually, I’d rather drop all these parentheses, and say that $A \leftrightarrow \Box(\Box A \to B)$, where $A$ abbreviates my cooperation $\mathrm{FB}(X)$ and $B$ abbreviates my opponent’s cooperation $X(\mathrm{FB})$.
Instead of checking whether the opponent cooperates, we check whether FairBot’s provable cooperation implies the opponent cooperates. But be careful about “implies”; it’s material implication, and doesn’t denote real dependence. For example, if the opponent unconditionally cooperates, then this implication is trivially true—we’ll cover this case in more detail later.
The Payorian FairBot looks more complicated. But reasoning about it is much simpler, at least with Kripke frames.
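For reference, Payor’s lemma itself, with its short syntactic proof (standard material, stated in my own words rather than quoted from Critch’s post):

```latex
\textbf{Payor's lemma.}\quad
\text{If } \vdash \Box(\Box x \to x) \to x, \text{ then } \vdash x.
% Proof:
% 1. x -> (box x -> x)            tautology
% 2. box x -> box(box x -> x)     from 1 by necessitation and distribution
% 3. box x -> x                   from 2 and the hypothesis
% 4. box(box x -> x)              from 3 by necessitation
% 5. x                            from 4 and the hypothesis
```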
Kripke frames
In fact, it’s easy enough that I think it’s a suitable first exercise in using Kripke frames. So I hope you’ll keep reading even if you’ve never used them before, and this post can serve as a tutorial. To sell them a bit, I think of them as a tree data structure for tracking quotation levels. In the prisoner’s dilemma tournament, these are base reality, you imagining your opponent, you imagining your opponent imagining you, and so on. They end up at different depths of the tree.
The way they come up in the prisoner’s dilemma tournament is that we resolve matches using provability logic. But instead of writing proofs, we’re going to use the semantics. If you don’t know what I mean by semantics, I hope it helps to say that it’s analogous to resolving questions in propositional logic using truth tables. The semantics of propositional logic are that a sentence can be satisfied or not by a row in a truth table. Or, put another way, a row in a truth table can be a model for a sentence in propositional logic. So what does a model for a sentence of modal logic look like? (Really we’ll be applying this to a collection of sentences, but they can be combined to a single sentence by conjunction).
We’ll start with a Kripke frame, which is a set of possible worlds and a relation on that set. I had always heard this called an “accessibility” relation, but the Arbital page calls it a “visibility” relation, so I’ll stick with that.
In the special case of provability logic, the visibility relation must be transitive. (It must also have no infinite chains of worlds each seeing the next, though we won’t need that fact here.)
When I draw a Kripke frame, I’ll draw the worlds as nodes, with an arrow from each world to the worlds it can see.
A Kripke model is a Kripke frame, plus an assignment of which sentences are true at each possible world. Our strategy for disproving a sentence of provability logic will be to show that it cannot be satisfied by any Kripke model. (And to prove a sentence, we’ll disprove the negation.)
The sentences that we assign to the possible worlds are not just “elementary” sentences of propositional logic, but can include modal sentences (those including the provability operator $\Box$).
Worlds are not necessarily visible from themselves, because this is provability logic so we can’t in general go from $\Box P$ to $P$.
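To make this concrete, here’s a minimal sketch in Python (my own encoding — the post doesn’t come with code) of evaluating modal sentences at the worlds of a Kripke model; note the last lines, where $\Box P$ holds at a world without $P$ holding there:

```python
# Formulas as nested tuples: ("var", "P"), ("not", f), ("imp", f, g), ("box", f).
# A Kripke model: a visibility map (world -> set of visible worlds)
# and a valuation (world -> set of variables true there).
def holds(f, w, sees, val):
    op = f[0]
    if op == "var":
        return f[1] in val[w]
    if op == "not":
        return not holds(f[1], w, sees, val)
    if op == "imp":
        return (not holds(f[1], w, sees, val)) or holds(f[2], w, sees, val)
    if op == "box":
        # box f holds at w iff f holds at every world w can see
        return all(holds(f[1], v, sees, val) for v in sees[w])

# Root world 0 sees world 1; world 1 sees nothing. P is false everywhere.
sees = {0: {1}, 1: set()}
val = {0: set(), 1: set()}
boxP = ("box", ("var", "P"))

print(holds(boxP, 0, sees, val))          # False: world 1 is visible and P fails there
print(holds(boxP, 1, sees, val))          # True: vacuously, world 1 sees no worlds
print(holds(("var", "P"), 1, sees, val))  # False: so box P does not give P
```

The vacuous truth of every $\Box\varphi$ at a world with no visible worlds is the semantic shadow of the fact that $\Box P \to P$ is not a theorem.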
Well, that’s not the best beginner’s guide to Kripke frames. If you want a more complete explanation, check out the Arbital page or the SEP. But at least I’ve repeated the facts that I’m going to actually use, so let’s move on to applying these ideas to the Payorian FairBot.
Proving cooperation with Kripke frames, and CooperateBot
Our strategy for proving cooperation is to assume FairBot defects, write down what conditions this puts on a Kripke model, and look for a contradiction. A contradiction means no Kripke model can satisfy the premises, so FairBot must cooperate.
To assume FairBot defects means to assume $\neg A$ (writing $A$ for FairBot’s cooperation and $B$ for its opponent’s), which by FairBot’s defining biconditional means $\neg\Box(\Box A \to B)$.

Or equivalently, $\Diamond(\Box A \wedge \neg B)$.

A statement of possibility.

We’ll call the world in which this statement — $\Box A \wedge \neg B$ — is true $w_1$, and the root world, where our premises live, $w_0$.
We satisfied the proposition that this stuff is possible at the root world by making it true in a visible world.
Anyway, now we can prove that FairBot cooperates with CooperateBot.
CooperateBot always cooperates, meaning that $B$ is simply true, no matter what its opponent does.
This behavior justifies the name “FairBot”. It plays fair by cooperating with CooperateBot, rather than playing to win by exploiting it.
Let’s write this out more carefully, putting three propositions at the root. First, the behavior of our FairBot: $A \leftrightarrow \Box(\Box A \to B)$.

Next, the behavior of its opponent CooperateBot, but wrapped as a statement of provability: $\Box B$.

Finally, our assumption that FairBot defects: $\neg A$.
I hope you’ll agree that the first statement and the third together are equivalent to what we were putting at the root before.
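Spelling out that equivalence, writing $A$ for FairBot’s cooperation and $B$ for its opponent’s:

```latex
A \leftrightarrow \Box(\Box A \to B), \;\; \neg A
\;\;\Longrightarrow\;\;
\neg\Box(\Box A \to B)
\;\;\Longleftrightarrow\;\;
\Diamond(\Box A \wedge \neg B)
```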
So to model these statements, we would need a Kripke model where they’re all true at the root world $w_0$.

Since this implies the possibility statement from before, we put in the implied visible world $w_1$, where $\Box A$ and $\neg B$ hold.

Since we have $\Box B$ at the root, $B$ must be true at every world visible from $w_0$ — including $w_1$, where we’ve already placed $\neg B$.
There’s our contradiction. No Kripke model can satisfy the requirements.
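The argument rules out Kripke models of every size. As an illustrative sanity check (my own encoding — none of this code is from the post), we can brute-force all small transitive Kripke models and confirm that none satisfies the three root premises:

```python
from itertools import product

# Formulas as nested tuples; a model is a transitive, irreflexive visibility
# relation plus a valuation (world -> string naming the true variables).
def holds(f, w, sees, val):
    op = f[0]
    if op == "var": return f[1] in val[w]
    if op == "not": return not holds(f[1], w, sees, val)
    if op == "imp": return (not holds(f[1], w, sees, val)) or holds(f[2], w, sees, val)
    if op == "iff": return holds(f[1], w, sees, val) == holds(f[2], w, sees, val)
    if op == "box": return all(holds(f[1], v, sees, val) for v in sees[w])

A, B = ("var", "A"), ("var", "B")
premises = [
    ("iff", A, ("box", ("imp", ("box", A), B))),  # A <-> box(box A -> B): our FairBot
    ("box", B),                                   # box B: CooperateBot provably cooperates
    ("not", A),                                   # not A: assume FairBot defects
]

def satisfiable_at_root(n):
    """Does any n-world model make all premises true at world 0?"""
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    for bits in product([0, 1], repeat=len(pairs)):
        rel = {p for p, b in zip(pairs, bits) if b}
        if any((i, l) not in rel for (i, j) in rel for (k, l) in rel if j == k):
            continue  # keep only transitive relations
        sees = {i: {j for (i2, j) in rel if i2 == i} for i in range(n)}
        for vals in product(["", "A", "B", "AB"], repeat=n):
            val = dict(enumerate(vals))
            if all(holds(p, 0, sees, val) for p in premises):
                return True
    return False

print([satisfiable_at_root(n) for n in (1, 2, 3)])  # [False, False, False]
```

Finding nothing up to three worlds is of course weaker than the proof, but it’s a pleasant way to play with the semantics.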
One Payorian FairBot will cooperate with another
Now, let’s assume both players are FairBots. Before we play them against each other, let’s just write down what their non-cooperation implies for a Kripke model, as before: two separate Kripke frames, side by side. One has a root where $A \leftrightarrow \Box(\Box A \to B)$ and $\neg A$ hold, with a visible world satisfying $\Box A \wedge \neg B$; the other is its mirror image, with the roles of $A$ and $B$ swapped.
Now, considering a FairBot whose opponent is a FairBot. It’s going to look like grafting the second Kripke frame above onto the first. To set it up, as in the previous section, we put three premises at the root world: the behavior of our FairBot, the behavior of its opponent (wrapped in a provability modality), and the assumption that our FairBot defects.
Contradiction at $w_2$. Unfolding: our FairBot’s premises put a world $w_1$ with $\Box A \wedge \neg B$ in view of the root $w_0$. The opponent’s biconditional $B \leftrightarrow \Box(\Box B \to A)$, provable at the root, holds at $w_1$, so $\neg B$ there gives $\Diamond(\Box B \wedge \neg A)$: a world $w_2$, visible from $w_1$, with $\Box B \wedge \neg A$. But $\Box A$ at $w_1$ forces $A$ at $w_2$.
But let’s slow down and look at what happened.
At $w_1$, my simulated opponent sees a proof of my cooperation ($\Box A$) and defects anyway ($\neg B$) — it’s as if I’m simulating you playing against CooperateBot. At $w_2$, your simulation of me sees a proof of your cooperation ($\Box B$) and defects anyway ($\neg A$), and that’s where the requirements collide.
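As with the CooperateBot match, a brute-force search over small transitive Kripke models (my own encoding, not the post’s) finds no model of the two-FairBot premises — my biconditional, the opponent’s biconditional wrapped in a box, and my assumed defection:

```python
from itertools import product

# Same semantics as before: tuple-encoded formulas over transitive frames.
def holds(f, w, sees, val):
    op = f[0]
    if op == "var": return f[1] in val[w]
    if op == "not": return not holds(f[1], w, sees, val)
    if op == "imp": return (not holds(f[1], w, sees, val)) or holds(f[2], w, sees, val)
    if op == "iff": return holds(f[1], w, sees, val) == holds(f[2], w, sees, val)
    if op == "box": return all(holds(f[1], v, sees, val) for v in sees[w])

A, B = ("var", "A"), ("var", "B")
premises = [
    ("iff", A, ("box", ("imp", ("box", A), B))),            # my FairBot
    ("box", ("iff", B, ("box", ("imp", ("box", B), A)))),   # opponent FairBot, provably
    ("not", A),                                             # assume I defect
]

def satisfiable_at_root(n):
    """Does any n-world transitive model make all premises true at world 0?"""
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    for bits in product([0, 1], repeat=len(pairs)):
        rel = {p for p, b in zip(pairs, bits) if b}
        if any((i, l) not in rel for (i, j) in rel for (k, l) in rel if j == k):
            continue  # keep only transitive relations
        sees = {i: {j for (i2, j) in rel if i2 == i} for i in range(n)}
        for vals in product(["", "A", "B", "AB"], repeat=n):
            val = dict(enumerate(vals))
            if all(holds(p, 0, sees, val) for p in premises):
                return True
    return False

print([satisfiable_at_root(n) for n in (1, 2, 3)])  # [False, False, False]
```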
Why this procedure feels right to me
I think of the root world $w_0$ as base reality: me, deciding what to do.

I imagine you; that’s $w_1$. I imagine you imagining me; that’s $w_2$.
This feels like the right depth to me.
I imagine you imagining me, and we’re done.
That feels like what I actually do when I’m thinking through Newcomb-like problems.
In Newcomb’s problem, I think about Omega’s simulation of me, which I suppose is in $w_1$.
The sense in which this is simpler than Löbian cooperation
But I feel I haven’t really indicated the appeal until I contrast this to the complexity I encountered with Löbian cooperation. Before I knew about Payorian cooperation, I tried putting these premises at the root world: the Henkin biconditional $C \leftrightarrow \Box C$, its provable version $\Box(C \leftrightarrow \Box C)$, the assumption $\neg C$, and an auxiliary Löb sentence $\Psi \leftrightarrow (\Box\Psi \to C)$ with its provable version.

I didn’t like having to introduce the Löb sentence, an auxiliary sentence with no direct meaning in the game.
In Critch’s post, he also notes that a Payorian approach avoids the auxiliary sentence. And he argues that it is simpler, by counting lines of his proof in a “natural deduction”-type system.
The Kripke frame approach suggests a different way of quantifying complexity: how deep in the Kripke frame do you have to go, before you rule out a consistent Kripke model? For Payorian cooperation, you go two worlds deep, which feels right. For Löbian cooperation, you go four worlds deep, which feels wrong.
I titled this post “Payorian cooperation is easy with Kripke frames”. You could also say that proving cooperation with Kripke frames is easy, once you use Payorian cooperation. On a personal level, that’s the significance. When I first started doing these kinds of proofs a few years ago, I was optimistic about doing logical decision theory this way. I thought proving FairBot cooperates with itself would be the tutorial level, but with a Löbian FairBot, it felt more like a final boss. Stumbling across Critch’s post has restored my confidence in the potential of this approach.
I think this can also be used to make a smarter version of PrudentBot. If I remember right, PrudentBot is defined as

$$\mathrm{PB}(X) \leftrightarrow \Box\big(X(\mathrm{PB}) \wedge \Diamond\,\neg X(\mathrm{DB})\big),$$

that is, it cooperates if and only if it can prove that (1) its opponent cooperates with it, and (2) there is some world where its opponent defects against DefectBot.
To me, the part about choosing DefectBot in particular to compare with seems a little hacky. What you’re really interested in is seeing whether your opponent will defect against you, even if you defect against them. Thus, I propose SharkBot, which I label $\mathrm{SB}$. SharkBot is defined as

$$\mathrm{SB}(X) \leftrightarrow \Box(\Box A \to B) \wedge \Diamond(\Box\neg A \to \neg B),$$

where $A$ is SharkBot’s cooperation $\mathrm{SB}(X)$ and $B$ is its opponent’s $X(\mathrm{SB})$.
It’s like your version of FairBot, but adds another condition, that it’s possible that its opponent’s cooperation leads to the possibility of mutual cooperation. Equivalently, it’s possible that provable defection will lead to the opponent defecting. Thus, SharkBot exploits opponents that don’t change their behavior even when they should know better.
I’m pretty sure this cooperates with itself and FairBot, but defects against CooperateBot and a larger share of stupid agents than PrudentBot does.
(EDIT: focusing here on PrudentBot. Fwiw I like the idea you have in SharkBot of using the weaker diamond when evaluating whether to defect! I’m less well equipped to analyze it at this time, still just grappling with how to handle diamonds at all in my usual proof-search ontology.)
I’m less familiar with the diamond modality; but if I can correctly translate “exists some world that satisfies X” as “not every world satisfies not-X”, we get the following: $\mathrm{PB}(X) \leftrightarrow \Box\big(X(\mathrm{PB}) \wedge \neg\Box\, X(\mathrm{DB})\big)$.
This will run into some trouble in the unbounded-proof-length provability logic model, since generally you can’t prove in some fixed length that there will be no proofs of any length of $X(\mathrm{DB})$, on pain of unsoundness via a Löbian fixed point. So the version in the paper runs (I believe) as follows: $\mathrm{PB}(X) \leftrightarrow \Box\big(X(\mathrm{PB}) \wedge \Box_1\,\neg X(\mathrm{DB})\big)$, where $\Box_1$ is provability in PA plus the consistency of PA.
This is handling the part where $X$ might be FairBot and require “proving that there are no proofs” by checking the defection part under the assumption of one level of soundness. (I remember this gotcha because when I try to write down the PrudentBot that lives in my heart I end up with something closer to your version, but then the paper was doing this carefully different thing...)
Insofar as it’s really the proof lengths fighting that is the problem, I think something like the following might just work instead:
Anyway this is all to say that:
I agree that the original PrudentBot definition seems hacky, both in the DefectBot aspect (we can replace this with “doesn’t cooperate with every bot” if that helps), and also in the assuming exactly one level of soundness regard. It is ofc a nice POC given that it does manage to cooperate with itself with this machinery.
In terms of caching the diamond definitions out into provability logic some care seems to be required, and I’m interested in a well-behaved translation. The “not box not” translation doesn’t actually stay very faithful to the intent.
It might be that messing around with proof bounds is sufficient to get there, so you don’t end up with an inner proof length being able to diagonalize an outer proof length.
If I’m the Payorian FairBot and you’re my opponent, I simulate you under the assumption that there’s a proof I cooperate; this is the $\Box A$ inside my condition $\Box(\Box A \to B)$.
So, to stay in the Payorian spirit, I could also simulate you under the assumption that there’s a proof I defect.
That makes me think that a Payorian PrudentBot would satisfy

$$\mathrm{PB}(X) \leftrightarrow \Box\big(\Box\,\mathrm{PB}(X) \to X(\mathrm{PB})\big) \wedge \Box\big(\Box\,\neg\mathrm{PB}(X) \to \neg X(\mathrm{PB})\big).$$

Since there’s no third program involved, I can just define $A = \mathrm{PB}(X)$ and $B = X(\mathrm{PB})$ to clean this up a bit:

$$A \leftrightarrow \Box(\Box A \to B) \wedge \Box(\Box\neg A \to \neg B).$$

The reason this looks right to me is because when we start drawing Kripke frames it does look like I (PrudentBot) am doing two simulations of you (my opponent), to see what you do assuming my provable cooperation and defection. Let’s assume $\neg A$.

Then $\Diamond(\Box A \wedge \neg B) \vee \Diamond(\Box\neg A \wedge B)$ (and yes, $\neg\Box\varphi$ is $\Diamond\neg\varphi$),
I’ll cooperate if this can’t be satisfied, meaning we need to show contradictions in two different sets of requirements on the Kripke model. The first is the one from my post: a visible world $w_1$ with $\Box A$ and $\neg B$.

That’s the one where I simulate you assuming there’s a proof that I cooperate. We rule this out if, in that world, you cooperate (contradicting the $\neg B$ in $w_1$).

The other: a visible world $w_1$ with $\Box\neg A$ and $B$.

This is the one where I simulate you assuming there’s a proof I defect. We rule this one out if, in that world, you defect (contradicting the $B$ in $w_1$).
So, if you cooperate in the simulation where there’s a proof I cooperate, and defect in the simulation where there’s a proof I defect, we rule these both out. That means we can’t satisfy the conditions imposed by my defection, meaning I cooperate.
That sounds prudent.
As for the Löbian PrudentBot testing its opponent on DefectBot, and how that seems arbitrary: I mentioned in the FairBot vs FairBot section of my post that the simulated world with $\Box A$ looks suspiciously like I’m simulating you facing CooperateBot.
This was troubling to me because I shouldn’t care whether my opponent exploits CooperateBot.
Similarly, in this proposed Payorian PrudentBot, the simulation under $\Box\neg A$ looks suspiciously like simulating my opponent against DefectBot.
However, I think it’s not really? Although that brings to mind something missing from my post. In my post, when I’m simulating you, I never actually give you the full description of myself. I think a general version of this method requires an additional premise, repeating the description of myself but wrapped in a provability modality so it’s accessible in the simulations. Then my simulated opponent in $w_1$ doesn’t see a CooperateBot; rather, it sees whatever bot I am, but we assume there’s a proof I cooperate.
Well… I hope this is different than just running my opponent against a CooperateBot, but I’m worried.
I’m not really sure how any of this will turn out, and I don’t have time to try it on paper right now or learn to use any of the automated tools. But in any case I think it’s good to record what seems like the obvious Payorian PrudentBot, at least to me, even if, as before, the solution ends up non-obvious.
I was talking with Joseph, and I think I like his SharkBot more because it fails more gracefully. Suppose (1) “proof” has an upper bound in computation cycles, and (2) people occasionally make mistakes in their logic. A good prover might spend more computation cycles in error correction. What happens if they do not have enough time to prove cooperation leads to cooperation, or defection leads to defection?
If Joseph’s bot stalls out on the second half of its computation, it concludes, “I couldn’t prove they would cooperate if I’m caught (provably) defecting,” and cooperates. If your bot stalls out on the second half of its computation, it concludes, “I couldn’t prove being caught (provably) defecting would lead to them also defecting,” and thinks it can get away with defection.
Another way of putting it: dumber bots are more likely to think they can get away with defection when they really can’t, and defect against bots smarter than them. If a bot is going to try to take advantage of rocks [1] , it had better make sure it is actually playing against a rock, and not just making a stupid mistake that hurts everyone.
Also, as an aside, I think making mistakes (logical bit flips some percent of the time) naturally penalizes high-complexity policies. This is why you might expect societies to begin with mostly cooperate/defect bots, then transition to citizen/police bots, and slowly build complexity where each individual’s policy is relatively simple, but society as a whole gets more complex interactions. I think this would be an interesting area of research.
From the phrase, “how do you play a Prisoner’s Dilemma against a rock?” Rocks are bots that cooperate even if it is proven you are going to defect.
Thanks, this is a solid point that choosing the defect-cooperate outcome should really be based on constructive knowledge, and the fallback shouldn’t be “definitely defect!” if you can’t obtain that.
So that makes me think even more that $\neg\Box(\Box\neg A \to B)$ is the sort of term that is legit in a PrudentBot, since “we can’t find a proof that if we provably defect then they cooperate” is what we actually wanted. And sure, the naive translation into provability logic won’t like this, and I think this means we should look for a nicer translation, probably starting with getting a bounded proof-search bot correct.
I’m wondering about what the SharkBot term is doing, compared to my (probably broken) idea of using $\neg\Box(\Box\neg A \to B)$. If I convert this one to diamonds I get: $\Diamond(\Box\neg A \wedge \neg B)$.
This seems potentially more restrictive than it needs to be? Well, restrictive can be good here, since what we are targeting is defecting as often as we can while guaranteeing a defect-cooperate outcome. But also as pointed out elsethread this PrudentBot approach may have trouble cooperating with itself.
Anyway this leaves me more ready to unpack the SharkBot condition, which is $\Diamond(\Box\neg A \to \neg B)$, i.e. $\Diamond(\Diamond A \vee \neg B)$. This is a bit different than what I landed at, reading “it’s possible that my opponent defects or it is possible that I cooperate”.

Converting the SharkBot condition back to squares I get: $\neg\Box(\Box\neg A \wedge B)$.

Ah okay, so I think this is caching out the version that’s like “I defect if it is stable that my opponent cooperates while I provably defect”. (I think this may be equivalent in strength to my condition under Löb’s theorem?)
Btw on the theory that “$\Box\neg A$” was a bit of a mistake relative to “$\neg A$” as a basic approach, we could simplify a bunch of the language above to “it’s possible that my opponent defects while I defect” (my one) or “it’s possible that my opponent defects or I defect” (SharkBot). I’m taking this as some evidence that it would be better to go with the $\neg A$ forms.
I don’t think $\Box\neg A$ was a mistake.

This gets at the deeper issue: how do we represent an action having a consequence? I don’t think it’s $\neg A \to \neg B$.

In the Prisoner’s Dilemma tournament, you don’t respond to my action, you respond to what you can prove about my action, so it’s really $\Box\neg A$ that’s important.
But I think that even in single player games, we should think of the environment as containing an implementation of us which takes an action when it can prove it’s the “right” action, so even then it’s provability that has consequences.
This may seem unnatural, but that’s just because considered decisions are unnatural, and mostly we just do stuff. But I think it’s a good model of thinking with a notebook open and plenty of time. That is, our model is that the consequences in the environment trace back to the existence of a proof in Peano Arithmetic.
My only reservation really is that I think maybe it should be $\Box_1$, where $\Box_1$ means provable with a consistency assumption.

Otherwise, how do I know that my implementation in the environment (for a single-player game) or my opponent’s simulation of me (in the Prisoner’s Dilemma) will actually find a proof of $A$ before a proof of $\neg A$?

Even if $A$ is provable, the implementation or simulation could get to a contradiction first.
(Thanks for the post btw! My comments on this comment below.)
This PrudentBot def does feel in-spirit to me. I also agree with your analysis that it doesn’t fall for the pitfall of “I assumed in my hypothetical that you thought I would unconditionally cooperate with you, but then no fair you defected on me in my hypothetical! I defect!”
For instance this happens if you try, as I was inclined to try, . This quickly becomes “false” i.e. “defect” if you have a lying around.
With your definition, we have , and assuming we have a around I think that simplifies to . (I’m not sure I have that right!)
That’s… better behaved! It will run into the same Löbian difficulties as usual if we directly translate this as a provability bot. So imo we still need some better answer as to how we should interpret these as programs that more faithfully represent what we want in terms of their reasoning.
Also fwiw, I run with the model that the simulated opponent has access to both a description of your full behaviour and a shortcut proof that you cooperate with just them. (In the cooperation branch hypothetical.) So when your simulations have the cooperation proof in scope, you’re just saying that the behaviour is equivalent to CooperateBot in this very matchup. And it happens that this is often a sufficient condition for cooperation. (And if your opponent runs you on a different bot like DefectBot, the cooperation proof doesn’t apply there.)
So there’s some question here like “well what’s the point of simulating under the assumption of the narrow cooperation proof if we’re keeping the full description as a fallback?”
Part of what your post points at is an answer which is that, well, it works out more cleanly if you do! Specifically that reasoning is easier if, for the purposes of deciding to cooperate, you assume the decision is made and check that it’s stable/good.
As for why that is the case, I have a more philosophical take that it’s about encoding choice, and the better mechanics we see are downstream of encoding this better. I’ll go into more detail in this collapsed section (you may well wish to gather your own thoughts first).
on encoding choice
We can think of the bot as containing some mechanisms that lead to certain actions, with conditions for firing.
The FairBot variant described in this post is a simple instance, with a default action of defection and a single mechanism that leads to cooperation when it fires.
I have some idea that the choice-y mechanisms are of the form “if I activate, will the outcome satisfy some property? if so I activate”.
So for a given mechanism we can have a widget $a$ with a goal target $g$. We want a way of saying “if $a$ were to fire, this would lead to $g$” as our condition for $a$.
Straight up trying $a \leftrightarrow (a \to g)$ won’t work, since the self-reference isn’t under any quotation, and you won’t be able to form a statement like this. This was omitting the part where $a$ is using some model of consequences to make its choice.

So next up is $a \leftrightarrow \Box(a \to g)$, in which $a$ fires if our modelling shows that if $a$ fires this will lead to $g$. You can in fact form a statement like this here!
The other option is $a \leftrightarrow \Box(\Box a \to g)$; this doesn’t assume “$a$ fires” but rather “someone has a model that $a$ fires”, and is actually a bit more brittle. The reason this form can be preferable in the modal logic bots is that it sidesteps the complexity of checking that your opponent can tell that $a$ has fired. But I think this may be a bit of a hack relative to the $\Box(a \to g)$ version.
My reasoning here suggests that if you have multiple conditions for action, you may factor out mechanisms $a_1$/$a_2$/$a_3$ that are each self-referential, and build your high level agent as trying each in sequence. I haven’t fully cached out yet what this looks like, and whether it e.g. gives you a meaningfully different PrudentBot.
I note that this picture doesn’t provide a convenient answer for the problem where in provability land your action “defect” may cause someone to need to prove that there is no proof of cooperation. But as I guess I keep mentioning I view this as a bug in the provability encoding.