Four levels of understanding decision theory

[Update 2023-06-09: See this comment for some caveats / motivations / context about this post.]

There are multiple levels on which an agent can understand and implement a particular decision theory. This post describes a taxonomy of understanding required for robust cooperation in the Prisoner’s Dilemma.

The main point I’m trying to convey in this post is that the benefits of implementing and following a particular decision theory don’t necessarily come from merely understanding the theory, even if your understanding is deep, precise, and correct.

For example, just because you correctly understand logical decision theories and their implications (e.g. why LDT agents might cooperate with each other in the prisoner’s dilemma under certain conditions), that doesn’t mean that you yourself are an LDT agent, even if you want to be.

Actually implementing a decision theory in real situations often requires hard cognitive work to correctly model your counterparties, as well as the ability to make your own mind and decision process legible and model-able enough for your counterparties. You must have some way of making your outwardly-professed intentions highly correlated with your actual behavior and decision process. The difficulty of creating a robust, outwardly visible correlation between your behavior prior to making a decision and your actual decision varies with the kind of agent you are, and with the capabilities of the agent you are trying to cooperate with or defect against. For a human cooperating with another human, it might be very difficult; for agents whose decision-process source code is easily accessible, it may be relatively easy.

The rest of this post describes different levels of comprehension an agent might have about decision theory when applied to the Prisoner’s Dilemma. Achieving higher levels in the taxonomy likely requires the agent to be relatively more capable and coherent than agents that achieve only lower levels. For the final level, the agent must have some degree of control over its own internal thought patterns and decision processes.


Note: the taxonomy in this post is kind of a fake framework, but it might be useful in clearing up some common misconceptions (e.g. ones that lead to worries that this post is trying to address), and explaining why Decision theory does not imply that we get to have nice things.

Level 1: Understanding that good things are good, and bad things are bad

At this level, you are smart enough to comprehend the payoff matrix for the Prisoner’s Dilemma, and what it means:

          1: C        1: D
2: C      (3, 3)      (5, 0)
2: D      (0, 5)      (2, 2)



If you’re player 1, you recognize that you would prefer (D, C) > (C, C) > (D, D) > (C, D).

You may or may not understand that (D, C) might be unrealistic or hard to obtain in many situations (a fabricated option, potentially). You recognize that if (C, C) and (D, D) were the only options on the table, (C, C) is preferable to (D, D).
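
For concreteness, here is a minimal Python sketch (mine, not part of the original argument) of what comprehending the payoff matrix amounts to: encode the table above and read off player 1’s preference ordering over the four outcomes.

```python
# The payoff matrix above, keyed by (player 1's action, player 2's action);
# each value is (payoff to player 1, payoff to player 2).
payoffs = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (2, 2),
}

# Player 1's preference ordering over outcomes, best first.
ordering = sorted(payoffs, key=lambda cell: payoffs[cell][0], reverse=True)
print(" > ".join(map(str, ordering)))
# ('D', 'C') > ('C', 'C') > ('D', 'D') > ('C', 'D')
```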

Agents below this level of understanding don’t understand their own preferences on even a basic level—bad things can happen to them, and they might say “ouch”, but they’re often stepping on their own toes across time, or making decisions for plainly incoherent or wrong reasons, given their own professed or revealed preferences.

Level 2: An understanding of, and desire to avoid, the Nash equilibrium

At this level, you understand the payoff matrix in level 1, and also recognize the symmetry of the situation. You’d like to get the (D, C) outcome, but you recognize your opponent is trying for (C, D), and you see how this could be a problem.

You understand the concept of a Nash equilibrium, and see that (D, D) is the only such equilibrium in this game. You yearn to do Something Else Which is Not That, and you see that symmetry may play an important role. But you don’t necessarily know what role or how to formalize it. Perhaps you intuitively see why you would cooperate with an identical clone of yourself placed in identical decision-making circumstances.
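
To make the Nash-equilibrium claim concrete, here is a small brute-force check (again my own sketch, reusing the payoff dictionary from level 1): (D, D) is the only cell where neither player can gain by unilaterally switching their own action.

```python
# Brute-force check that (D, D) is the only pure-strategy Nash equilibrium
# of the payoff matrix above. Payoffs are (player 1, player 2).
payoffs = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (2, 2),
}

def is_nash(a1, a2):
    u1, u2 = payoffs[(a1, a2)]
    # A cell is a Nash equilibrium if neither player gains by unilaterally
    # switching their own action while the other player's action stays fixed.
    no_better_1 = all(payoffs[(alt, a2)][0] <= u1 for alt in "CD")
    no_better_2 = all(payoffs[(a1, alt)][1] <= u2 for alt in "CD")
    return no_better_1 and no_better_2

print([cell for cell in payoffs if is_nash(*cell)])  # [('D', 'D')]
```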

Level 3: Knowing and understanding formal theories for when and how to avoid Nash equilibria

You understand the kind of mathematics an agent must implement to achieve (C, C) robustly, and under which circumstances you can actually pull a fast one on your counterparty and get (D, C) with high probability.

You understand precisely why PrudentBot cooperates with FairBot but defects against CooperateBot. You see the advantage and desirability of generalizing the concept of pre-commitment, and of making each decision by choosing the optimal decision-making algorithm over all possible world states. But you don’t necessarily know how to implement this kind of decision process (which may be provably intractable or undecidable in general), even in very limited circumstances.
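
Below is a toy sketch of the CooperateBot/FairBot/PrudentBot example. To be clear, this is not the provability-logic ("modal combat") machinery from the robust-cooperation literature: proof search is replaced by letting each bot directly query what its opponent would play, and the resulting mutual recursion is cut off by a crude Löbian stand-in (a query that is already in progress is optimistically assumed to come back "cooperate"). The bot and helper names here are my own.

```python
# Toy "program equilibrium" sketch. This is NOT the provability-logic
# formalism behind the real FairBot/PrudentBot results; proof search is
# replaced by direct simulation, and mutual recursion is cut off by a crude
# Löbian stand-in: a query that is already in progress is assumed to return C.

C, D = "C", "D"

def make_query():
    in_progress = set()

    def query(bot, opponent):
        """What does `bot` play against `opponent`?"""
        key = (bot.__name__, opponent.__name__)
        if key in in_progress:
            return C  # crude stand-in for the Löbian shortcut
        in_progress.add(key)
        try:
            return bot(opponent, query)
        finally:
            in_progress.discard(key)

    return query

def cooperate_bot(opponent, query):
    return C  # cooperates unconditionally

def defect_bot(opponent, query):
    return D  # defects unconditionally

def fair_bot(opponent, query):
    # Cooperate iff the opponent cooperates with me.
    return C if query(opponent, fair_bot) == C else D

def prudent_bot(opponent, query):
    # Cooperate iff the opponent cooperates with me AND defects against DefectBot.
    if query(opponent, prudent_bot) == C and query(opponent, defect_bot) == D:
        return C
    return D

if __name__ == "__main__":
    bots = [cooperate_bot, defect_bot, fair_bot, prudent_bot]
    for row in bots:
        moves = [make_query()(row, col) for col in bots]
        print(f"{row.__name__:>13}: {moves}")
    # prudent_bot cooperates with fair_bot and prudent_bot,
    # and defects against cooperate_bot and defect_bot.
```

The "assume cooperation for pending queries" rule happens to reproduce the outcomes described above (FairBot and PrudentBot cooperate with each other, and PrudentBot defects against CooperateBot), but it is only a stand-in; it is not a sound substitute for the bounded proof search used in the actual formalism.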

Level 4: Actually avoiding (D, D) for decision theory reasons

You understand everything in level 3, and are capable of actually implementing such a decision theory, in at least some circumstances and with some counterparties. Your decisions are actually correlated with your and your counterparties’ understanding of decision theory and models of each other. The (perhaps literal) source code for your decision-making process is verifiably accessible to your counterparties. You may be able to modify your source at will, but your counterparties can verify that the process you use to actually make your decision is based only on some relatively simple algorithm which you expose. You (provably and verifiably to your counterparty) cooperate with PrudentBot and FairBot, and defect against CooperateBot.


Humans probably struggle to achieve this level, except in limited circumstances or by using specialized techniques. In a True Prisoner’s Dilemma between two humans, both might cooperate, and might profess or actually believe that their cooperation is a result of their mutual understanding of decision theory. But in many cases, they are actually cooperating for other reasons (e.g. valuing friendliness or a sense of honor).


Among humans, level 3 (understanding) is probably a prerequisite for level 4 (implementation). For AI systems, this may or may not be true for particular decisions, depending on the circumstances under which those decisions are made (for example, the programs in this paper are too simple to have an understanding of decision theory themselves, but under appropriate conditions, they may be sufficient for actually implementing a decision process correctly according to some formal decision theory).

Note that if you’re a level 1 agent or below playing against some other agent, the other agent will mostly cooperate or defect against you for non-decision-theoretic reasons, regardless of their own level of decision theory comprehension. If the other agent is confident that you’ll cooperate (for whatever reason) and they’re feeling nice or friendly towards you, or have some other kind of honor-value, they may cooperate. Otherwise, they’ll defect. Decision theory starts to become relevant when both agents are level 2 or above, but most outcomes in a PD are probably not functions of (only) the two agents’ decision theories until both agents are at level 4.

In sum: A human or AI system might have a deep, precise, and correct understanding of a decision theory, as well as a desire or preference to adhere to that decision theory, without actually being capable of the counterparty modeling, self-legibility, and other cognitive work necessary to implement that decision theory with any degree of faithfulness.