Diagonalization: A (slightly) more rigorous model of paranoia
In my post on Wednesday (Paranoia: A Beginner’s Guide), I talked at a high level about the experience of paranoia, and gave two models (the lemons market model and the OODA loop model) that try to get us a bit closer to understanding its nature and purpose.
I then made a big claim that went largely unargued in the post, that there are three kinds of strategies that make sense to pursue in adversarial information environments:
You blind yourself
You eliminate the sources of deception
You act unpredictably
Now, Unnamed brought up a very reasonable critique in the comments! Why would there be exactly three strategies that make sense? How can we have any confidence that there isn’t a 4th kind of strategy that works?
And, in reality, the space of strategies is huge! Many of the most effective strategies (like building networks of trust, hiring independent auditors, performing randomized experiments, and “getting better at figuring out the truth on your own”) don’t neatly fit into the categories in that post. Maybe they can somehow be forced into this ontology, but IMO they are not a great fit.
But I argue that there is a semi-formal model in which this set of three strategies fully covers the space of possible actions, and that, as such, decomposing the space of strategies into these three categories is more natural than just “I pulled these three strategies out of my bag and randomly declared them the only ones”. This semi-formal model also introduces the term “diagonalization”, which I have found to be a useful handle.
I think “paranoia” centrally becomes adaptive when you are in conflict with a “more competent”[1] adversary. Now, we unfortunately do not have a generally accepted and well-formalized definition of “competence”, especially in environments with multiple agents. However, I think we can at least talk about some extreme examples where an agent is “strictly more competent” than another agent.
One such possible definition of “strictly more competent” is when the more competent agent can cheaply[2] predict everything the other agent will do (including how it will react to the bigger agent’s attempts at doing so). In such cases the stronger agent in some sense “contains” the smaller agent.
When a larger agent contains a smaller agent this way, the smaller agent can simply be treated like any other part of the environment. If you want to achieve a goal, you simply choose the actions that produce the best outcome, including the reaction from the smaller agent.
You can solve this optimization problem with brute search if the input space is small and the agent and environment are deterministic, or with something like gradient descent if the input space is big and the agent is nondeterministic. If the smaller agent tries to predict what you are going to do and adapt, you predict how the smaller agent will model you, and then choose actions that most exploit the weaknesses in that model.
I often refer to this as the act of “diagonalizing” against the smaller agent.
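As a rough illustration of the brute-search version, here is a minimal Python sketch. The names (`diagonalize`, `predict_response`, `payoff`) are my own placeholders rather than anything from an existing system, and the whole thing assumes you already hold a model of the smaller agent that maps your visible action to its reply:

```python
# Toy sketch: "diagonalizing" a smaller agent by brute search over your own
# actions, using your (possibly flawed) model of how it will respond.

def diagonalize(my_actions, predict_response, payoff):
    """Pick the action whose predicted consequences are best for us.

    my_actions:       actions available to the bigger agent
    predict_response: our model of the smaller agent; maps our action to
                      the reply we expect from it
    payoff:           maps (our action, their action) to our utility
    """
    best_action, best_value = None, float("-inf")
    for action in my_actions:
        predicted_reply = predict_response(action)  # simulate the smaller agent
        value = payoff(action, predicted_reply)     # evaluate the joint outcome
        if value > best_value:
            best_action, best_value = action, value
    return best_action

# Example: rock-paper-scissors against an agent whose (flawed) model of us
# says we will play rock, so it plays paper no matter what we actually do.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}
model = lambda my_action: "paper"
payoff = lambda mine, theirs: 1 if BEATS[mine] == theirs else (0 if mine == theirs else -1)
print(diagonalize(BEATS, model, payoff))  # -> "scissors"
```

The gradient-descent version from the paragraph above would replace the loop with a search over a continuous input space, but the structure is the same: simulate the modelled agent, score the joint outcome, pick the best input.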
Sidebar on the origin of the term “diagonalization”
I’ve encountered the term “diagonalization” for this kind of operation in MIRI-adjacent circles. I am not even sure whether I am using the term the same way they are using it, but I have found the way I am using it to be a very useful handle (though with a terribly inaccessible name that IMO we really should change).
The origin of this term is unclear to me but the first mention that I can find for it is this 2012 @Vladimir_Nesov post. Applying the ideas in that post to a straightforward adversarial game looks as follows (please forgive my probably kind of botched explanation, and I invite anyone who was more involved with the etymology of the term to give a better one):
The problem with trying to predict what an adversary will do in response to your actions is of course that they will be trying to do the same to you.
Now, let’s say agent A is trying to predict what you, agent B, are going to do, and is trying to adapt to that. Let’s further assume agent A’s model of you has some flaws, and you know those flaws. Then you can simulate agent A, including their flawed model of you, and, conditional on them acting on that flawed model, choose the best counter-response to what they are actually doing (which is going to be different from what they predicted you would do).
This is somewhat similar to Cantor’s diagonal argument, where you prove that the reals have greater cardinality than the natural numbers: assume you had a way to list all of the reals, then step outside of that supposed list by going down the diagonal and changing each entry.
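Spelled out (this is just the textbook construction, included for reference): given any purported list $x_1, x_2, x_3, \dots$ of all the reals in $[0,1]$, write each one in decimal as $x_n = 0.d_{n1}d_{n2}d_{n3}\dots$ and define

$$y = 0.e_1 e_2 e_3 \dots, \qquad e_n = \begin{cases} 5 & \text{if } d_{nn} \neq 5 \\ 6 & \text{if } d_{nn} = 5 \end{cases}$$

Then $y$ differs from $x_n$ in its $n$-th digit for every $n$ (and avoids the $0.4999\dots = 0.5$ ambiguity, since it uses only the digits 5 and 6), so $y$ is not on the list, and no list can contain all the reals.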
We can prove by contradiction that if one agent is diagonalizing another agent this way in an adversarial zero-sum game without pure-strategy Nash equilibria, the other agent cannot in turn do the same. (If the game does have pure-strategy Nash equilibria, we can instead prove that if the two agents are diagonalizing each other, they must end up in one of them, which is equivalent to both players playing a minimax strategy.)
For concreteness, let’s say agent A and agent B are playing rock, paper, scissors as their adversarial game. Suppose agent A is perfectly predicting agent B: say it predicts that agent B will choose rock, and hence chooses paper. But by assumption agent B also perfectly predicts agent A, so agent B knows that agent A is choosing paper, and will play scissors in order to win.
This contradicts agent A’s prediction, so one of the assumptions must be wrong: in this game, both players cannot simultaneously be predicting each other perfectly and diagonalizing against those predictions.
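Another way to see the same thing, as a tiny sketch of my own: follow the chain of “best response to the best response to …” and notice that in rock-paper-scissors it never settles, which is just a restatement of the fact that the game has no pure-strategy Nash equilibrium for two mutually-perfect predictors to land on.

```python
# Follow the chain of best responses in rock-paper-scissors and check
# whether it ever reaches a move that the chain can settle on.
BEATEN_BY = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

move, chain = "rock", []
while move not in chain:
    chain.append(move)
    move = BEATEN_BY[move]  # the perfect predictor's counter to that move

print(chain)  # ['rock', 'paper', 'scissors'] -- a cycle, never a fixed point
```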
Now, in the situation of facing an opponent who is “strictly more competent” as defined above, your choices are quite limited. You have been “diagonalized against”: every move of yours has been predicted with perfect accuracy, and your opponent has prepared the best countermeasure for each. The best you can do is to fall back on a minimax strategy, where you take actions assuming your opponent is playing optimally against you, and maybe try to eke out a bit of utility along the way.
However, the model above does suggest some natural weakenings of “strictly more competent” that create a bit more wiggle room.
In any realistic scenario, in order to do something akin to diagonalizing an opponent, you need to do the following:[3]
You need to get information about their internal workings to build a model of them
You need to sample[4] that model to extract predictions about their behavior
You need to identify parts of the model’s input space that reliably produce the actions that you want, conditional on them having observed your actions
And each of the three component strategies of paranoia I argued for in Paranoia: A Beginner’s Guide, addresses one of these:
1. By blinding yourself to information channels that are more controllable, you force an opponent to search harder for inputs that produce the behaviors they want[5], making it harder to come up with reliably adversarial inputs (i.e. step 3)
2. By removing the adversarial agents from your environment you make it harder for those adversaries to get information about you and to build a model of you in the first place (i.e. step 1)[6]
3. By making yourself erratic and unpredictable you make yourself more costly to predict, usually requiring many more samples before an adversary can get adequate bounds on your behavior (i.e. step 2; see the toy sketch below)
Overall the set of paranoid strategies in Paranoia: A Beginner’s Guide was roughly the result of looking at each step in the process of simulating and diagonalizing against another agent and thinking about how to thwart it.
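To make strategy 3 slightly more concrete, here is a toy illustration of my own (not from the original post): an adversary that samples your policy and plays the counter to your most common move wins every round against a deterministic policy, but gains nothing in expectation against a uniform mix.

```python
import random

BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}
BEATEN_BY = {loser: winner for winner, loser in BEATS.items()}

def exploit_value(policy, n_samples=10_000, seed=0):
    """Expected payoff of an adversary that samples our policy, guesses our
    most common move, and plays the counter to that guess.

    policy: dict mapping our move -> probability with which we play it.
    """
    rng = random.Random(seed)
    moves, probs = zip(*policy.items())
    samples = rng.choices(moves, weights=probs, k=n_samples)
    guess = max(set(samples), key=samples.count)  # the adversary's model of us
    counter = BEATEN_BY[guess]                    # their prepared exploit

    def payoff(theirs, ours):
        return 1 if BEATS[theirs] == ours else (0 if theirs == ours else -1)

    # Their expected payoff against what we actually play:
    return sum(p * payoff(counter, move) for move, p in policy.items())

print(exploit_value({"rock": 1.0}))                                 # 1.0: fully exploited
print(exploit_value({"rock": 1/3, "paper": 1/3, "scissors": 1/3}))  # 0.0: nothing to exploit
```

This only captures the “nothing left to exploit” half of the claim; the “many more samples needed” half would require modeling the adversary’s estimation error, which this sketch doesn’t do.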
But how well does this toy model translate to reality?
I think it’s pretty messy. In particular, many of the strategies best suited to adversarial information environments rely on forming agreements and contracts with other agents that are not adversarial to you, e.g. to perform the role of auditors. The above model has no room for any agents other than you and the bigger agent.
The strategies I list are also all focused on “how do I make the enemy less good at hurting me” and not very focused on “how do I perform better after I have cut off the enemy (via the strategies of paranoia)”. When thinking about strategies adaptive to adversarial environments, “learning how to think from first principles” is IMO basically the top one, but since the above model is framed in a zero-sum context, we can’t say much about upside outside of the conflict itself.
But overall, I am still quite happy to get these models out. I have for years been warning others of “the risks of diagonalization” and saying insane-sounding things like “I don’t want to diagonalize against them too hard”, and maybe now people will actually understand what I am saying without me having to start with a 20-minute lecture on set theory.
Postscript
Ok, but please, does anyone have a suggestion for a better term than “diagonalization”?
Like, the key challenge is that all the alternatives I can think of lack the flexibility of this word. It has all the different tenses and conjugations and flows nicely. “That’s diagonalization”, “He is diagonalizing you”, “I am being diagonalized” are all valid constructions. Alternatives like “adversarial prediction” are both much more ambiguous and don’t adapt to context as well. “That’s adversarial prediction”, “He is adversarially predicting you”, “I am being adversarially predicted” all sound awkward, especially the last one.
But IMO this is a really useful concept that I am hoping to build on more. I would like to be able to use it without needing to give a remedial set-theory class every time, so if anyone has a better name, I would greatly appreciate suggestions.
[1] Or an adversary with more time to spend on a conflict than you have.

[2] “Cheaply” in the limit meaning “the stronger agent can do this for a weaker agent as many times as they like”. This is of course quite extreme and runs into the limits of computability, but I at least for now don’t know how to weaken it to make it more realistic.

[3] But “Habryka, stop!” you scream, as I justify one “list of three things that intuitively seem like the only options” with another “list of three things that intuitively seem like the only options”, and you know, fair enough. But look man, our toy model in this situation really has many fewer moving parts, and I think the argument for why these are the only three things to do is more robust than the previous one.

[4] You don’t actually need to “sample” it, though it’s of course the most natural thing to do. I can predict the outputs of programs without sampling from them, and similarly having formed a model of another agent, you can do things much more sophisticated than simply sampling trajectories. But for simplicity, let’s talk about “sampling”, and I think this shouldn’t change any of the rest of the argument, though honestly I haven’t checked that hard.

[5] Or, in practice, force them to take more costly actions to control a larger part of your input space. This however is outside the realm of the narrow semi-formal model I am proposing here, as we are not modeling actions as having costs. It probably wouldn’t be too hard to properly add to the model, but I haven’t tried it.

[6] As well as of course potentially eliminating the bigger agent altogether, which is not addressed in this model, as death is not part of our tiny little toy world.
I count 5 strategies in this post & the previous one, rather than 3:
Blinding. Block information input from the adversary to you.
Privacy. Block information output from you to the adversary.
Disempowerment. Don’t let the adversary have control over parts of the environment that you care about.
Vindictiveness. Do things that are opposed to the adversary’s interests.
Randomness. Do things that are hard for the adversary to predict.
#3 Disempowerment was least explicitly stated in your writing but was present in how you talked about purging / removal from your environment. Examples: Don’t have a joint bank account with them, don’t appoint them to be in charge of a department in your organization, don’t make agreements with them where they have the official legal rights but there’s a handshake deal that they’ll share things with you.
Truman’s response to the Red Scare included all (or at least most) of the first 4 strategies. It was primarily #2 Privacy: in fact, the Soviet spies were mainly doing espionage (acquiring confidential information from the US government), and purging them was blocking them from getting that information. But Truman was worried about them doing subversion (getting the US government to make bad decisions), which would make purging them #3 Disempowerment. And executing them (rather than just firing them) makes it #4 Vindictiveness too.
The Madman Theory example in the other post is mainly about vindictiveness (it’s a threat to retaliate), even though it’s done in a way that involves some randomness.
#5 Randomness feels least like a single coherent thing out of these 5. I’d break it into:
5a Maximin. Do things that work out best in the worst case scenario. This often involves a mixed strategy where you randomize across multiple possible actions (assuming you have a hidden source of randomness).
5b Erraticness. Thwart their expectations. Don’t do the thing that they’re expecting you to do, or do something that they wouldn’t have expected.
Though #5b Erraticness seems like an actively bad idea if you have been fully diagonalized, since in that case you won’t actually succeed at thwarting their expectations and your erratic action will instead be just what they wanted you to do. It is instead a strategy for cat-and-mouse games where they can partially model you but you can still hope to outsmart them.
If you have been diagonalized, it’s better to limit your repertoire of actions. Choose inaction where possible, stick to protocol, don’t do things that are out of distribution. The smaller the set of actions that you ever do, the fewer options the diagonalizer has for what to get you to do. A hacker gets a computer system into a weird edge case, a social engineer gets someone to break protocol, a jailbreaker gets an LLM into an out-of-distribution state. An aspiring diagonalizer also wants to influence the process that you use to make decisions, and falling back on a pre-existing protocol can block that influence. I would include this on my list of strategies, maybe #6 Act Conservatively.
Looking back through these, most of them aren’t that specific to diagonalization scenarios. Strategies 4 (Vindictiveness) & 5a (Maximin) are standard game-theory strategies that come up in lots of contexts. I think that strategies 1-3 fall out of a fairly broad sense of what it means for someone to be an adversary—they are acting contrary to your interests, in a way that’s entangled with you; they’re not just off somewhere else doing things you don’t like, they are in some way using you to get more of the thing that’s bad for you. In what ways might they be using you to get more of the thing? Maybe they’re getting information from you which they can then use for their purposes, maybe they’re trying to influence what you do so you do what they want, maybe you’ve let them have control over something which you could have disallowed. Strategies 1 (Blinding), 2 (Privacy), and 3 (Disempowerment) just involve undoing/blocking one of those.
I found this section quite helpful and think splitting that into these two parts is probably the right call (including the caveat that this backfires if your opponent has actually diagonalized you).
I am working on a post trying to find a set of more common-language abstractions for reasoning about this stuff, which I think the erraticness fits into a bit better.
For alternatives to “diagonalization,” the term “next-leveling” is less ambiguous than just “leveling”, IMO. It more directly suggests increased depth of counter-modeling / meta-cognitive exploitation.
A more obscure option is “Yomi”. Yomi (読み, literally “reading” in Japanese) is already established terminology for recursive prediction. In fighting games, yomi layers represent recursive depths of prediction (layer 1: predicting their action, layer 2: predicting their prediction of your action, etc.).
As a card game: https://www.sirlin.net/articles/designing-yomi
Oh, “Yomi” is a very interesting suggestion, thank you! I’ve definitely encountered that term and can imagine it being a great replacement here. I’ll read more about it and see how to adapt it to non-game contexts.
Only if one of them is diagonalizing the other (acting contrary to what the other would’ve predicted about its actions). If this isn’t happening, maybe there is no problem.
For example, the halting problem is unsolvable because you are asking for a predictor that simultaneously predicts the behavior of every program, and among all programs there is at least one (that’s easy to construct) that is diagonalizing the predictor’s prediction of its behavior (acting contrary to what the predictor would’ve predicted about its behavior), by predicting the predictor and doing the opposite. But proving that a specific program halts (or not) is often possible; that’s not the halting problem.
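The construction being gestured at, as a minimal Python sketch (the `halts` oracle is hypothetical; the point is that no correct implementation of it can exist once `contrarian` is among the programs it must answer for):

```python
def halts(program) -> bool:
    """Hypothetical oracle: True iff running `program()` eventually halts."""
    ...  # assumed to exist, for the sake of contradiction

def contrarian():
    # Predict the predictor, then do the opposite of whatever it predicts.
    if halts(contrarian):
        while True:   # the oracle said "halts", so loop forever
            pass
    else:
        return        # the oracle said "runs forever", so halt immediately
```

Whichever answer `halts(contrarian)` gives, `contrarian` does the other thing, so the oracle is wrong about at least this one program.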
There is no infinite regress, and probably no useful ordering of agents/programs by how big they are in this way. It’s perfectly possible for agents to reason about each other, including about their predictions about themselves or each other. And where there is diagonalization, it doesn’t exactly say which agent was bigger (an agent can even diagonalize itself, to make its own actions unpredictable to itself).
See for example the ASP problem where in Newcomb’s problem the predictor is “smaller” and by-stipulation predictable (rather than an all-powerful Omega), and so the “bigger” box-choosing agent needs to avoid any sudden movements in its thoughts to keep itself predictable and get the big box filled by the predictor.
Maybe quines can illustrate how there is no by-default infinite regress. You can write a program in Python that prints a program in Java that in turn prints the original program in Python. Neither of the programs is “bigger” than the other.
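The cross-language pair takes a bit more space to write out, but even a single-language quine makes the point that self-reference doesn’t force a regress. A standard minimal Python one:

```python
# The two lines below print an exact copy of themselves (ignoring this comment):
# the string holds a template of the program, and %r splices in its own repr.
s = 's = %r\nprint(s %% s)'
print(s % s)
```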
Other than blinding itself to the bigger agent’s actions, alternative safer ways of observing the bigger agent might be available: reasoning about it rather than directly observing what it actually does. Even a “big” agent doesn’t contain or control all the reasoning about it; a theory of an agent is bigger than the agent itself, and others can pick and choose what to reason about. Also, self-contained reasoning that produces some conclusion can itself make use of observations of the “big” agent, if the observations are not used for anything else. So it’s not even necessarily about blinding, but rather about compartmentalized reasoning, where the observations (tainted data) don’t get indiscriminate influence but can still be carefully used to learn things.
It’s from Cantor’s diagonal argument. See also: diagonal argument, Lawvere’s fixpoint theorem. It’s just this: you construct an endomap without fixpoints, and that breaks stuff. This works as well for maps that are defined/enacted by agents in their behavior, mapping beliefs/observations to actions; you just need to close the loop so that beliefs/observations start talking about the same things as the actions.
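For reference, the version I take this to be pointing at is the contrapositive, “diagonal” form of Lawvere’s fixed-point theorem: if some map $f \colon B \to B$ has no fixed point (e.g. negation on $\{0,1\}$), then no map $\phi \colon A \to B^A$ can hit every function from $A$ to $B$, because the diagonal map

$$g(a) = f\bigl(\phi(a)(a)\bigr)$$

differs from $\phi(a)$ at the input $a$, for every $a \in A$. Taking $B = \{0,1\}$ and $f$ to be negation recovers Cantor’s argument.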
Ok, having more time today and thinking more about it, I have updated the description of the proof in the infobox! Curious whether it seems better/more accurate to you now.
Ah, yes, of course. I’ll update the description.
Agree on all the rest, I think. I didn’t intend to establish a strict ordering of agents (though my usage of bigger and smaller in the strict case of adversarial diagonalizing agents sure suggested it). In those cases I find it a useful visualization to think in terms of bigger and smaller.
I agree that “diagonalization” is a fine term for the specific narrow thing where you choose actions contrary to what the other agent would have predicted you would do, in the way described here, but I am talking more about the broader phenomenon of “simulating other agents adversarially in order to circumvent their predictions”. “Leveling” is apparently a term from poker that means something kind of similar and more general.
Like, I would like a term for this kind of thing that is less opinionated about the exact setup and technical limitations. Like, I am pretty sure there is a more general phenomenon here.
The idea of “simulating adversarially” might be a bit confusing in the context of diagonalization, since it’s the diagonalization that is adversarial, not the simulation. In particular, you’d want mutual simulation (or rather more abstract reasoning) for coordination. If you merely succeed in acting contrary to a prediction, making the prediction wrong, that’s not diagonalization. What diagonalization does is make the prediction not happen in the first place (or, in the case of putting a credence on something, make the credence remain at some weaker prior). So diagonalization is something done against a predictor whose prediction is targeted, rather than something done by the predictor. A diagonalizer might itself want to be a predictor, but that is not necessary if the prediction is just given to it.
I’m glad you responded to this as this stood out to me too.
Quines only illustrate that there is no by-default infinite regress within the assumed system (here, a formal, deterministic string-rewriting game), which is built on assumptions themselves subject to the Munchhausen Trilemma.
I’m not trying to be pedantic here; I think it’s pretty important to consider the implications of this.
Mindfucking
Has some things going for it, but probably too vulgar for most occasions, and also IMO implies too much malintent. I like “diagonalization” because it’s kind of a thing you can imagine doing.
I think my current favorite choice is “Leveling” and “starting a leveling war” from poker. Sentences like “man, I feel like this is too much trying to level them” and “this feels like it’s starting a leveling war” are decent-ish pointers.
Outmaneuvering?
Calling it “diagonalization” is distracting for me, since I associate that so much with a particular type of strategy for reasoning through proofs. “Encapsulation” feels like another potential word for what you’ve described, though I don’t think it’s ‘better’ in the sense of being more clear to the average listener. “Trapped in the Matrix” almost feels right, but I generally don’t like using fictional references, especially since we see how that can go weirdly like with ‘redpilling’.
But it comes to mind because of the idea of being inside a system and yet being able to have an understanding of that system which is bigger than the system itself, even potentially letting you alter the system and thus foil its plans. Even when you are ‘diagonalized’ or ‘encapsulated’ by an agent/system, that greater agent is still itself within a greater possibility space of all possible agents/systems.