Diagonalization: A (slightly) more rigorous model of paranoia
In my post on Wednesday (Paranoia: A Beginner’s Guide), I talked at a high level about the experience of paranoia, and gave two models (the lemons market model and the OODA loop model) that try to get us a bit closer to understanding its nature and purpose.
I then made a big claim that went largely unargued in the post, that there are three kinds of strategies that make sense to pursue in adversarial information environments:
You blind yourself
You eliminate the sources of deception
You act unpredictably
Now, Unnamed brought up a very reasonable critique in the comments! Why would there be exactly three strategies that make sense? How can we have any confidence that there isn’t a 4th kind of strategy that works?
And, in reality, the space of strategies is huge! Many of the most effective strategies (like building networks of trust, hiring independent auditors, performing randomized experiments, and “getting better at figuring out the truth on your own”) don’t neatly fit into the categories in that post. Maybe they can somehow be forced into this ontology, but IMO they are not a great fit.
But I argue that there is a semi-formal model in which this set of three strategies fully covers the space of possible actions, and as such that decomposing the space of strategies into these three categories is more natural than just “I pulled these three strategies out of my bag and randomly declared them the only ones”. This semi-formal model also introduces the term “diagonalization” which I have found to be a useful handle.
I think “paranoia” centrally becomes adaptive when you are in conflict with a “more competent”[1] adversary. Now, we unfortunately do not have a generally accepted and well-formalized definition of “competence”, especially in environments with multiple agents. However, I think we can at least talk about some extreme examples where an agent is “strictly more competent” than another agent.
One such possible definition of “strictly more competent” is when the more competent agent can cheaply[2] predict everything the other agent will do (even including how it will react to the bigger agent’s attempts at doing so). In such cases the stronger agent in some sense “contains” the smaller agent.
When a larger agent contains a smaller agent this way, the smaller agent can simply be treated like any other part of the environment. If you want to achieve a goal, you simply choose the actions that produce the best outcome, including the reaction from the smaller agent.
You can solve this optimization problem with brute-force search if the input space is small and the agent and environment are deterministic, or with something like gradient descent if the input space is big and the agent is nondeterministic. If the smaller agent tries to predict what you are going to do and adapt, you predict how the smaller agent will model you, and then choose actions that most exploit the weaknesses in that model.
I often refer to this as the act of “diagonalizing” against the smaller agent.
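Here is a minimal sketch of the brute-force case (everything here is hypothetical: the opponent is reduced to a deterministic response function and the payoffs are made up), just to show what “treating the smaller agent as part of the environment” looks like:

```python
# Minimal sketch: diagonalizing a small, deterministic opponent by treating
# it as just another part of the environment. All names and numbers are illustrative.

def opponent_response(my_action: str) -> str:
    """A toy model of the smaller agent: a fixed, predictable policy."""
    return {"cooperate": "cooperate", "exploit": "protest", "feint": "cooperate"}[my_action]

def payoff(my_action: str, their_action: str) -> float:
    """The larger agent's utility for each joint outcome (toy numbers)."""
    table = {
        ("cooperate", "cooperate"): 1.0,
        ("exploit", "protest"): 2.0,
        ("feint", "cooperate"): 3.0,
    }
    return table.get((my_action, their_action), 0.0)

def diagonalize(actions):
    """Brute-force search: simulate the opponent's reaction to every action
    and pick whichever action exploits that predicted reaction best."""
    return max(actions, key=lambda a: payoff(a, opponent_response(a)))

print(diagonalize(["cooperate", "exploit", "feint"]))  # -> "feint"
```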
Sidebar on the origin of the term “diagonalization”
I’ve encountered the term “diagonalization” for this kind of operation in MIRI-adjacent circles. I am not even sure whether I am using the term the same way they are using it, but I have found the way I am using it to be a very useful handle (though with a terribly inaccessible name that IMO we really should change).
The origin of this term is unclear to me but the first mention that I can find for it is this 2012 @Vladimir_Nesov post. Applying the ideas in that post to a straightforward adversarial game looks as follows (please forgive my probably kind of botched explanation, and I invite anyone who was more involved with the etymology of the term to give a better one):
The problem with trying to predict what an adversary will do in response to your actions is of course that they will be trying to do the same to you.
Now, let’s say agent A is trying to predict what you, agent B, are going to do, and is trying to adapt to that. Let’s also assume agent A’s model of you has some flaws, and you know those flaws. Then you can simulate agent A, including their flawed model of you, and, conditional on them using that flawed model, choose the best counter-response to what they are doing (which is going to be different from what they predicted).
This is somewhat similar to how Cantor’s diagonal argument proves the reals have greater cardinality than the rationals (or any other countable set): you assume the reals can be listed, then construct a new real that differs from the n-th entry of the list in its n-th digit, thereby stepping outside of the list.
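And here is a tiny, finite toy version of the diagonal construction itself, just to make the “step outside the list” move concrete (illustrative only, obviously not the actual uncountability proof):

```python
# Toy version of Cantor's diagonal construction: given a (finite, toy) list of
# binary sequences, build a new sequence that differs from the n-th sequence
# at position n, and therefore cannot appear anywhere in the list.

listed = [
    [0, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
]

# Flip the n-th digit of the n-th entry.
diagonal = [1 - listed[n][n] for n in range(len(listed))]

print(diagonal)                                # [1, 0, 0, 1]
assert all(diagonal != row for row in listed)  # differs from every listed sequence
```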
We can prove by contradiction that if one agent is diagonalizing another agent this way in an adversarial zero-sum game (one without pure-strategy Nash equilibria), the other agent cannot in turn do the same (if the game does have pure-strategy Nash equilibria, then we can prove that if the two agents are diagonalizing each other, they must end up in one of them, which is equivalent to both players playing a minimax strategy).
As a concrete example, let’s assume agent A and agent B are playing rock, paper, scissors as their adversarial game, and suppose, for contradiction, that both agents are diagonalizing each other. Agent A perfectly predicts agent B, predicts that agent B will choose rock, and hence chooses paper. But then, by assumption, agent B knows that agent A is choosing paper, and will play scissors in order to win.
This of course is a contradiction (agent A predicted rock, but agent B plays scissors), so we know one of the assumptions must fail: both players cannot be perfectly predicting each other and diagonalizing against those predictions in this game.
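The same contradiction can be checked mechanically (a toy sketch; the helper names are mine): look for a pair of pure moves where each player is simultaneously playing the move that beats the other’s actual move, and note that no such pair exists:

```python
# Sketch of the rock-paper-scissors contradiction: if both players perfectly
# predict each other and best-respond, each must be playing the move that
# beats the other's actual move. No pair of pure moves satisfies that.

BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def best_response(predicted_move: str) -> str:
    """The move that beats the predicted move."""
    return next(m for m, beaten in BEATS.items() if beaten == predicted_move)

consistent_pairs = [
    (a, b)
    for a in BEATS
    for b in BEATS
    if a == best_response(b) and b == best_response(a)
]
print(consistent_pairs)  # [] -- no pair where both players are diagonalizing the other
```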
Now, in the situation of facing an opponent “strictly more competent”, as defined above, your choices are quite limited. You have been “diagonalized against”, every move of yours has been predicted with perfect accuracy, and your opponent has prepared the best countermeasure for each. The best you can do is to operate on a minimax strategy where you take actions assuming your opponent is playing strictly optimally against you, and maybe try to eke out a bit of utility along the way.
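For reference, this is the standard minimax criterion being invoked here (standard game-theory notation, not anything from the original post): you pick the strategy whose worst-case payoff, over everything the opponent might do, is highest.

```latex
% Minimax criterion for the weaker agent B (with utility u_B) against adversary A
% in a zero-sum game: choose the (possibly mixed) strategy whose worst case is best.
\sigma_B^{*} = \arg\max_{\sigma_B} \; \min_{\sigma_A} \; u_B(\sigma_B, \sigma_A)
```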
However, the model above does suggest some natural weakenings of “strictly more competent” that create a bit more wiggle room.
In any realistic scenario, in order to do something akin to diagonalizing an opponent, you need to do the following (a toy sketch of the whole pipeline follows the list):[3]
You need to get information about their internal workings to build a model of them
You need to sample[4] that model to extract predictions about their behavior
You need to identify parts of the model’s input space that reliably produce the actions that you want, conditional on the opponent having observed your actions
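Here is a toy sketch of that three-step pipeline (everything is hypothetical and grossly simplified: the “model” is just an empirical frequency table, and the “search” is brute force over a two-element input space):

```python
# Toy sketch of the three steps of diagonalization:
#   1. build a model of the opponent from observations of their behavior,
#   2. sample that model to predict how they respond to a given input,
#   3. search the input space for inputs that reliably produce the response you want.
import random
from collections import Counter, defaultdict

# Step 1: build a model from observed (signal, response) pairs.
def build_model(observations):
    model = defaultdict(Counter)
    for signal, response in observations:
        model[signal][response] += 1
    return model

# Step 2: sample the model to predict the opponent's response to a signal.
def sample(model, signal):
    population, weights = zip(*model[signal].items())
    return random.choices(population, weights=weights)[0]

# Step 3: search the input space for the signal that most reliably
# produces the desired response, according to the model.
def find_exploit(model, desired_response, signals, n_samples=200):
    def hit_rate(signal):
        return sum(sample(model, signal) == desired_response for _ in range(n_samples)) / n_samples
    return max(signals, key=hit_rate)

# Usage with made-up observations of a gullible opponent:
observations = [("flattery", "concede")] * 8 + [("flattery", "resist")] * 2 + \
               [("threat", "resist")] * 9 + [("threat", "concede")] * 1
model = build_model(observations)
print(find_exploit(model, "concede", ["flattery", "threat"]))  # -> "flattery"
```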
And each of the three component strategies of paranoia I argued for in Paranoia: A Beginner’s Guide, addresses one of these:
1. By blinding yourself to information channels that are more controllable, you force an opponent to search harder for inputs that produce the behaviors they want[5], making it harder to come up with reliably adversarial inputs (i.e. step 3)
2. By removing the adversarial agents from your environment you make it harder for those adversaries to get information about you and to build a model of you in the first place (i.e. step 1)[6]
3. By making yourself erratic and unpredictable you make yourself more costly to predict, usually requiring many more samples to get adequate bounds on your behavior, raising the cost of predicting you (i.e. step 2; illustrated below)
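As a small illustration of that third point (a hypothetical sketch with made-up numbers; “prediction accuracy” here just means matching the target’s next move): against a deterministic target, a handful of observations pins its behavior down exactly, while against a target that randomizes, no amount of data gets a predictor much past chance:

```python
# Illustration: randomizing caps how well an observer can predict you,
# no matter how many samples of your past behavior they collect.
import random

def prediction_accuracy(agent, n_train=1000, n_test=1000):
    """Fit the simplest possible model (predict the most frequent past move)
    and measure how often it matches the agent's next move."""
    history = [agent() for _ in range(n_train)]
    guess = max(set(history), key=history.count)
    return sum(agent() == guess for _ in range(n_test)) / n_test

deterministic = lambda: "rock"
erratic = lambda: random.choice(["rock", "paper", "scissors"])

print(prediction_accuracy(deterministic))  # ~1.0: trivially predictable
print(prediction_accuracy(erratic))        # ~0.33: capped near chance, however many samples
```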
Overall the set of paranoid strategies in Paranoia: A Beginner’s Guide was roughly the result of looking at each step in the process of simulating and diagonalizing against another agent and thinking about how to thwart it.
But how well does this toy model translate to reality?
I think it’s pretty messy. In particular, many strategies best suited to adversarial information environments rely on forming agreements and contracts with other agents that are not adversarial to you, e.g. to perform the role of auditors. The above model has no room for any agents other than you and the bigger agent.
The strategies I list are also all focused on “how to make the enemy less good at hurting me” and not very focused on “how do I perform better after I have cut off the enemy (via the strategies of paranoia)”. When thinking about strategies adaptive to adversarial environments, “learning how to think from first principles” is IMO basically the top one, but since the above model is framed in a zero-sum context, we can’t say much about upside outside of the conflict itself.
But overall, I am still quite happy to get these models out. I have for years been warning others of “the risks of diagonalization” and been saying insane-sounding things like “I don’t want to diagonalize against them too hard”, and maybe now people will actually understand what I am saying without me having to start with a 20-minute lecture on set theory.
Postscript
Ok, but please, does anyone have a suggestion for a better term than “diagonalization”?
Like, the key problem is that all the alternatives I can think of lack the flexibility of this word. It has all the different tenses and conjugations and flows nicely. “That’s diagonalization”, “He is diagonalizing you”, “I am being diagonalized” are all valid constructions. Alternatives like “adversarial prediction” are both much more ambiguous and don’t adapt to context as well. “That’s adversarial prediction”, “He is adversarially predicting you”, “I am being adversarially predicted” sound awkward, especially the last one.
But IMO this is a really useful concept that I am hoping to build on more. I would like to be able to use it without needing to give a remedial set theory class every time, so if anyone has a better name, I would greatly appreciate suggestions.
[1] Or an adversary with more time to spend on a conflict than you have.
[2] “Cheaply” in the limit meaning “the stronger agent can do this for a weaker agent as many times as they like”. This is of course quite extreme and runs into the limits of computability, but I at least for now don’t know how to weaken it to make it more realistic.
[3] But “Habryka, stop!” you scream, as I justify one “list of three things that intuitively seem like the only options” with another “list of three things that intuitively seem like the only options”, and you know, fair enough. But look man, our toy model in this situation really has many fewer moving parts, and I think the argument for why these are the only three things to do is more robust than the previous one.
[4] You don’t actually need to “sample” it, though it’s of course the most natural thing to do. I can predict the outputs of programs without sampling from them, and similarly having formed a model of another agent, you can do things much more sophisticated than simply sampling trajectories. But for simplicity, let’s talk about “sampling”, and I think this shouldn’t change any of the rest of the argument, though honestly I haven’t checked that hard.
[5] Or, in practice, force them to take more costly actions to control a larger part of your input space. This however is outside the realm of the narrow semi-formal model I am proposing here, as we are not modeling actions as having costs. It probably wouldn’t be too hard to properly add to the model, but I haven’t tried it.
[6] As well as of course potentially eliminating the bigger agent altogether, which is not addressed in this model, as death is not part of our tiny little toy world.