Conceptual Problems with UDT and Policy Selection

Abstract

UDT doesn’t give us conceptual tools for dealing with multiagent coordination problems. There may have initially been some hope, because a UDT player can select a policy which incentivises others to cooperate, or because UDT can reason (EDT-style) that other UDTs are more likely to cooperate if it cooperates itself, or other lines of thought. However, it now appears that this doesn’t pan out, at least not without other conceptual breakthroughs (which I suggest won’t look that much like UDT). I suggest this is connected with UDT’s difficulties handling logical uncertainty.

Introduction

I tend to mostly think of UDT as the ideal, with other decision theories being of interest primarily because we don’t yet know how to generalize UDT beyond the simplistic domain where it definitely makes sense. This perspective has been increasingly problematic for me, however, and I now feel I can say some relatively concrete things about UDT being wrong in spirit rather than only in detail.

Relatedly, in late 2017 I made a post titled Policy Selection Solves Most Problems. Policy selection is the compromise solution which does basically what UDT is supposed to do, without the same degree of conceptual elegance, and without providing the hoped-for clarity which was a major motivation for studying these foundational problems. The current post can be seen as a follow-up to that, giving an idea of the sort of thing which policy selection doesn’t seem to solve.

The argument can also be thought of as an argument against veil-of-ignorance morality of a certain kind.

I don’t think any of this will be really surprising to people who have been thinking about this for a while, but, my thoughts seem to have recently gained clarity and definiteness.

Terminology Notes/References

UDT 1.0, on seeing observation o, takes the action a which maximizes the expected utility of "my code outputs action a on seeing observation o", with expected value evaluated according to the prior.

UDT 1.1, on seeing observation o, takes the action which the globally optimal policy (according to the prior) maps o to. This produces the same result as UDT 1.0 in many cases, but ensures that the agent can hunt stag with itself.

UDT 2 is like UDT 1.1, except it (1) represents policies as programs rather than input-output mappings, and (2) dynamically decides how much time to spend thinking about the optimal policy.

What I’m calling “policy selection” is similar to UDT 2. It has a fixed (small) amount of time to choose a policy before thinking more. However, it could always choose the policy of waiting until it has thought longer before it really chooses a strategy, so that’s not so different from dynamically deciding when to choose a policy.
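To make the UDT 1.1 / policy-selection picture concrete, here is a minimal sketch in Python. The toy world (a two-instance coordination problem), the payoffs, and the names are illustrative assumptions of mine, not anything from the definitions above; UDT 1.0's per-action optimization is harder to pin down in a few lines, since each instance has to reason about what its other instances output.

```python
from itertools import product

# Toy setup: the agent's code runs twice, once on observation "left" and once
# on observation "right".  A policy maps observations to actions, and utility
# is a function of the whole policy, evaluated from the prior before either
# instance sees anything.  (Payoffs are made up purely for illustration.)
OBSERVATIONS = ["left", "right"]
ACTIONS = ["A", "B"]

def expected_utility(policy):
    # Matching actions count as successfully hunting stag with yourself;
    # mismatched actions are worth nothing.
    if policy["left"] == policy["right"]:
        return 10 if policy["left"] == "A" else 8
    return 0

def udt11_action(observation):
    """UDT 1.1: choose the globally optimal policy, then apply it."""
    policies = [dict(zip(OBSERVATIONS, acts))
                for acts in product(ACTIONS, repeat=len(OBSERVATIONS))]
    best = max(policies, key=expected_utility)
    return best[observation]

print(udt11_action("left"), udt11_action("right"))  # -> A A
```

With only four policies the argmax is trivial, which is exactly the simplistic regime where UDT makes clean sense; the rest of this post is about what breaks outside it.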

Two Ways UDT Hasn’t Generalized

Logical Uncertainty

UDT 2 tries to tackle the issue of “thinking longer”, which is the issue of logical uncertainty. This is a conceptual problem for UDT, because thinking longer is a kind of updating. UDT is supposed to avoid updating. UDT 2 doesn’t really solve the problem in a nice way.

The problem with thinking for only a short amount of time is that you get bad results. Logical induction, the best theory of logical uncertainty we have at the moment, gives essentially no guarantees about the quality of beliefs at short times. For UDT 2 to work well, it would need early beliefs to at least be good enough to avoid selecting a policy quickly—early beliefs should at least correctly understand how poor-quality they are.

The ideal for UDT is that early beliefs reflect all the possibilities inherent in later updates, so that a policy optimized according to early beliefs reacts appropriately to later computations. Thin priors are one way of thinking of this. So far, nothing like this has been found.

Game Theory

The second way UDT has failed to generalize, and the main topic of this post, is to game theory (ie, multi-agent scenarios). Cousin_it noted that single-player extensive-form games provided a toy model of UDT. The cases where he says that the toy model breaks down are the cases where I am now saying the concept of UDT itself breaks down. Extensive-form games represent the situations where UDT makes real sense: those with no logical uncertainty (or at least, no non-Bayesian phenomena in the logical uncertainty), and, only one agent.

What’s the conceptual problem with extending UDT to multiple agents?

When dealing with updateful agents, UDT has the upper hand. For example, in Chicken-like games, a UDT agent can be a bully, or commit not to respond to bullies. Under the usual game-theoretic assumption that players can determine what strategies each other have selected, the updateful agents are forced to respond optimally to the updateless ones, IE give in to UDT bullies / not bully the un-bullyable UDT.
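As a toy illustration (the payoff numbers are a standard choice for Chicken, not anything from this post), the updateful player's best response to a committed "dare" is to swerve:

```python
# Chicken, with illustrative payoffs (row player's payoff, column player's payoff).
PAYOFFS = {
    ("dare", "dare"):     (-10, -10),
    ("dare", "swerve"):   (5, 0),
    ("swerve", "dare"):   (0, 5),
    ("swerve", "swerve"): (1, 1),
}

def best_response(their_committed_action):
    # An updateful player takes the other side's commitment as fixed and
    # simply maximizes its own payoff against it.
    return max(["dare", "swerve"],
               key=lambda a: PAYOFFS[(their_committed_action, a)][1])

# A UDT bully commits to "dare" logically earlier than the updateful player
# moves; the updateful player, reacting to the commitment, swerves.
commitment = "dare"
reply = best_response(commitment)
print(commitment, reply, PAYOFFS[(commitment, reply)])  # dare swerve (5, 0)
```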

Put simply, UDT makes its decisions “before” other agents. (The “before” is in logical time, though, not necessarily really before.)

When dealing with other UDT agents, however, the UDT agents have to make a decision “at the same time”.

Naively, the coordination mechanism “write your decisions on slips of paper simultaneously—no peeking!” is a bad one. But this is the whole idea of UDT—writing down its strategy under a “no peeking!” condition.

Other decision theories also have to make decisions "at the same time" in game-theoretic situations, but they don't operate under the "no peeking" condition. Guessing the behavior of the other players could be difficult, but the agent can draw on past experience to help solve this problem. UDT doesn't have this advantage.

Furthermore, we’re asking more of UDT agents. When faced with a situation involving other UDT agents, UDT is supposed to “handshake”—Löbian handshakes being at least a toy model—and find a cooperative solution (to the extent that there is one).

So far, models of how handshakes could occur have been limited to special cases or unrealistic assumptions. (I’d like to write a full review—I think there’s some non-obvious stuff going on—but for this post I think I’d better focus on what I see as the fundamental problem.) I’d like to see better models, but, I suspect that significant departures from UDT will be required.

Even if you don’t try to get UDT agents to cooperate with each other, though, the conceptual problem remains—UDT is going in blind. It has a much lower ability to determine what equilibrium it is in.

I think there is a deep relationship between the issue with logical uncertainty and the issue with game theory. A simple motivating example is Agent Simulates Predictor, which appears to be strongly connected to both issues.

How Does Equilibrium Selection Work?

The problem I’m pointing to is the problem of equilibrium selection. How are two UDT agents supposed to predict each other? How can they trust each other?

There are many different ways to think about agents ending up in game-theoretic equilibria. Most of them, as I understand it, rely on iterating the game so that the agents can learn about it. This iteration can be thought of as really occurring, or as occurring in the imagination of the players (an approach called “fictitious play”). Often, these stories result in agents playing correlated equilibria, rather than Nash equilibria. However, that’s not a very big difference for our purposes here—correlated equilibria only allow the DD outcome in Prisoner’s Dilemma, just like Nash.
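Here is a minimal fictitious-play sketch on the one-shot Prisoner's Dilemma (the payoffs and the pure-best-response simplification are my own illustrative choices, not a claim about any particular formalization). Because defection strictly dominates, the imagined iteration collapses to DD immediately, matching the claim above:

```python
from collections import Counter

# One-shot Prisoner's Dilemma payoffs for the row player; the game is symmetric.
U = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
ACTIONS = ["C", "D"]

def best_response(opponent_counts):
    # Best-respond to the empirical distribution of the opponent's past play.
    total = sum(opponent_counts.values()) or 1
    return max(ACTIONS, key=lambda a: sum(
        U[(a, b)] * opponent_counts[b] / total for b in ACTIONS))

def fictitious_play(rounds=50):
    # Each player imagines a history of play and best-responds to it.
    history = {1: Counter({"C": 1}), 2: Counter({"C": 1})}  # optimistic start
    for _ in range(rounds):
        a1 = best_response(history[2])
        a2 = best_response(history[1])
        history[1][a1] += 1
        history[2][a2] += 1
    return history

print(fictitious_play())
# Defection dominates, so play collapses to (D, D) on the first imagined round
# and stays there -- the only equilibrium outcome of the one-shot game.
```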

There’s something absurd about using iterated play to learn single-shot strategies, a problem Yoav Shoham et al. discuss in If multi-agent learning is the answer, what is the question?. If the game is iterated, what stops agents from taking advantage of its iterated nature?

That’s the essence of my argument in In Logical Time, All Games are Iterated Games—in order to learn to reason about each other, agents use fictitious play, or something similar. But this turns the game into an iterated game.

Turning a game into an iterated game can create a lot of opportunity for coordination, but the Folk Theorem says that it also creates a very large equilibrium selection problem. The Folk Theorem indicates that rational players can end up in very bad outcomes. Furthermore, we’ve found this difficult to avoid in decision algorithms we know how to write down. How can we eliminate the “bad” equilibria and keep only the “good” possibilities?

What we’ve accomplished is the reduction of the “handshake” problem to the problem of avoiding bad equilibria. (We could say this turns prisoner’s dilemma into stag hunt.)
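A toy rendering of that reduction (the policy names and payoffs are mine): let each player choose, at the policy level, between a "handshake" policy that cooperates exactly when the other side also picked the handshake, and plain defection. The induced policy-level game has two equilibria, one Pareto-better than the other: a stag hunt rather than a dominance-solvable dilemma.

```python
# Base Prisoner's Dilemma payoffs for the row player (symmetric game).
U = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def play(my_policy, their_policy):
    # The "handshake" policy cooperates only if the other side also committed
    # to the handshake; otherwise it defects.  Plain "defect" always defects.
    if my_policy == "handshake" and their_policy == "handshake":
        return "C"
    return "D"

POLICIES = ["handshake", "defect"]
print("policy-level payoff matrix (row player's payoff):")
for p1 in POLICIES:
    row = [U[(play(p1, p2), play(p2, p1))] for p2 in POLICIES]
    print(f"{p1:>10}: {row}")
# handshake: [3, 1]
#    defect: [1, 1]
# Two pure equilibria, (handshake, handshake) and (defect, defect), with the
# first Pareto-better -- a (weak) stag-hunt structure.
```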

Handshake or no handshake, however, the “fictitious play” view suggests that equilibrium selection requires learning. You can get agents into equilibria without learning, but the setups seem artificial (so far as I’ve seen). This requires updateful reasoning in some sense. (Although it can be a logical update only; being empirically updateless still seems wiser.)

Logical Uncertainty & Games

Taking this idea a little further, we can relate logical uncertainty and games via the following idea:

Our uncertain expectations are a statistical summary of how things have gone in similar situations in the (logical) past. The way we react to what we see can be thought of as an iterated strategy which depends on the overall statistics of that history (rather than a single previous round).

I’m not confident this analogy is a good one—in particular, the way policies have to depend on statistical summaries of the history rather than on specific previous rounds is a bit frustrating. However, the analogy goes deeper than I’m going to spell out here. (Perhaps in a different post.)

One interesting point in favor of this analogy: it also works for modal agents. The proof operator, □, is like a “prediction”: proofs are how modal agents think about the world in order to figure out what to do. So □X is like “the agent thinks X”. If you look at how modal agents are actually computed in the MIRI guide to Löb’s theorem, it looks like an iterated game, and □ looks like a simple kind of summary of the game so far. On any round, □X is true if and only if X has been true in every previous round. So, you can think of □X as ”X has held up so far”—as soon as X turns out to be false once, □X is never true again.

In this interpretation, FairBot (the strategy of cooperating if and only if the other player provably cooperates) becomes the “Grim Trigger” strategy: cooperate on the first round, and keep cooperating on every subsequent round so long as the other player has cooperated so far. If the other player ever defects, switch to defecting and never cooperate again.
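A small simulation of that iterated-game reading (this is the grim-trigger rendering of FairBot, not the proof-based definition itself): □("the opponent cooperates") is modeled as "the opponent has cooperated in every previous round", which is vacuously true on round zero.

```python
def fairbot(my_history, their_history):
    # Cooperate iff "the opponent cooperates" has held up so far, i.e. the
    # opponent cooperated in every previous round (vacuously true at round 0).
    return "C" if all(a == "C" for a in their_history) else "D"

def defectbot(my_history, their_history):
    return "D"

def run(agent1, agent2, rounds=5):
    h1, h2 = [], []
    for _ in range(rounds):
        a1 = agent1(h1, h2)
        a2 = agent2(h2, h1)
        h1.append(a1)
        h2.append(a2)
    return h1, h2

print(run(fairbot, fairbot))    # stable mutual cooperation: (['C', ...], ['C', ...])
print(run(fairbot, defectbot))  # FairBot cooperates once, then defects forever: Grim Trigger
```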

A take-away for the broader purpose of this post could be: one of the best models we have of the UDT “handshake” is the Grim Trigger strategy in disguise. This sets the tone nicely for what follows.

My point in offering this analogy, however, is to drive home the idea that game-theoretic reasoning requires learning. Even logic-based agents can be understood as running simple learning algorithms, “updating” on “experience” from (counter)logical possibilities. UDT can’t dance with its eyes closed.

This is far from a proof of anything; I’m just conveying intuitions here.

What UDT Wants

One way to look at what UDT is trying to do is to think of it as always trying to win a “most meta” competition. UDT doesn’t want to look at any information until it has determined the best way to use that information. UDT doesn’t want to make any decisions directly; it wants to find optimal policies. UDT doesn’t want to participate in the usual game-theoretic setup where it (somehow) knows all other agents’ policies and has to react; instead, it wants to understand how those policies come about, and act in a way which maximally shapes that process to its benefit.

It wants to move first in every game.

Actually, that’s not right: it wants the option of moving first. Deciding earlier is always better, if one of the options is to decide later.

It wants to announce its binding commitments before anyone else has a chance to, so that everyone has to react to the rules it sets. It wants to set the equilibrium as it chooses. Yet, at the same time, it wants to understand how everyone else will react. It would like to understand all other agents in detail, their behavior a function of itself.

So, what happens if you put two such agents in a room together?

Both agents race to decide how to decide first. Each strives to understand the other agent’s behavior as a function of its own, to select the best policy for dealing with the other. Yet, such examination of the other needs to itself be done in an updateless way. It’s a race to make the most uninformed decision.

I claim this isn’t a very good coordination strategy.

One issue is that jumping up a meta-level increases the complexity of a decision. Deciding a single action is much easier than deciding on a whole policy. Some kind of race to increasing meta-levels makes decisions increasingly challenging.

At the same time, the desire for your policy to be logically earlier than everyone else’s, so that they account for your commitments in making their decisions, means you have to make your decisions faster and in simpler, more predictable ways.

The expanding meta-space and the contracting time do not seem like a good match. You have to make a more complicated decision via less-capable means.

Two people trying to decide policies early are just like two people trying to decide actions late, but with more options and less time to think. It doesn’t seem to solve the fundamental coordination problem.

The race for most-meta is only one possible intuition about what UDT is trying to be. Perhaps there is a more useful one, which could lead to better generalizations.

Veils of Ignorance

UDT tries to coordinate with itself by stepping behind a veil. In doing so, it fails to coordinate with others.

Veil-of-ignorance moral theories describe multiagent coordination resulting from stepping behind a veil. But there is a serious problem. How can everyone step behind the same veil? You can’t tell what veil everyone else stepped behind if you stay behind your own veil.

UDT can successfully self-coordinate in this way because it is very reasonable to use the common prior assumption with a single agent. There is no good reason to suppose this in the multiagent case. In practice, the common prior assumption is a good approximation of reality because everyone has dealt with essentially the same reality for a long time and has learned a lot about it. But if we have everyone step behind a veil of ignorance, there is no reason to suppose they know how to construct the same veil as each other—they’re ignorant!

Is UDT Almost Right, Nonetheless?

I find myself in an awkward position. I still think UDT gets a lot of things right. Certainly, it still seems worth being updateless about empirical uncertainty. It doesn’t seem to make sense for logical uncertainty… but treating logical and empirical uncertainty in such different ways is quite uncomfortable. My intuition is that there should not be a clean division between the two.

One possible reaction to all this is to try to learn to be updateless. IE, don’t actually try to be updateless, but do try to get the problems right which UDT got right. Don’t expect everything to go well with a fixed Bayesian prior, but try to specify the learning-theoretic properties which approximate that ideal.

Would such an approach do anything to help multiagent coordination? Unclear. Thermodynamic self-modification hierarchies might work with this kind of approach.

In terms of veil-of-ignorance morality, it seems potentially helpful. Take away everything we’ve learned, and we don’t know how to cooperate from behind our individual veils of ignorance. But if we each have a veil of ignorance which is carefully constructed, a learned-updateless view which accurately reflects the possibilities in some sense, they seem more likely to match up and enable coordination.

Or perhaps a more radical departure from the UDT ontology is needed.