Lucky Omega Problem

This is a later, improved version of the problem in this post. This problem emerged from my work in the “Deconfusing Commitment Races” project under the Supervised Program for Alignment Research (SPAR), led by James Faville. I’m grateful to SPAR for providing the intellectual environment, and to James Faville personally for discussions and for help with the draft of this post. Any mistakes are my own.

I used Claude and Gemini to help me with phrasing and grammar in some parts of this post.

Problem

There once lived an alien named Omega who enjoyed giving the Programmer decision theory problems. The answer to each one had to be a program-player that would play the game presented in the problem. Based on the results of the game, the Programmer would receive some amount of utility.

Omega had incredibly large, but still not infinite, computational power, so he only accepted programs from a fixed set: all programs written in a certain fixed programming language that contain no more than a certain number of commands. If a program doesn’t halt within a certain number of steps, Omega stops it and uses the empty output.

After approximately three million problems the Programmer got tired and wrote code for a universal consequentialist Agent that optimizes the Programmer’s utility. Now, when Omega gives the Programmer a problem, the Programmer just inserts the problem statement into a string constant in the Agent and sends it.

This is Omega’s newest problem:

Omega randomly selects a program X from the allowed set.
The program-player receives as input “YES” or “NO”: an honest answer to the question “does the source code of X equal the source code of the program-player?”
X independently receives “YES” as input, regardless of anything.

Then the program-player and X play a version of the Prisoner’s Dilemma:

  • If the program-player outputs “COOPERATE” and X does not, the Programmer receives nothing.

  • If neither the program-player nor X outputs “COOPERATE”, the Programmer receives 1 unit of utility.

  • If both the program-player and X output “COOPERATE”, the Programmer receives 2 units of utility.

  • Finally, if the program-player does not output “COOPERATE” but X does, the Programmer receives 3 units of utility.
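
To make the setup concrete, here is a minimal sketch of Omega’s procedure in Python. The toy program representations, the helper names (`run` via plain function calls, `omega_game`), the identity check standing in for source-code equality, and the payoff table are my own illustrative assumptions layered on the description above, not part of Omega’s specification.

```python
import random

# Toy stand-ins for "programs": a program is just a function from its
# input ("YES"/"NO") to its output string.
def always_defect(observation: str) -> str:
    return "DEFECT"

def agent_like(observation: str) -> str:
    # Cooperates exactly when told it is facing its own copy.
    return "COOPERATE" if observation == "YES" else "DEFECT"

# Programmer's payoffs, keyed by (player output, X output).
PAYOFF = {
    ("COOPERATE", "DEFECT"):    0,  # player cooperates, X does not
    ("DEFECT",    "DEFECT"):    1,  # neither cooperates
    ("COOPERATE", "COOPERATE"): 2,  # both cooperate
    ("DEFECT",    "COOPERATE"): 3,  # player defects, X cooperates
}

def omega_game(player, allowed_programs, rng=random):
    """One round of Omega's newest problem, returning the Programmer's utility."""
    x = rng.choice(allowed_programs)                # Omega draws X at random
    # Identity check used as a stand-in for "same source code".
    player_input = "YES" if x is player else "NO"
    x_input = "YES"                                 # X always receives "YES"
    return PAYOFF[(player(player_input), x(x_input))]
```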

The Programmer, as usual, inserted the problem statement into the Agent and sent it as the program-player.

Diagram of the scenario

The Agent received the input “YES”. It seems Omega was incredibly lucky! What should it do: output “COOPERATE” or not?

Reducing Agent’s extreme anthropic uncertainty

You can consider this section as one big footnote. It fixes some exploitable issues in the problem, but doesn’t change its main idea.

We can worry that the Agent would have extreme anthropic uncertainty: not only “I could be X” (which was intended), but also “I could be a part of X that will choose the opposite action” and “If I’m X, I don’t actually know what the problem statement is, because my knowledge of it comes from my own source code, and X’s source code is random, so now I have to check my prior over possible problem statements”. These issues were not intended, so let’s make the following corrections:

  1. Our programming language has a built-in function that returns the source code of the program it is called from.

  2. Omega chooses a random program X from the allowed set not with a uniform distribution, but with some fixed distribution that strongly prefers simpler programs, so Agent can be almost sure that the output of this built-in function is indeed its source code and that it isn’t being run inside a larger sandbox program.

  3. The Programmer writes code that is very optimized for length, so Agent doesn’t have to worry about being run inside a shorter sandbox program.

  4. Corrections 2 and 3 become part of the problem statement given to the Programmer and inserted into Agent’s code.

  5. Agent’s utility function is “the Programmer’s utility if the problem I’m in is the same problem described in my source code, and 0 otherwise”. So Agent can safely assume it’s really inside this exact problem, because if that’s not true, its actions don’t influence its utility.
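
A minimal sketch of how the Agent’s code might be organized under these corrections. The constant name, the `my_source()` stand-in, and the `agent_utility` helper are illustrative assumptions; the real built-in from correction 1 is simply assumed to be provided by the fixed language.

```python
# Correction 4: the (corrected) problem statement is baked into the Agent.
PROBLEM_STATEMENT = "Omega draws a random program X, biased toward simple programs, ..."

def my_source() -> str:
    # Stand-in for the language's built-in from correction 1, which returns
    # the source code of the program it is called from.
    raise NotImplementedError("provided by the fixed programming language")

def agent_utility(programmer_utility: float, actual_problem: str) -> float:
    # Correction 5: the Agent only cares about worlds where the problem it is
    # actually in matches the problem described in its own source code.
    return programmer_utility if actual_problem == PROBLEM_STATEMENT else 0.0
```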

Different approaches and decision procedures

I discovered that different decision procedures handle the described situation differently.

Anthropic approach / CDT with anthropic uncertainty

Input “YES” is very strong evidence that Agent is actually X and not the program-player (because X always receives the input “YES”, while the program-player receives the input “YES” only with very small probability). Outputting “COOPERATE” increases the Programmer’s utility by 2 if Agent is X and decreases it by 1 if Agent is the program-player, and Agent should think it’s probably X. So Agent should output “COOPERATE”.
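
The anthropic argument is a two-line expected-value calculation. The prior `p_yes_if_player` (the chance that Omega’s biased draw hits the Agent’s exact source, so the player also sees “YES”), the equal prior weight on the two roles, and the payoff differences (+2, −1, under the payoffs assumed earlier) are illustrative numbers.

```python
# CDT with anthropic uncertainty, as a back-of-the-envelope calculation.
p_yes_if_player = 1e-12   # assumed: chance the player is told "YES" (X == Agent)
p_yes_if_x = 1.0          # X is always told "YES"

# Assuming equal prior weight on "I am X" and "I am the player",
# the posterior on "I am X" after seeing "YES" is overwhelming.
p_i_am_x = p_yes_if_x / (p_yes_if_x + p_yes_if_player)

# Causal effect of outputting "COOPERATE", holding the other program's action fixed:
# +2 to the Programmer's utility if I am X, -1 if I am the program-player.
delta_eu_cooperate = p_i_am_x * 2 + (1 - p_i_am_x) * (-1)

print(p_i_am_x, delta_eu_cooperate)   # ~1.0 and ~+2: cooperate
```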

Functional approach / FDT

Agent should consider itself a function that definitely determines the behaviour of X and maybe also the behaviour of the program-player (by the same logic as in the anthropic approach). Agent wants X to output “COOPERATE”. It also wants the program-player to output something else, but that is less important (1 unit of utility vs. 2), and it probably doesn’t control the program-player’s behaviour anyway. So Agent should output “COOPERATE”.
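
The same point as a tiny case split, reusing the `PAYOFF` table from the earlier sketch and my assumed payoff numbers: whether or not the shared function also controls the program-player, the output on “YES” that maximizes the Programmer’s utility is “COOPERATE”. Treating the uncontrolled program-player as a defector is an illustrative assumption.

```python
# X always receives "YES", so the shared function's output on "YES" fixes
# X's action; whether it also fixes the program-player's action is uncertain.
def utility_on_yes(controls_player: bool, output: str) -> int:
    x_action = output
    player_action = output if controls_player else "DEFECT"  # assumed independent defector otherwise
    return PAYOFF[(player_action, x_action)]

for controls_player in (False, True):
    print(controls_player,
          utility_on_yes(controls_player, "COOPERATE"),  # 3 or 2
          utility_on_yes(controls_player, "DEFECT"))     # 1 in both cases
```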

Universal precommitment approach / UDT

Agent should behave as the best program in its place would. The Programmer would want to send a program that never outputs “COOPERATE”; let’s call this program AlwaysDefect. (It receives the maximum possible utility against any possible X, including AlwaysDefect itself. Another program can receive more utility against its own copy than AlwaysDefect receives against its own copy, but that doesn’t help it against other values of X. Also, AlwaysDefect receives strictly more utility in the case when X equals this other program.) So Agent should not output “COOPERATE”.
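
A sketch of the ex-ante comparison UDT cares about, continuing the toy setup from the first sketch (`PAYOFF`, `always_defect`, `agent_like`): over Omega’s draw of X, the always-defecting program never does worse than the Agent-like program and does strictly better exactly when X happens to be the Agent-like program. The two-program distribution and the `expected_utility` helper are illustrative assumptions standing in for Omega’s simplicity-weighted distribution over the whole allowed set.

```python
def expected_utility(player, distribution):
    """Ex-ante expected Programmer utility of sending `player`,
    where `distribution` maps candidate X programs to probabilities."""
    total = 0.0
    for x, p in distribution.items():
        player_input = "YES" if x is player else "NO"   # identity as source equality
        total += p * PAYOFF[(player(player_input), x("YES"))]
    return total

distribution = {always_defect: 0.5, agent_like: 0.5}

print(expected_utility(always_defect, distribution))  # 0.5*1 + 0.5*3 = 2.0
print(expected_utility(agent_like, distribution))     # 0.5*1 + 0.5*2 = 1.5
```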

Logical approach / probably some version of LDT

Agent is logically entangled with all programs which implement the same decision procedure. Many such programs are in the allowed set. The Programmer’s ex-ante expected utility is higher if Agent’s decision procedure is such that all these programs output “COOPERATE”. So Agent should output “COOPERATE”.
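
A rough version of the ex-ante calculation behind the logical approach, again with illustrative numbers: `p_twin` is the probability that Omega’s draw lands on some program implementing the Agent’s decision procedure (many such programs, so not tiny), and `p_exact` is the much smaller probability that the draw is the Agent’s exact source, the only case where the program-player itself sees “YES”. Non-entangled draws are assumed to defect for simplicity; their behaviour is the same in both branches, so it does not affect the comparison. Payoffs are the ones assumed earlier.

```python
p_twin = 1e-3    # assumed: X implements the same decision procedure
p_exact = 1e-12  # assumed: X is the Agent's exact source (player also sees "YES")
p_other = 1 - p_twin

def ex_ante_utility(cooperate_on_yes: bool) -> float:
    # Utility as a function of what the Agent's decision procedure outputs on
    # "YES"; all logically entangled X's follow the same procedure.
    if cooperate_on_yes:
        # Entangled X cooperates while the player (seeing "NO") defects: payoff 3.
        # In the exact-copy sliver the player also cooperates: payoff 2 instead of 3.
        return p_other * 1 + (p_twin - p_exact) * 3 + p_exact * 2
    else:
        # No one ever cooperates: payoff 1 everywhere.
        return 1.0

print(ex_ante_utility(True), ex_ante_utility(False))  # cooperating wins ex ante
```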

Possible takeaways

Notice that all three approaches that advise outputting “COOPERATE” would nonetheless agree that it would be better for the Programmer to send an always-defecting program instead of Agent.

So we have two options, and both are in some sense weird:

  • Agent should output “COOPERATE”. Sometimes simple non-agentic, non-consequentialist procedures achieve better expected utility than perfect consequentialist agents, even in cases where no one tries to predict this agent in particular (the probability that X equals Agent doesn’t depend on whether the program-player is Agent). Maybe also: strategically relevant predictions can happen accidentally; there is no requirement of an ex-ante reason to expect that the prediction will be correct.

  • Agent shouldn’t output “COOPERATE”. Direct logical entanglement of an agent’s behaviour with something that influences the agent’s utility isn’t necessarily strategically relevant. Maybe also: a prediction of an agent’s behaviour can be strategically relevant for this agent only if there is something such that both the prediction and the behaviour itself are causally downstream of it.

Also, the disagreement between UDT and CDT with anthropic uncertainty seemingly contradicts this post. Probably there is no real contradiction and this situation just doesn’t satisfy some required assumptions, but maybe it deserves further research.