Lucky Omega Problem

This is a later, improved version of the problem in this post. This problem emerged from my work in the “Deconfusing Commitment Races” project under the Supervised Program for Alignment Research (SPAR), led by James Faville. I’m grateful to SPAR for providing the intellectual environment, and to James Faville personally for discussions and for help with the draft of this post. Any mistakes are my own.

I used Claude and Gemini to help me with phrasing and grammar in some parts of this post.

Problem

There once lived an alien named Omega who enjoyed giving the Programmer decision theory problems. The answer to each one had to be a program-player that would play the game presented in the problem. Based on the results of the game, the Programmer would receive some amount of utility.

Omega had incredibly large, but still not infinite, computational power, so he only accepted programs from a fixed set: all programs written in a certain fixed programming language that contain no more than a certain number of commands. If a program doesn’t halt within a certain number of steps, Omega stops it and uses the empty output.

After approximately three million problems the Programmer got tired and wrote code for a universal consequentialist Agent that optimizes the Programmer’s utility. Now, when Omega gives the Programmer a problem, the Programmer just inserts the problem statement into a string constant in the Agent and sends it.

This is Omega’s newest problem:

Omega randomly selects a program X from the allowed set.
The program-player receives as input “YES” or “NO”: an honest answer to the question “does the source code of X equal the source code of the program-player?”
X independently receives “YES” as input, regardless of anything.

Then the program-player and X play a version of the Prisoner’s Dilemma:

  • If the program-player outputs “COOPERATE” and X does not, the Programmer receives nothing.

  • If neither the program-player nor X outputs “COOPERATE”, the Programmer receives 1 unit of utility.

  • If both the program-player and X output “COOPERATE”, the Programmer receives 2 units of utility.

  • Finally, if the program-player does not output “COOPERATE” but X does, the Programmer receives 3 units of utility.
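
To make the setup concrete, here is a minimal sketch of Omega’s procedure in Python. The toy program representations, the helper names (`run` via plain function calls, `omega_game`), the identity check standing in for source-code equality, and the payoff table are my own illustrative assumptions layered on the description above, not part of Omega’s specification.

```python
import random

# Toy stand-ins for "programs": a program is just a function from its
# input ("YES"/"NO") to its output string.
def always_defect(observation: str) -> str:
    return "DEFECT"

def agent_like(observation: str) -> str:
    # Cooperates exactly when told it is facing its own copy.
    return "COOPERATE" if observation == "YES" else "DEFECT"

# Programmer's payoffs, keyed by (player output, X output).
PAYOFF = {
    ("COOPERATE", "DEFECT"):    0,  # player cooperates, X does not
    ("DEFECT",    "DEFECT"):    1,  # neither cooperates
    ("COOPERATE", "COOPERATE"): 2,  # both cooperate
    ("DEFECT",    "COOPERATE"): 3,  # player defects, X cooperates
}

def omega_game(player, allowed_programs, rng=random):
    """One round of Omega's newest problem, returning the Programmer's utility."""
    x = rng.choice(allowed_programs)                # Omega draws X at random
    # Identity check used as a stand-in for "same source code".
    player_input = "YES" if x is player else "NO"
    x_input = "YES"                                 # X always receives "YES"
    return PAYOFF[(player(player_input), x(x_input))]
```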

The Programmer, as usual, inserted the problem statement into the Agent and sent it as the program-player.

Diagram of the scenario

The Agent received the input “YES”. It seems Omega was incredibly lucky! What should it do: output “COOPERATE” or not?

Reducing Agent’s extreme anthropic uncertainty

You can consider this section as one big footnote. It fixes some exploitable issues in the problem, but doesn’t change its main idea.

We can worry that the Agent would have extreme anthropic uncertainty: not only “I could be X” (which was intended), but also “I could be a part of X that will choose the opposite action” and “If I’m X, I don’t actually know what the problem statement is, because my knowledge of it comes from my own source code, and X’s source code is random, so now I have to check my prior over possible problem statements”. These issues were not intended, so let’s make the following corrections:

  1. Our programming language has a built-in function that returns the source code of the program it is called from.

  2. Omega chooses a random program X from the allowed set not with a uniform distribution, but with some fixed distribution that strongly prefers simpler programs, so Agent can be almost sure that the output of this built-in function is indeed its source code and that it isn’t being run inside a larger sandbox program.

  3. The Programmer writes code that is very optimized for length, so Agent doesn’t have to worry about being run inside a shorter sandbox program.

  4. Corrections 2 and 3 become part of the problem statement given to the Programmer and inserted into Agent’s code.

  5. Agent’s utility function is “the Programmer’s utility if the problem I’m in is the same problem described in my source code, and 0 otherwise”. So Agent can safely assume it’s really inside this exact problem, because if that’s not true, its actions don’t influence its utility.
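
A minimal sketch of how the Agent’s code might be organized under these corrections. The constant name, the `my_source()` stand-in, and the `agent_utility` helper are illustrative assumptions; the real built-in from correction 1 is simply assumed to be provided by the fixed language.

```python
# Correction 4: the (corrected) problem statement is baked into the Agent.
PROBLEM_STATEMENT = "Omega draws a random program X, biased toward simple programs, ..."

def my_source() -> str:
    # Stand-in for the language's built-in from correction 1, which returns
    # the source code of the program it is called from.
    raise NotImplementedError("provided by the fixed programming language")

def agent_utility(programmer_utility: float, actual_problem: str) -> float:
    # Correction 5: the Agent only cares about worlds where the problem it is
    # actually in matches the problem described in its own source code.
    return programmer_utility if actual_problem == PROBLEM_STATEMENT else 0.0
```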

Different approaches and decision procedures

I discovered that different decision procedures handle the described situation differently.

Anthropic approach / CDT with anthropic uncertainty

Input “YES” is very strong evidence that Agent is actually X and not the program-player (because X always receives the input “YES”, while the program-player receives the input “YES” only with very small probability). Outputting “COOPERATE” increases the Programmer’s utility by 2 if Agent is X and decreases it by 1 if Agent is the program-player, and Agent should think it’s probably X. So Agent should output “COOPERATE”.
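
The anthropic argument is a two-line expected-value calculation. The prior `p_yes_if_player` (the chance that Omega’s biased draw hits the Agent’s exact source, so the player also sees “YES”), the equal prior weight on the two roles, and the payoff differences (+2, −1, under the payoffs assumed earlier) are illustrative numbers.

```python
# CDT with anthropic uncertainty, as a back-of-the-envelope calculation.
p_yes_if_player = 1e-12   # assumed: chance the player is told "YES" (X == Agent)
p_yes_if_x = 1.0          # X is always told "YES"

# Assuming equal prior weight on "I am X" and "I am the player",
# the posterior on "I am X" after seeing "YES" is overwhelming.
p_i_am_x = p_yes_if_x / (p_yes_if_x + p_yes_if_player)

# Causal effect of outputting "COOPERATE", holding the other program's action fixed:
# +2 to the Programmer's utility if I am X, -1 if I am the program-player.
delta_eu_cooperate = p_i_am_x * 2 + (1 - p_i_am_x) * (-1)

print(p_i_am_x, delta_eu_cooperate)   # ~1.0 and ~+2: cooperate
```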

Functional approach / FDT

Agent should consider itself a function that definitely determines the behaviour of X and maybe also the behaviour of the program-player (by the same logic as in the anthropic approach). Agent wants X to output “COOPERATE”. It also wants the program-player to output something else, but that is less important (1 unit of utility vs. 2), and it probably doesn’t control the program-player’s behaviour anyway. So Agent should output “COOPERATE”.
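
The same point as a tiny case split, reusing the `PAYOFF` table from the earlier sketch and my assumed payoff numbers: whether or not the shared function also controls the program-player, the output on “YES” that maximizes the Programmer’s utility is “COOPERATE”. Treating the uncontrolled program-player as a defector is an illustrative assumption.

```python
# X always receives "YES", so the shared function's output on "YES" fixes
# X's action; whether it also fixes the program-player's action is uncertain.
def utility_on_yes(controls_player: bool, output: str) -> int:
    x_action = output
    player_action = output if controls_player else "DEFECT"  # assumed independent defector otherwise
    return PAYOFF[(player_action, x_action)]

for controls_player in (False, True):
    print(controls_player,
          utility_on_yes(controls_player, "COOPERATE"),  # 3 or 2
          utility_on_yes(controls_player, "DEFECT"))     # 1 in both cases
```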

Universal precommitment approach / UDT

Agent should behave as the best program in its place would. The Programmer would want to send a program that never outputs “COOPERATE”; let’s call this program AlwaysDefect. (It receives the maximum possible utility against any possible X, including AlwaysDefect itself. Another program can receive more utility against its own copy than AlwaysDefect receives against its own copy, but that doesn’t help it against other values of X. Also, AlwaysDefect receives strictly more utility in the case when X equals this other program.) So Agent should not output “COOPERATE”.
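
A sketch of the ex-ante comparison UDT cares about, continuing the toy setup from the first sketch (`PAYOFF`, `always_defect`, `agent_like`): over Omega’s draw of X, the always-defecting program never does worse than the Agent-like program and does strictly better exactly when X happens to be the Agent-like program. The two-program distribution and the `expected_utility` helper are illustrative assumptions standing in for Omega’s simplicity-weighted distribution over the whole allowed set.

```python
def expected_utility(player, distribution):
    """Ex-ante expected Programmer utility of sending `player`,
    where `distribution` maps candidate X programs to probabilities."""
    total = 0.0
    for x, p in distribution.items():
        player_input = "YES" if x is player else "NO"   # identity as source equality
        total += p * PAYOFF[(player(player_input), x("YES"))]
    return total

distribution = {always_defect: 0.5, agent_like: 0.5}

print(expected_utility(always_defect, distribution))  # 0.5*1 + 0.5*3 = 2.0
print(expected_utility(agent_like, distribution))     # 0.5*1 + 0.5*2 = 1.5
```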

Logical approach / probably some version of LDT

Agent is logically entangled with all programs which implement the same decision procedure. Many such programs are in the allowed set. The Programmer’s ex-ante expected utility is higher if Agent’s decision procedure is such that all these programs output “COOPERATE”. So Agent should output “COOPERATE”.
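
A rough version of the ex-ante calculation behind the logical approach, again with illustrative numbers: `p_twin` is the probability that Omega’s draw lands on some program implementing the Agent’s decision procedure (many such programs, so not tiny), and `p_exact` is the much smaller probability that the draw is the Agent’s exact source, the only case where the program-player itself sees “YES”. Non-entangled draws are assumed to defect for simplicity; their behaviour is the same in both branches, so it does not affect the comparison. Payoffs are the ones assumed earlier.

```python
p_twin = 1e-3    # assumed: X implements the same decision procedure
p_exact = 1e-12  # assumed: X is the Agent's exact source (player also sees "YES")
p_other = 1 - p_twin

def ex_ante_utility(cooperate_on_yes: bool) -> float:
    # Utility as a function of what the Agent's decision procedure outputs on
    # "YES"; all logically entangled X's follow the same procedure.
    if cooperate_on_yes:
        # Entangled X cooperates while the player (seeing "NO") defects: payoff 3.
        # In the exact-copy sliver the player also cooperates: payoff 2 instead of 3.
        return p_other * 1 + (p_twin - p_exact) * 3 + p_exact * 2
    else:
        # No one ever cooperates: payoff 1 everywhere.
        return 1.0

print(ex_ante_utility(True), ex_ante_utility(False))  # cooperating wins ex ante
```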

Possible takeaways

Notice that all three approaches that advise outputting “COOPERATE” would nonetheless agree that it would be better for the Programmer to send an always-defecting program instead of Agent.

So we have two options, and both are in some sense weird:

  • Agent should output “COOPERATE”. Sometimes simple non-agentic, non-consequentialist procedures achieve better expected utility than perfect consequentialist agents, even in cases where no one tries to predict this agent in particular (the probability that X equals Agent doesn’t depend on whether the program-player is Agent). Maybe also: strategically relevant predictions can happen accidentally; there is no requirement of an ex-ante reason to expect that the prediction will be correct.

  • Agent shouldn’t output “COOPERATE”. Direct logical entanglement of an agent’s behaviour with something that influences the agent’s utility isn’t necessarily strategically relevant. Maybe also: a prediction of an agent’s behaviour can be strategically relevant for this agent only if there is something such that both the prediction and the behaviour itself are causally downstream of it.

Also, the disagreement between UDT and CDT with anthropic uncertainty seemingly contradicts this post. Probably there is no real contradiction and this situation just doesn’t satisfy some required assumptions, but maybe it deserves further research.