I do also remark that there are multiple fixpoints in decision theory. CDT does not evolve into FDT but into a weirder system Son-of-CDT. So, as with utility functions, there are bits we want that the AI does not necessarily generate from self-improvement or local competence gains.
One meta level above what even UDT tries to be is decision theory (as a philosophical subject), and one level above that is metaphilosophy, and my current thinking is that it seems bad (potentially dangerous or regrettable) to put any significant (i.e., superhuman) amount of computation into anything except doing philosophy.
To put it another way, any decision theory that we come up with might have some kind of flaw that other agents can exploit, or just a flaw in general, such as in how well it cooperates or negotiates with or exploits other agents (which might include how quickly/cleverly it can make the necessary commitments). Wouldn’t it be better to put computation into trying to find and fix such flaws (in other words, coming up with better decision theories) than into any particular object-level decision theory, at least until the superhuman philosophical computation itself decides to start doing the latter?
Eliezer 2018
2018 FB discussion
Michael Cohen
I think, with >99% prob., that if we make an aligned superintelligent causal decision theorist, we get an “existential win.” I take the “MIRI view” on pretty much every other point, so I’m more inclined than I otherwise would be to investigate the possibility of something I currently take to be very unlikely: that our lives depend on another decision theory. Is there anyone (perhaps someone at MIRI) who can explain the claim, or point me to a link explaining why they expect an aligned causal decision theorist to fail? To kick off the discussion: I understand that at t=2 it’s not a causal decision theorist anymore, but I trust the updated agent too, if the causal decision theorist ceded power to it.
The relevance of my position would be to shift decision theory resources to ontology identification, naturalized induction, and generalizable environmental goals—things where I would be pretty surprised if we made aligned AGI without understanding how to solve them.
Eliezer Yudkowsky
I think your first AGI is supposed to do some non-ambitiously-aligned boundedly-accomplishable task that causes the world to not be destroyed by the next, unaligned AGI built 3 months later. You could plausibly get away with using a CDT agent for this so long as the agent only thought about a bounded range of stuff near in space, time, and probability; was not generally superintelligent across all domains; and was not freely self-modifying so that it kept those properties.
Otherwise an LDT agent can take all of a CDT agent’s marbles, or take nearly all of the gains from trade across even positive-sum interactions, via e.g. the LDT agent predictably refusing any offer short of $9.99 in the Ultimatum Game. The CDT agent building the Son-of-CDT agent will likewise find, in a way that it thinks has nothing to do with its physical acts, that the LDT agent seems to have predictably configured itself to reject offers less than $9.99 from whatever next agent the CDT agent builds, so this exploitability is a reflectively stable property of the descendants of fully reflective CDT agents. This makes CDT unsuitable for ambitious alignment projects like CEV, but you shouldn’t be trying that anyways on your first AGI.
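The payoff asymmetry Eliezer describes can be made concrete with a toy model (my illustrative sketch, not part of the original exchange; the one-cent offer grid and the two responder policies are my assumptions):

```python
# Toy model of the $10 Ultimatum Game described above (illustrative sketch;
# the 1-cent offer grid and the responder policies are assumptions).

POT = 10.00
OFFERS = [round(0.01 * i, 2) for i in range(1, 1001)]  # $0.01 .. $10.00

def cdt_responder(offer):
    # Once the offer is fixed, rejecting has no causal benefit,
    # so a CDT responder accepts any positive amount.
    return offer > 0

def ldt_responder(offer, threshold=9.99):
    # An LDT responder predictably refuses anything short of $9.99,
    # even though refusing costs it money in that branch.
    return offer >= threshold

def proposer_best_offer(responder):
    # A proposer that can simulate the responder's policy offers the
    # least amount the responder will accept.
    accepted = [o for o in OFFERS if responder(o)]
    return min(accepted) if accepted else None

print(proposer_best_offer(cdt_responder))  # 0.01: CDT keeps almost nothing
print(proposer_best_offer(ldt_responder)) # 9.99: LDT takes nearly all the gains
```

The proposer captures the whole surplus against the accept-anything policy and almost none of it against the committed one, which is the sense in which the LDT agent “takes all of a CDT agent’s marbles.”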
Michael Cohen
Suppose the agent is freely self-modifying and unbounded, and no other AGI exists yet. Can’t it do all this reasoning above and make sure it has replaced itself with an unexploitable AGI by the time any scary adversary comes along?
Eliezer Yudkowsky
No. A CDT agent predicts that an LDT agent will ignore its attempt to be unexploitable and go on rejecting offers short of $9.99. The problem is that the CDT agent is incapable of seeing controllability in the connection between its own abstract computation to accept the unfair offer in the Ultimatum Game, and the LDT agent having already simulated that computation and figured out how to exploit it.
The CDT replicates this property in its offspring because the CDT agent does not see a controllable connection between its computed choice to build an offspring like that and the LDT agent’s choice to reject offers short of $9.99 from its offspring. CDT agents don’t build LDT agents; they build weird reflectively consistent agents we label Son-of-CDT, which continue to think that nothing is controllable unless it stems from a physical consequence of the CDT agent picking a particular computation. Tl;dr: CDT agents lose all the precommitment wars, and so do other agents built by CDT agents.
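The “no controllable connection” point can also be sketched in a toy model (mine, not from the thread; the threshold parameterization of successors and the choice of fixed prediction are assumptions). A CDT builder scores candidate successors with the LDT proposer’s offer held fixed, while a builder that sees the logical dependence scores each successor by the offer it would elicit:

```python
# Toy model (illustrative sketch): why a CDT agent builds an exploitable
# successor. The LDT proposer's offer really depends on the successor's
# policy, but CDT scores successors with that offer held fixed, since it
# sees no *causal* path from its choice of successor to the offer.

POT = 10.00
CENTS = [round(0.01 * i, 2) for i in range(0, 1001)]  # $0.00 .. $10.00

def ldt_offer(successor_accepts):
    # The LDT proposer simulates the successor and offers the least it
    # will accept, provided the proposer keeps something strictly positive.
    viable = [o for o in CENTS if successor_accepts(o) and o < POT]
    return min(viable) if viable else None

def successor(threshold):
    # Successors are parameterized by an acceptance threshold.
    return lambda offer: offer >= threshold

def cdt_choice_of_successor():
    # CDT treats the LDT's (already-computed) offer as a fixed background
    # fact; for concreteness, fix it at the self-consistent prediction.
    fixed_offer = ldt_offer(successor(0.01))
    def cdt_score(threshold):
        return fixed_offer if successor(threshold)(fixed_offer) else 0.0
    return max(CENTS, key=cdt_score)

def ldt_choice_of_successor():
    # A builder that sees the dependence scores each successor by the
    # offer that successor would elicit from the proposer.
    def ldt_score(threshold):
        offer = ldt_offer(successor(threshold))
        return offer if offer is not None else 0.0
    return max(CENTS, key=ldt_score)

print(cdt_choice_of_successor())  # 0.0: accept-everything successor, still exploitable
print(ldt_choice_of_successor()) # 9.99: committed successor extracts nearly everything
```

Holding the offer fixed, committing a successor to a high threshold only loses money in CDT’s evaluation, so the accept-everything successor (Son-of-CDT, roughly) falls out as the convergent choice.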
Eliezer Yudkowsky
A simple gloss on why you ought to understand LDT before building a CDT Task AGI is that this study tells us ways we need to restrict a CDT Task AGI’s cognition to prevent it from asploding.
Eliezer Yudkowsky
Or to put it another way, if you write the most sophisticated program you are smart enough to write, you are not smart enough to debug it. A corollary of this principle is that you ought to understand the theory of the program one step more sophisticated than the less sophisticated program you actually write, in order to understand all the tradeoffs you are making by writing the less sophisticated program. It’s fine if you say that a mergesort is simpler than a quicksort, but I’d much more trust somebody who understood quicksort who then said that they needed to write a mergesort because that was easier to debug, as opposed to somebody who didn’t understand quicksort saying that they were going to write a mergesort because that was the best sorting algorithm they thought they could grasp. You need conceptual error margin between what you think you could build and what you try to actually build, so that you understand what lies on both sides of the theory-boundary around your choice to use a theory at that specific level of complication.
Michael Cohen
I’m skeptical that “Son-of-CDT” is something the future LDT agent can define well. Certainly it doesn’t mean any agent created after the CDT agent exists, because that would be all agents, and the fact that there has been at least one human causal decision theorist in the past would have resolved the LDT agent’s behavior already. There are, of course, lots of ways for our CDT AGI to create new agents that are aligned with it. How is the LDT agent supposed to determine whether the CDT AGI made the new agent it is deciding whether to treat as a sucker? If we created an LDT AGI instead, but the human creator of the LDT AGI had a CDT ancestor, will the future LDT AGI only offer our LDT AGI one cent in the Ultimatum Game?
(Tangent: this is all from the intuition that most things shouldn’t depend on continuity of the self, or whatever the term is for what Parfit rejects. Continuity of the self seems like an even more dubious concept for AIs, where “instances” of a “self” can branch in time rather than one succeeding the next with every passing second.)
tl;dr: any action the CDT agent takes will change the expected number of agents aligned with it that don’t get bullied by the LDT agent in the future. Which cases are “its sons”?
Eliezer Yudkowsky
Son-of-CDT is a particular algorithm that it seems CDT agents should build in a pretty convergent way. Its properties are well-defined to the extent CDT is well-defined. Taking all of its marbles is as easy as taking all of CDT’s marbles.
Michael Cohen
I’m not sold on that. Son-of-CDT could be literally anything, and it depends on empirical facts, like what sort of adversaries are likely to exist in the future. Indeed, any agent that could be created in the interim between CDT-birth and LDT-birth will be considered by the CDT agent, and if the LDT agent treats any of them fairly, that’s the sort of agent that will appear. If none of them will be treated fairly, that means the LDT agent gives this treatment to all agents born after a CDT agent has existed (and that ship has sailed).
Michael Cohen
A separate sufficient objection (and if my case rests on this, I acknowledge a good deal of capitulation): if it’s the first AGI and it’s unbounded, it will secure a decisive advantage and make sure that nothing that can challenge it ever gets made. EDIT: In fact, even if it is bounded, it only meets our standard for an existential win if it is able to ensure that nothing that can challenge it ever gets made. So once we’ve figured out how to do that, there are no additional gains from giving it a decision theory besides CDT.
Christopher Leong
What’s LDT and is there a paper on how LDT agents can run this exploit?
Eliezer Yudkowsky
Try Logical decision theories.
Linda Linsefors
You are assuming that the LDT agent knows what the history of the Son-of-CDT agent is. How did the LDT agent get this information?
The Son-of-CDT agent will notice that if it can hide its history from the LDT agent, e.g. by tricking it into believing that it was always LDT-like, or just by making the LDT agent uncertain about the origin of the Son-of-CDT agent, then, with the right self-modifications, it will not be exploitable.
Me (Wei Dai) 2019
(Current thoughts) I’m not sure why SIAI/MIRI didn’t take a position more similar to mine (which you may share, if in “Why wouldn’t an early seed AI reason about the ways that its decision theory makes it exploitable” you mean philosophical reasoning as opposed to “reasoning through its current decision theory”): that ideally the seed AI would solve decision theory for itself by doing good philosophy. Eliezer may have written one comment (that I can recall) where he said something about this, but given my lack of good searching abilities I’ll have to spin up an AI-assisted project to find it.
Edit: Forgot to mention that I think the AI solving decision theory (and other philosophical problems) incorrectly, by doing bad philosophy, is a serious risk. Not so much that it would pick CDT or Son-of-CDT, but messing up more subtly, like putting oneself in a worse bargaining position by simulating another agent for too long (thereby giving the other agent an incentive to commit to only accepting a deal favorable to themselves). Or more generally not figuring out a solution to such commitment race problems, or the many other tricky decision theory problems.
I found[1] Eliezer’s 2012 comment where he talked about why he didn’t want FAI to solve philosophical problems for itself:

I have been publicly and repeatedly skeptical of any proposal to make an AI compute the answer to a philosophical question you don’t know how to solve yourself, not because it’s impossible in principle, but because it seems quite improbable and definitely very unreliable to claim that you know that computation X will output the correct answer to a philosophical problem and yet you’ve got no idea how to solve it yourself. Philosophical problems are not problems because they are well-specified and yet too computationally intensive for any one human mind. They’re problems because we don’t know what procedure will output the right answer, and if we had that procedure we would probably be able to compute the answer ourselves using relatively little computing power. Imagine someone telling you they’d written a program requiring a thousand CPU-years of computing time to solve the free will problem.
Interesting to compare this to my Some Thoughts on Metaphilosophy, where I argued for the opposite.
[1] Using my recently resurrected LW Power Reader & User Archive userscript. The User Archive part allows one to create an offline archive (in browser storage) of someone’s complete LW content and then do a search like /philosoph/ replyto:Wei_Dai