Eli Tyre comments on Eli’s shortform feed

Eli Tyre 15 Feb 2026 3:15 UTC
21 points
1
Does anyone know why the early Singularity Institute prioritized finding the correct solution to decision theory as an important subproblem of building a Friendly AI?

Wei Dai recently said that the concern was something like...
we have to fully solve DT before building AGI/ASI, otherwise it could be catastrophic due to something like the AI falling prey to an acausal threat or commitment races, or can’t cooperate with other AIs.
This seems like a very surprising reason to me. I don’t understand why this problem needed to be solved before the intelligence explosion.

The early singularity institute imagined building a seed AI that would recursively self improve into a full superintelligence. The default expectation was that many to all of the components of the seed AI would be flawed. The AI would reflect on it’s processes, evaluate their effectiveness and optimality, and then replace them with improved versions.

Why wouldn’t an early seed AI reason about the ways that it’s decision theory makes it exploitable, or the ways it’s decision theory which it bars it from cooperation with distant superintelligence (just as the the researchers at SI were doing), find the best solution to those problems, and then modify the decision theory?

Why was it thought that we had to get decision theory right in the initial conditions instead of being just one more thing that the AI would iron out on the way to superintelligence?
- Wei Dai 16 Feb 2026 1:02 UTC
  21 points
  2
  Parent
  Eliezer 2018
  
  I do also remark that there are multiple fixpoints in decision theory. CDT does not evolve into FDT but into a weirder system Son-of-CDT. So, as with utility functions, there are bits we want that the AI does not necessarily generate from self-improvement or local competence gains.
  
  2018 FB discussion
  
  Michael Cohen
  
  I think, with >99% prob., if we make an aligned superintelligent causal decision theorist, that we get an “existential win.” I take the “MIRI view” on pretty much every other point, so I’m more inclined than I otherwise would be in investigate the possibility of something I currently take to be very unlikely: that our lives depend on another decision theory. Is there anyone (perhaps someone at MIRI) who can explain the claim or point me to a link explaining why they expect an aligned causal decision theorist to fail? To kick off the discussion, I understand that at t=2, it’s not a causal decision theorist anymore, but I trust the updated agent too, if the causal decision theorist ceded power to it.
  
  The relevance of my position would be to shift decision theory resources to ontology identification, naturalized induction, and generalizable environmental goals—things where I would pretty surprised if we make aligned AGI without understanding how to solve those problems.
  
  Eliezer Yudkowsky
  
  I think your first AGI is supposed to do some non-ambitiously-aligned boundedly-accomplishable task that causes the world to not be destroyed by the next, unaligned AGI built 3 months later. You could plausibly get away with using a CDT agent for this so long as the agent only thought about a bounded range of stuff near in space, time, and probability; was not generally superintelligent across all domains; and was not freely self-modifying so that it kept those properties.
  
  Otherwise an LDT agent can take all of a CDT agent’s marbles, or take nearly all of the gains from trade across even positive-sum interactions, via e.g. the LDT agent predictably refusing any offer short of $9.99 in the Ultimatum Game. The CDT agent building the Son-of-CDT agent will likewise find, in a way that it thinks has nothing to do with its physical acts, that the LDT agent seems to have predictably configured itself to reject offers less than $9.99 from whatever next agent the CDT agent builds, so this exploitability is a reflectively stable property of the descendants of fully reflective CDT agents. This makes CDT unsuitable for ambitious alignment projects like CEV, but you shouldn’t be trying that anyways on your first AGI.
  
  Michael Cohen
  
  Suppose the agent is freely self-modifying and unbounded, and no other AGI exists yet. Can’t it do all this reasoning above and make sure it has replaced itself with an unexploitable AGI by the time any scary adversary comes along?
  
  Eliezer Yudkowsky
  
  No. A CDT agent predicts that an LDT agent will ignore its attempt to be unexploitable and go on rejecting offers short of $9.99. The problem is that the CDT agent is incapable of seeing controllability in the connection between its own abstract computation to accept the unfair offer in the Ultimatum Game, and the LDT agent having already simulated that computation and figured out how to exploit it.
  
  The CDT replicates this property in its offspring because the CDT agent does not see a controllable connection between its computed choice to build an offspring like that and the LDT agent’s choice to reject offers short of $9.99 from its offspring. CDT agents don’t build LDT agents, they build weird reflectively consistent agents we label Son-of-CDT which continue to think that nothing is controllable unless it stems from a physical consequence of the CDT agent picking a particular computation. Tl;dr CDT agents lose all the precommitment wars and so do other agents build by CDT agents.
  
  Eliezer Yudkowsky
  
  A simple gloss on why you ought to understand LDT before building a CDT Task AGI is that this study tells us ways we need to restrict a CDT Task AGI’s cognition to prevent it from asploding.
  
  Eliezer Yudkowsky
  
  Or to put it another way, if you write the most sophisticated program you are smart enough to write, you are not smart enough to debug it. A corollary of this principle is that you ought to understand the theory of the program one step more sophisticated than the less sophisticated program you actually write, in order to understand all the tradeoffs you are making by writing the less sophisticated program. It’s fine if you say that a mergesort is simpler than a quicksort, but I’d much more trust somebody who understood quicksort who then said that they needed to write a mergesort because that was easier to debug, as opposed to somebody who didn’t understand quicksort saying that they were going to write a mergesort because that was the best sorting algorithm they thought they could grasp. You need conceptual error margin between what you think you could build and what you try to actually build, so that you understand what lies on both sides of the theory-boundary around your choice to use a theory at that specific level of complication.
  
  Michael Cohen
  
  I’m skeptical that “son-of-cdt” is something the future ldt can define well. Certainly it doesn’t mean any agent created after the cdt agent exists, because that would be all agents, and the fact that there is has been one human cdt in the past would have resolved the ldts behavior already. There are, of course, lots of ways for our cdt AGI to create new agents that are aligned with it. How is the ldt supposed to define whether the cdt AGI made this new agent for which it is wondering whether to treat like as a sucker? If we created an ldt AGI instead, but the human creator of the ldt AGI had a cdt ancestor, will the future ldt AGI only offer our ldt AGI one cent in the ultimatum game?
  
  (Tangent: this is all from the intuition that most things shouldn’t depend on the continuity of the self, or whatever the term is for what parfit rejects. Continuity of the self seems like an even more dubious concept for AIs where “instances” of a “self” can branch in time rather than one succeeding the next with every passing second.)
  
  tl;dr any action the cdt takes will change the expected number of agents aligned with it that don’t get bullied by the ldt in the future. Which cases are “its sons”?
  
  Eliezer Yudkowsky
  
  Son-of-CDT is a particular algorithm that it seems CDT agents should build in a pretty convergent way. Its properties are well-defined to the extent CDT is well-defined. Taking all of its marbles is as easy as taking all of CDT’s marbles.
  
  Michael Cohen
  
  I’m not sold on that. Son-of-CDT could be literally anything, and it depends on empirical facts, like what sort of adversaries are likely to exist in the future. Indeed, any agent that could be created in the interim between cdt-birth and ldt-birth will be considered by cdt, and if ldt treats any of them fairly, that’s the sort of agent that will appear. If none of them will be treated fairly, that means ldt gives this treatment to all agents born after a cdt has existed (and that ship has sailed).
  
  Michael Cohen
  
  A separate sufficient objection (and if my case rests on this I acknowledge a good deal of capitulation): if it’s the first agi and it’s unbounded, it will secure a decisive advantage and make sure that nothing that can challenge it ever gets made. EDIT: In fact, even if it is bounded, it only meets our standard for an existential win if it is able to ensure than nothing that can challenge it ever gets made. So once we’ve figured out how to do that, there are no additional gains from giving it a decision theory besides cdt.
  
  Christopher Leong
  
  What’s LDT and is there a paper on how LDT agents can run this exploit?
  
  Eliezer Yudkowsky
  
  Try Logical decision theories.
  
  Linda Linsefors
  
  A CDT agent predicts that an LDT agent will ignore its attempt to be unexploitable and go on rejecting offers short of $9.99. The problem is that the CDT agent is incapable of seeing controllability in the connection between its own abstract computation to accept the unfair offer in the Ultimatum Game, and the LDT agent having already simulated that computation and figured out how to exploit it.
  
  You are assuming that the LDT agent knows the history of the son-of-CDT is. How did the LDT agent get this information?
  
  The son-of-CDT agent will notice that if it can hide its history from the LDT agent, e.g. tricking it into believing that it was always LDT-like, or just making the LDT uncertain about the origin of the son-of-CDT, then, with the right self modifications, the son-of-CDT will not be exploitable.
  
  Me (Wei Dai) 2019
  
  One meta level above what even UDT tries to be is decision theory (as a philosophical subject) and one level above that is metaphilosophy, and my current thinking is that it seems bad (potentially dangerous or regretful) to put any significant (i.e., superhuman) amount of computation into anything except doing philosophy.
  
  To put it another way, any decision theory that we come up with might have some kind of flaw that other agents can exploit, or just a flaw in general, such as in how well it cooperates or negotiates with or exploits other agents (which might include how quickly/cleverly it can make the necessary commitments). Wouldn’t it be better to put computation into trying to find and fix such flaws (in other words, coming up with better decision theories) than into any particular object-level decision theory, at least until the superhuman philosophical computation itself decides to start doing the latter?
  
  (Current thoughts) I’m not sure why SIAI/MIRI didn’t take a position more similar to mine (which you may share, if in “Why wouldn’t an early seed AI reason about the ways that it’s decision theory makes it exploitable” you mean philosophical reasoning as opposed to “reasoning through its current decision theory”) that ideally the seed AI would solve decision theory for itself by doing good philosophy. Eliezer may have written one comment (that I can recall) where he said something about this, but given lack of good searching abilities I’ll have to spin up an AI-assisted project to find it.
  
  Edit: Forgot to mention that I think the AI solving decision theory (and other philosophical problems) incorrectly, by doing bad philosophy, is a serious risk. Not so much that it would pick CDT or Son-of-CDT, but messing up more subtly, like putting oneself in a worse bargaining position by simulating another agent for too long (thereby giving the other agent an incentive to commit to only accepting a deal favorable to themselves). Or more generally not figuring out a solution to such commitment race problems, or the many other tricky decision theory problems.
  - Wei Dai 18 Feb 2026 2:34 UTC
    11 points
    2
    Parent
    I found^[1] Eliezer’s 2012 comment where he talked about why he didn’t want FAI to solve philosophical problems for itself:
    I have been publicly and repeatedly skeptical of any proposal to make an AI compute the answer to a philosophical question you don’t know how to solve yourself, not because it’s impossible in principle, but because it seems quite improbable and definitely very unreliable to claim that you know that computation X will output the correct answer to a philosophical problem and yet you’ve got no idea how to solve it yourself. Philosophical problems are not problems because they are well-specified and yet too computationally intensive for any one human mind. They’re problems because we don’t know what procedure will output the right answer, and if we had that procedure we would probably be able to compute the answer ourselves using relatively little computing power. Imagine someone telling you they’d written a program requiring a thousand CPU-years of computing time to solve the free will problem.
    Interesting to compare this to my Some Thoughts on Metaphilosophy, where I argued for the opposite.
    ^
    Using my recently resurrected LW Power Reader & User Archive userscript. The User Archive part allows one to create an offline archive (in browser storage) of someone’s complete LW content and then do a search like /philosoph/ replyto:Wei_Dai
- Buck 15 Feb 2026 7:09 UTC
  14 points
  −5
  Parent
  Why wouldn’t an early seed AI reason about the ways that it’s decision theory makes it exploitable, or the ways it’s decision theory which it bars it from cooperation with distant superintelligence (just as the the researchers at SI were doing), find the best solution to those problems, and then modify the decision theory?
  I think that decision theory is probably more like values than empirical beliefs, in that there’s no reason to think that sufficiently intelligent beings will converge to the same decision theory. E.g. I think CDT agents self-modify into having a decision theory that is not the same as what EDT agents self-modify into.
  (Of course, like with values, it might be the case that you can make AIs that are “decision-theoretically corrigible”: these AIs should try to not take actions that rely on decision theories that humans might not endorse on reflection, and they should try to help humans sort out their decision theory problems. I don’t have an opinion on whether this strategy is more or less promising for decision theories than for values.)
  (Aside from decision theory and values, the main important thing that I think might be “subjective” is something like your choice over the universal prior.)
  - habryka 15 Feb 2026 19:14 UTC
    2 points
    −8
    Parent
    I think this is extremely unlikely and I am honestly very confused what you could possibly mean here. Are you saying that there is no sense in which greater intelligence reliably causes you to cooperate with copies of yourself in the prisoner’s dilemma?
    (And on the meta level, people saying stuff like this makes me think that I would really still like more research into decision-theory, because I think there are strong arguments in the space that could be cleaned up and formalized, and it evidently matters quite a bit because it causes people to make really weird and to-me-wrong-seeming predictions about the future)
    - Buck 15 Feb 2026 20:01 UTC
      5 points
      5
      Parent
      CDT agents will totally self modify into agents that cooperate in twin prisoners dilemma, but my understanding is that the thing it self modifies into (called “son of CDT”) behaves differently than e.g. the thing EDT agents self modify into.
      - JesseClifton 15 Feb 2026 22:46 UTC
        6 points
        4
        Parent
        They will only self-modify to cooperate with twins whose action is causally downstream of their commitment, right? So a CDT agent will not self-modify to a policy that does acausal trade with twins outside the lightcone for example.
      - habryka 15 Feb 2026 20:32 UTC
        5 points
        3
        Parent
        Yeah, I am not saying there is 100% convergence in decision-theory land (there also isn’t 100% convergence in epistemology land), but this is very different from saying “I think that decision theory is probably more like values than empirical beliefs”.
        Bayesian priors also don’t converge, they only converge in classes (most obviously anything you assign zero probability to is something you will never start believing).
        The situation with decision-theory seems a-priori pretty similar. There is lots of convergence, but the convergence only occurs under various conditions, and probably will form various classes of possible theories (and then my guess is similar to probability theory, in practice one of the classes will be the one that we expect all actual minds to fall into, but that’s very much something up in the air and unstudied and I am not confident of it).^[1]
        ^
        Also, to be clear, the “values are things you choose” thing is of course also only partially true. At least any human minds, and probably any AI minds we will create, will only have some very partial representation of their values that will require a huge amount of logical reasoning and interplay with decision-theory and epistemology to meaningfully unfold into something that could constitute a preference ordering.
        So in some sense I am not even sure how to talk about preferences in the absence of decision-theory and epistemology, both of which have a lot of structure and convergence and as such will create convergence dynamics in value-space as well. My values are certainly subject to reflection which depends on my epistemological and decision-theoretic principles, and the same seems true for almost all minds.
- habryka 15 Feb 2026 4:19 UTC
  12 points
  7
  Parent
  If I imagine being as confused as economists are about CDT, I do really repeatedly end up making very dumb and wrong predictions about what e.g. AIs would do when you have many copies of them, and they try to coordinate with each other.
  Like, it rarely happens that I have a conversation about either technical AI safety, or about AI strategy, where decision theory considerations don’t at least come up once in some form or another. Not having an answer here feels probably like people must have felt before we had probability theory, which also I have no idea what I would do without.
  - Buck 15 Feb 2026 7:04 UTC
    4 points
    0
    Parent
    I think this is a good argument for understanding basic decision theory points, but I don’t think it leads to you needing to develop any fancier decision theory—arguing about what decision theory AIs will use just requires thinking about descriptive facts about decision theories, rather than coming up with decision theories that work well in limits that aren’t important for the most important kinds of AI futurism (including the AI futurism questions I think you’re talking about here).
    - habryka 15 Feb 2026 9:25 UTC
      7 points
      0
      Parent
      “Basic decision theory points” feels like a pretty weird description of something that even quite smart people still frequently disagree on, has no formal description, and indeed often turns out to be the crux of an argument.
      I currently don’t think it’s worth my time figuring things out here much more, but that’s mostly because I do have some reasonable confidence that thinking about decision theory harder probably won’t produce any quick breakthroughs. But if I was in the world that MIRI faced 15 years ago, my guess is I would have thought it was worth investing in quite a bit, in case it does turn out to be relatively straightforward (which it so far has not turned out to be).
      rather than coming up with decision theories that work well in limits that aren’t important for the most important kinds of AI futurism
      Pushing more straightforwardly back on this: I do not think our current understanding of decision-theory is better in the mundane case than the limit case. Of course the reason to look at limiting cases is because you always do that in math because the limiting cases often turn out easier, not harder than the mundane case.
- quetzal_rainbow 15 Feb 2026 4:07 UTC
  6 points
  2
  Parent
  It’s tiling agents+embedded agency agenda. They wanted to find a non-trivial reflectively-stable embedded-in-environment structure and decision theory lies on intersection.
- Matrice Jacobine 16 Feb 2026 1:29 UTC
  4 points
  0
  Parent
  @Rob Bensinger on the EA Forum:
  As a side-note, I do want to emphasize that from the MIRI cluster’s perspective, it’s fine for correct reasoning in AGI to arise incidentally or implicitly, as long as it happens somehow (and as long as the system’s alignment-relevant properties aren’t obscured and the system ends up safe and reliable).
  The main reason to work on decision theory in AI alignment has never been “What if people don’t make AI ‘decision-theoretic’ enough?” or “What if people mistakenly think CDT is correct and so build CDT into their AI system?” The main reason is that the many forms of weird, inconsistent, and poorly-generalizing behavior prescribed by CDT and EDT suggest that there are big holes in our current understanding of how decision-making works, holes deep enough that we’ve even been misunderstanding basic things at the level of “decision-theoretic criterion of rightness”.
  It’s not that I want decision theorists to try to build AI systems (even notional ones). It’s that there are things that currently seem fundamentally confusing about the nature of decision-making, and resolving those confusions seems like it would help clarify a lot of questions about how optimization works. That’s part of why these issues strike me as natural for academic philosophers to take a swing at (while also being continuous with theoretical computer science, game theory, etc.).
- TsviBT 15 Feb 2026 5:48 UTC
  4 points
  0
  Parent
  IDK about the history, but at least retrospectively that’s not my reason: https://www.lesswrong.com/posts/Ht4JZtxngKwuQ7cDC?commentId=koeti9ygXB9wPLnnF

Eli Tyre comments on Eli’s shortform feed

Eliezer 2018

2018 FB discussion

Me (Wei Dai) 2019