Previously “Lanrian” on here. Research analyst at Redwood Research. Views are my own.
Feel free to DM me, email me at [my last name].[my first name]@gmail.com or send something anonymously to https://www.admonymous.co/lukas-finnveden
I think the same problem arises for some empirical questions too—T1 and T2 can be questions like “is iron’s atomic number 26 or 27?” I would have been roughly 50-50 before looking it up, but I’m uncertain if I should try to cooperate with people living in worlds where the atomic number of iron is 27 - I don’t know if those worlds are compatible with life.
Minor: The question about whether those worlds are compatible with life seems like a logical rather than empirical question to me. So this still seems like an issue with logical updatelessness rather than empirical updatelessness.
As an example: when I’m betting on the atomic number of iron, I shouldn’t think of myself as cooperating with versions of myself who live in a world where iron has 27 protons. Those worlds might not exist. But I’m cooperating with instances where the game-master decided to ask if iron has 25 or 26 protons.
As in: If you’re in a counterfactual mugging where Omega says they’d reward you in a world where iron has 27 protons if you pay in this world, then you pay because you expect there to be a bunch of Omegas elsewhere running other logical counterfactual muggings. In roughly half of those cases, someone is about to get paid by Omega if their impossible counterpart pays. And your action provides evidence that their impossible counterpart pays, and hence that Omega gives them the reward.
And the same structure applies in the calculator example and the “conjunction of two theorems” example: you’re correlated with a bunch of distant people about whose situations you have so little information that your epistemic position is “ex-ante” relative to their dilemma. So even if you’re updateful, you bet to optimize ex-ante utility in your case, in order to get evidence that they bet to optimize ex-ante utility in theirs.
Hm, maybe that’s right.
Doesn’t that feel really unsatisfying though? I still feel like updateful EDT recommends wrong actions in important test cases. It’s just a contingent fact that most dilemmas like this will be small-scale in a large-scale universe, so that EDT’s recommendation to act as if you double-update will be swamped by not wanting to get evidence that other people elsewhere double-update. And there will always be that force pushing towards double-updating, so if you ever get evidence that a decision is high-stakes enough and universal enough throughout the universe, EDT may well recommend that you make an exception for it and do a proper double-update on it, which seems bad.
Markets can be better or worse depending on e.g. liquidity. My guess would be that today’s markets are better. (The large difference between 83 and 91 cents failing to be arbitraged away is an indication that at least one of those markets wasn’t so great, though I haven’t checked how current markets look on that metric.)
Maybe I’m confused about how much you believe that my actual life history matters. I think in the case of empirical updatelessness, my life history doesn’t really matter—I will eventually try to trade with people in proportion to something like their measure in the Solomonoff prior, and not with worlds where Austria and Australia are the same country, even though I was uncertain about this empirical fact when I was 5. (Do you agree with this, or do you think life history also matters for empirically updateless trade?)
If you were 100% empirically updateless and 0% logically updateless, then I think your life history wouldn’t matter except insofar as it led you to learning different logical facts. Insofar as you would eventually reach ‘logical maturity’ regardless of your life history, and learn some vast and similar swaths of logical facts, then yeah, eventually your life history wouldn’t matter.
I expect that logical updatelessness is similar—I will try to use some elegant construction like the Solomonoff prior to put a weight on different logical counterfactuals, and it won’t matter how my prior was constructed in my childhood.
My current understanding is that this requires updating on a bunch of logical arguments (about what that elegant construction should be and what it implies) and can therefore get you dutch-booked.
In many ways I like the baseline strategy of “ignore decision theory, act in ways that heuristically seem like they gather option-value, figure out all the hard stuff with the help of superintelligence”.
I guess your proposal is similar to that except there’s an addition of “we have a hunch that something like ECL works and that this means we should be a bit more cooperative, so we’ll be a bit more cooperative”.
But for some purposes, it does seem useful to know the implications of decision theory. A few examples (some more important than others):
Are there categories of information that people can plausibly be harmed by learning, which we should try to avoid, or will it clearly be correct (on reflection) to be retroactively updateless?
Understand more detailed implications of ECL, like who we should cooperate with and how much.
Should we do some crazy DT-motivated pseudo-alignment scheme like this or this?
For the purposes of answering questions like this, I’m interested in whether I basically buy EDT (and with what kind of updatelessness) or if the real answer to DT is going to be pretty different.
One concern here is that EDT is maybe less of a coherent or appropriate decision theory than I thought. I don’t really think your plan addresses that. Like, your plan talks about how we’ll have lots of resources to think about what our prior should be, but doesn’t really address this part:
As someone who is sympathetic to EDT, I have arguments on the one hand that I shouldn’t update my prior on the basis of observations or proofs. (Indeed — arguments that suggest that such updates would lead to seemingly dumb and surely-avoidable accounting error.) But on the other hand, I need to do some sort of reasoning or learning to construct (or refine into coherence) my prior in the first place. I don’t currently know how to do this latter thing while avoiding the former thing.
These don’t seem like esoteric failures where it’s plausible that the hypothetical isn’t fair, or where the dynamic inconsistency is plausibly explained as being caused by changing values, or where there’s a fundamental tradeoff between the harms and value of new information. The errors happen in very mundane situations, and to me they look like dumb and surely-avoidable accounting errors.
More precisely: I think these failures probably enable Dutch books that CDT isn’t susceptible to.
So it’s not just that insufficient updatelessness fails to capture some potential value that you could have gotten if you were more updateless. It’s that it’s actively worse than CDT in some cases.
And that’s a large part of why I now feel more unhappy about EDT. I feel like I have a better sense of what my beliefs are than what my prior is. If EDT requires me to act according to my prior (and will lead to me making stupid decisions if I instead act according to changing beliefs) then I’m not sure exactly how to do that.
Here’s my current picture of EDT and UDT.
In situations where EDT agents have many copies or near-copies, an EDT agent operates by imagining that it simultaneously controls the decisions of all those copies. This works very elegantly as long as it optimizes with respect to its prior and (upon learning new information) just changes its beliefs about which people in the prior it can control the actions of. (I.e., when it sees a blue sky, it shouldn’t change its prior to exclude worlds without blue skies, but it should choose its next action A to optimize argmax_A EV_prior(U | “an agent like me who has seen a blue sky would take action A”).)
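To make the “optimize with respect to the prior, don’t re-optimize after updating it” point concrete, here’s a minimal toy sketch in Python. The setup and numbers are my own illustrative choices (a standard counterfactual mugging, as mentioned elsewhere in this thread), not anything from the comment above:

```python
# Toy counterfactual mugging with made-up numbers: Omega flips a fair coin.
# On heads, Omega asks you to pay $1. On tails, Omega pays you $100 iff
# agents like you pay when asked (on heads).

PRIOR = {"HEADS": 0.5, "TAILS": 0.5}

def payout(world: str, pays_when_asked: bool) -> float:
    if world == "HEADS":
        return -1.0 if pays_when_asked else 0.0
    return 100.0 if pays_when_asked else 0.0  # TAILS

def ev_under_prior(pays_when_asked: bool) -> float:
    # EV_prior(U | "an agent like me who is asked to pay would pay iff pays_when_asked")
    return sum(p * payout(w, pays_when_asked) for w, p in PRIOR.items())

def ev_after_updating_prior_on_heads(pays_when_asked: bool) -> float:
    # If the agent instead replaces its prior with "I'm definitely in the HEADS world",
    # paying can only ever look like a loss.
    return payout("HEADS", pays_when_asked)

print(max([True, False], key=ev_under_prior))                    # True: pay (EV 49.5 vs 0)
print(max([True, False], key=ev_after_updating_prior_on_heads))  # False: refuse (EV 0 vs -1)
```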
As described here, EDT agents will act in very strange ways if they also update their prior upon observing evidence. As described here, EDT agents will act similarly strangely if they update their logical priors upon deriving a proof. (Though the logical situation is somewhat less bad than the empirical situation.) These don’t seem like esoteric failures where it’s plausible that the hypothetical isn’t fair, or where the dynamic inconsistency is plausibly explained as being caused by changing values, or where there’s a fundamental tradeoff between the harms and value of new information. The errors happen in very mundane situations, and to me they look like dumb and surely-avoidable accounting errors.
Unfortunately, the most obvious solution to the problem is something like “stick with your prior and never change it”. And that’s not really available as a solution to a bounded agent like me.
I don’t have a complete and coherent prior of the world. I can procedurally generate beliefs when asked about them, and you could try to construct a complete set of beliefs (perhaps to be used as priors) out of this (e.g. you could say that I “currently believe X with probability p” if I would respond with probability p upon being asked about X and given 1 minute to think). But any such set of beliefs would be very contradictory and incoherent. And I suspect that EDT might not look as good if the prior it starts with is very contradictory and incoherent.[1]
This creates a bit of a puzzle. As someone who is sympathetic to EDT, I have arguments on the one hand that I shouldn’t update my prior on the basis of observations or proofs. (Indeed — arguments that suggest that such updates would lead to seemingly dumb and surely-avoidable accounting error.) But on the other hand, I need to do some sort of reasoning or learning to construct (or refine into coherence) my prior in the first place. I don’t currently know how to do this latter thing while avoiding the former thing.
Perhaps the most elegant thing would be to have an account of some particular kind of reasoning or learning that could be used to construct/refine priors, where we’d be able to show that it doesn’t run into the same issues as the ones we run into when we modify our beliefs in response to observations like this or in response to proofs like this. Or maybe it’s EDT that needs to change, or the idea of making decisions based on priors.
Misc points:
I think open-minded updatelessness is an attempt at something like this, where “awareness growth” is a separate type of operation from normal updating, and only awareness growth is allowed to modify the prior. I find it hard to evaluate because I don’t know how awareness growth is supposed to mechanically work.
I don’t think FDT obviously does better than EDT here.
You might hope that the question of “what’s the prior?” might turn out to not be so important, as long as we eventually receive enough evidence about what subsets of universes we have the most impact in. However, it looks to me like the prior may be extremely important, because it seems plausible to me that EDT recommends doing ECL from the perspective of the prior. If so, what values we benefit in the future may be primarily determined by their frequency in our prior.
I’m not sure though. Maybe there’s some minimum level of coherence that is sufficient to motivate reasoning from an ex-ante perspective. This report from Martin Soto tries to construct somewhat coherent and complete priors from logical inductors run for a finite amount of time, and then do updatelessness on the basis of them. That seems like a promising place to look for further insights.
I do think there’s a sense in which CDT behavior is evolutionarily selected for in environments where agents can’t see each others’ decision theories.
I don’t see this as a big problem with UDT. If UDT agents want to be evolutionarily fit relative to other agents in the environment, then I think they could adopt CDT behavior and do just as well as CDT agents.
It’s just that, due to the virtue of their decision theory (according to themselves), they have the option of giving up evolutionary fitness in exchange for higher utility in the short run. If they care more about short-run utility than evolutionary fitness, then perhaps they take the deal.
I don’t think this option is a strike against UDT. In any situation where agents care about X but we’re scoring them on Y, there will be scenarios where their Y-score gets hurt if we give them tools for achieving X that trade off against Y.
If there’s a population of mostly TDT/UDT agents and few CDT agents (and nobody knows who the CDT agents are) and they’re randomly paired up to play one-shot PD, then the CDT agents do better. What does this imply?
Maybe you’d get the same effect if you had 100% UDT agents but with 99% being in blue rooms and 1% being in red rooms. The ones in red rooms would reason that they could defect against the ones in blue rooms because they are in a relevantly different situation due to being in a minority that can easily coordinate defection against the majority. (With the majority still being motivated to cooperate even if they are only correlated with each other.) If so, there’s a sense in which the CDT agents aren’t benefiting any more than they would if they were UDT agents who got a CDT sticker.
(Note that the red room / blue room thing doesn’t fundamentally break correlations here. Two UDT agents who are playing a symmetric game against each other, when one is in a blue room and one is in a red room, would still be able to cooperate. The thing that breaks the correlation is that the people in the red room are in an easily identifiable minority in a game where a minority can benefit from defection. Which isn’t true in symmetric PD.)
This raises the question of what would happen if all the UDT agents were given different serial numbers. Are there some serial numbers that could defect without providing negative evidence about the other UDT agents?
Hm, in order for this to work in practice, the UDT agents would have to have their serial number or room-color assignment already be present in their prior. If it’s information they receive later on, they should probably be updateless about it and just cooperate even if they’re in the minority.
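As a sanity check on the red-room/blue-room intuition, here’s a small sketch with made-up one-shot prisoner’s-dilemma payoffs and a 1% red-room minority (all numbers are purely illustrative):

```python
# Standard one-shot PD payoffs (temptation, reward, punishment, sucker) and a
# 99%/1% blue/red room split; agents are paired uniformly at random.
T, R, P, S = 5.0, 3.0, 1.0, 0.0
FRAC_BLUE, FRAC_RED = 0.99, 0.01

def expected_payoff(my_action: str, blue_action: str, red_action: str) -> float:
    """My expected payoff against a random opponent, given the (correlated)
    actions of the blue-room and red-room populations."""
    def vs(opp: str) -> float:
        if my_action == "C":
            return R if opp == "C" else S
        return T if opp == "C" else P
    return FRAC_BLUE * vs(blue_action) + FRAC_RED * vs(red_action)

# A red-room agent (correlated with the other reds) comparing policies:
print(expected_payoff("D", "C", "D"))  # 4.96: all reds defect while blues cooperate
print(expected_payoff("C", "C", "C"))  # 3.00: everyone cooperates -> reds prefer defecting
# A blue-room agent (correlated with the other 99%) still prefers cooperating:
print(expected_payoff("C", "C", "D"))  # 2.97: blues cooperate, reds defect
print(expected_payoff("D", "D", "D"))  # 1.00: everyone defects
```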
I think this is a strong argument that EDT agents shouldn’t do bayesian updates on empirical observations. I thought that it might still be ok to change your mind on the basis of logical arguments and reasoning (not empirical data or observations). But I think a very similar argument bites against that.
Example:
There are 2 mathematical propositions, Y1 and Y2, each of which you think has an independent 50% probability of being true.
The proposition X = “Y1 and Y2”.
Presumably you assign 25% to X being true.
Let’s say you try to prove Y1 to be true, and succeed. You don’t have time to prove Y2. Naively, you’d now assign 50% to X being true, and be willing to bet at 1:1 odds.
However, let’s say that there are many copies of you across the universe, and equally many of them tried to prove Y1 as tried to prove Y2. For simplicity, let’s say everyone who tried to prove a true statement succeeded, and no one had time to attempt to prove more than one statement.
Given an opportunity to bet on X being true, and thinking about your odds, you reason:
If X is true, then Y2 will be true (in addition to Y1).
So if X is true, and I bet on X, then everyone will bet on X and everyone will win. (Assuming that someone who proved Y2 is relevantly in the same position as me, so that my choosing to bet provides strong evidence that they will bet.)
If X is false, then Y2 will be false.
So if X is false, and everyone in my position bets on X, then in expectation only half as many people bet, and they lose. (Namely the ones who proved Y1; the ones who attempted Y2 failed to prove it.)
So the stakes are twice as high if X is true than false.
Since I assign X a 50% chance (or 1:1) of being true, I will bet on (1:1) * (2:1) = (2:1) odds that X is true. I.e., from this perspective, the EV calculation becomes:
EV(bet on 2:1 odds) = 50% [that X is true] * 2 [people who win if X is true] * 1 [payout if X is true] + 50% [that X is false] * 1 [people who lose if X is false] * (-2) [payout if X is false] = 0.
This strategy will lose money in expectation.
From the ex-ante perspective, there are 4 equiprobable worlds where (Y1,Y2) have different truth values. In 1 of them, neither is true; in 2 of them, exactly 1 is true; and in 1 of them, both are true. From the ex-ante perspective, there are 2 people who prove their statement when X is false, and 2 people who prove their statement when X is true. If they all bet at 2:1 odds that X is true, they’ll lose money in expectation.
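To double-check the arithmetic, here’s a short enumeration of the four equiprobable (Y1, Y2) worlds, assuming (as above) that half the copies attempt Y1 and half attempt Y2, and that everyone who proves their statement then bets on X at 2:1 odds:

```python
from itertools import product

# Win $1 if X = "Y1 and Y2" is true, lose $2 if it's false (the 2:1 odds above).
WIN, LOSE = 1.0, -2.0

ev_per_copy = 0.0
for y1, y2 in product([True, False], repeat=2):   # 4 equiprobable worlds
    x = y1 and y2
    betting_fraction = 0.5 * y1 + 0.5 * y2        # copies who proved their statement and bet
    ev_per_copy += 0.25 * betting_fraction * (WIN if x else LOSE)

print(ev_per_copy)  # -0.25: averaged over all copies, the bet loses money ex ante
```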
One difference from the empirical case in the post above is that you need to perceive yourself as correlated with people who proved a different statement than you did.
Edit 2026-04-20:
I significantly simplified the example above.
I want to flag that I think the argument against logical updates is somewhat weaker than the argument against empirical updates. In particular, this even more unappealing argument doesn’t apply to the logical case as far as I can tell. (And the toy example above — a conjunction of logical statements where you’ve proven one — is more rare than unreliable empirical evidence of logical facts, as in the calculator example above.)
Edit 2026-05-05: Dutch book variant:
Before proving, the bookie offers:
If X is true → you pay $1
If [the statement you’re about to prove] is true and X is false → you get $1.1
Then if you fail to prove your statement, revealing it to be false, no pay-out happens.
If you do prove your statement, the bookie offers:
If X is true → you get $0.7.
If [the statement you just proved] is true and X is false → you pay $1.2.
(2nd offer is accepted because the EDT agent perceives the stakes to be twice as high if X is true.)
Overall payout:
[your statement] is false → $0
X is true → -$0.3
[your statement] is true and X is false → -$0.1
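For concreteness, here’s a quick enumeration of the cases above (with “your statement” taken to be Y1 and X = “Y1 and Y2”), just to verify that the combined payout never comes out positive:

```python
def total_payout(y1: bool, y2: bool) -> float:
    """Combined payout of the two bets, with 'your statement' = Y1 and X = 'Y1 and Y2'."""
    x = y1 and y2
    # Bet 1, accepted before proving: pay $1 if X; get $1.1 if Y1 is true and X is false.
    payout = -1.0 if x else (1.1 if y1 else 0.0)
    # Bet 2 is only offered (and accepted) if you actually prove Y1:
    if y1:
        payout += 0.7 if x else -1.2   # get $0.7 if X; pay $1.2 if Y1 true and X false
    return payout

for y1 in (True, False):
    for y2 in (True, False):
        print(y1, y2, round(total_payout(y1, y2), 2))
# (True, True)  -> -0.3  (X is true)
# (True, False) -> -0.1  (your statement is true, X is false)
# (False, *)    ->  0.0  (your statement is false; bet 1 pays nothing, bet 2 never happens)
```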
seems very plausible that our long-term civilizational trajectory is significantly affected by which type of AGI gets built first
I of course agree, but I’d think this would mostly be an issue of capabilities or goodness of our future society, since there’s not much external to our society that’s getting worse as a result of the transition. Anyway, that seems like maybe one of those definitional issues. I think you’re probably right that there are some possible changes that aren’t well characterized as being about the capabilities or goodness of our society, so an improvement in those dimensions isn’t strictly speaking sufficient for a pause to not have been valuable.
I care more about my claim that started with “I just think there’s a decent chance...”. (Which is importantly only asserting a decent chance, not saying that there aren’t plausible ways it could be false.)
The aim of a pause would be to plan out the transition better, or make humans smarter/wiser so they can navigate the transition better, so that we end up handing over remaining problems to a counterfactually more capable society. In other words, the bar shouldn’t be “more capable than us” but a society that could realistically be achieved with a pause
If the society is “more capable than us” in some average sense, where we still have certain advantages over them, then I agree that we could still contribute things.
If the society is “more capable (and good) than us” in all the important ways, then they’d also be better at making themselves smarter/wiser than we would have been, and better at handling the transition, so further pauses really wouldn’t have contributed much.
Idk, I don’t particularly want to argue about definitions here. I just think there’s a decent chance that I’ll look back after the singularity and be like “yep, the sloppy transition sure meant that we took on a bunch of ex-ante risk, but since we got lucky, extra pause time wouldn’t have helped vis-a-vis the long-run lock-in issues. Anything they could have done to help is stuff we can do better now.” (And/or: Marginal pause time may have been good or bad via various values or power changes, but it wouldn’t have systematically led to improvements from everyone’s perspective by e.g. enabling additional intellectual work, because it turns out it was fine to defer the relevant intellectual work until later.)
Gotcha.
FWIW, on my views, work to prevent scheming looks pretty clearly great. Pausing to wait for a solution to scheming doesn’t seem super likely, and going from [scheming models widely deployed] –> [non-scheming models widely deployed] seems significantly more valuable than going from [non-scheming models widely deployed] –> [temporary pause to solve scheming].
including that the benefits of solving scheming are limited by other safety problems
A lot of the listed topics here are problems that we could have plenty of time to work on after the singularity. I’m sympathetic to arguments that bad things might get locked-in, but I don’t really think the arguments for this have a disjunctive nature where we’re very likely to run into at least one type of bad lock-in. There’s just a decent chance that we do an ok job of developing AIs and handing over to a society that’s more capable than us at dealing with these issues (not a super high bar), in which case a pause wouldn’t add much. (The arguments that make me feel most pessimistic about the future are arguments that humans might just not be motivated to do good things — but it’s not clear why pauses would help much with that issue.)
In the particular case of the inconsistencies highlighted by transparent Newcomb, I think that it’s unusually clear that you want to avoid your values changing—because your current values are a reasonable compromise amongst the different possible future versions of yourself, and maintaining those values is a way to implement important win-win trades across those versions.
I slightly disagree with this. In cases where there are win-win trades, different future versions of yourself are probably similar enough that they can get these win-win trades via correlated decision-making. (If they follow EDT.)
If you stop your values from changing, I think the main additional benefit you get is that you (i) change which of your future selves are more or less likely to exist in the first place (which it’s not obvious that they themselves will care about; c.f. my other comment), and (ii) impose one-way utility transfers from versions of you who have good helping opportunities to versions of you who have good being-helped opportunities, according to your own view about how you want to do interpersonal utility comparisons between your future selves (which will predictably benefit some of them and harm others). [1]
Overall this still seems fine and good to me. But I think win-win trades are a small fraction of the benefits.
Or maybe this is also just about changing which future versions of yourselves exist, since any difference in your present actions will arguably lead to somewhat different memories in future versions of yourself.
Once upon a time they cared about all of the possible versions of themselves, weighted by their probability. But once they see the empty big box, they cease to care at all about the versions of themselves who saw a full box. They end up in conflict with other very similar copies of themselves, and from the perspective of the human at the beginning of the process the whole thing is a great tragedy.
Probably just an unimportant nitpick, but the “versions who saw a full box” shouldn’t actually expect to see a brighter future if “the version who saw an empty box” chooses to 1-box. The only thing that happens is that the “versions who saw a full box” become more likely to exist. So I think you have to either say:
This is a conflict where a significant portion of the “benefit” at stake is getting to exist in the first place
This isn’t a conflict between the versions who saw an empty box and the versions who saw a full box. Instead, it’s a conflict between the “versions who saw an empty or full box” and the past “version who hadn’t yet looked at the boxes”. (The “version who hadn’t yet looked at the boxes” really would expect a brighter future if the “versions who saw an empty or full box” choose to 1-box.)
For the purposes of this argument to work, it’s important that the legible problems are so legible that a lack of solutions would prevent deployment.
When previously asked which problems were in this category, you said:
The most legible problem (in terms of actually gating deployment) is probably wokeness for xAI, and things like not expressing an explicit desire to cause human extinction, not helping with terrorism (like building bioweapons) on demand, etc., for most AI companies
Now, I would actually say that this list overestimates AI companies’ willingness to gate deployment on unsolved problems. There have been many woke versions of Grok, suggesting they weren’t gating deployments on that. I think most current models can be jailbroken into helping with terrorism (they’re just not smart enough to be very helpful yet). It remains to be seen whether companies will hold off on releasing models that could help a lot with terrorism. I’m not so sure they will.
But even if we took this on face value: It doesn’t seem like avoiding work on these mentioned problems would mean restricting the portfolio a lot. When referring to “playing a portfolio of all the different desperate hard strategies in the hopes that one of them works”, I think that’s mostly about solving problems that wouldn’t prevent deployment if they were unsolved, or gathering evidence for such illegible problems. (Centrally: The problem of scheming models taking over the world, which is not one that I expect companies to wait for a solution on absent further evidence that it’s a problem.)
Locally trying to clear up one misunderstanding.
In exchange, we would have won a couple more years of timeline, which would have been pointless, because timeline isn’t measured in distance from the year 1 AD, it’s measured in distance between some level of woken-up-ness and some point of danger, and the woken-up-ness would be pushed forward at the same rate the danger was.
I feel like we both know this is a strawman. The key thing at least in recent years that Rob, Eliezer and Nate have been arguing for is the political machinery necessary to actually control how fast you are building ASI, and the ability to stop for many years at a time, and to only proceed when risks actually seem handled.
If anything, Eliezer, Nate and Robby have been actively trying to move political will from “a pause right now” to “the machinery for a genuine stop”.
I think Scott’s “couple more years” wasn’t referring to a belief that EA could have successfully advocated for a couple of year pause, but rather referring to the change in timeline you’d have gotten if safety-sympathetic people refused to work on stuff that increases the pace of capabilities progress.
IIRC someone I know tried to look into this at some point (at least the physics). I’ll see if I can learn what they found. (Edit: It was Toby, who’s now left a comment here.)
LLM output doesn’t seem nearly quantitative enough. With some number of 9s, it surely doesn’t give you a meaningful advantage to go at 0.99...99c rather than merely 0.99...9c — especially when you factor in that it probably takes time to convert energy/mass into the additional speed (most mass will be in between your origin and the farthest reaches of the universe, and by the time some payload has decelerated and started harvesting significant energy from the middle mass, the frontier of the colonization wave will likely already be quite distant). I share Ryan’s guess that you can get close enough to optimum without burning a large fraction of all the energy in the universe. (That’s a lot of energy!)
So my current impression is basically that you’re optimistic about something similar to this:
And that your argument here is an argument for why it won’t be possible to make double-update arguments about this more narrow type of beliefs or reasoning. (Because it’s so fundamental that if it differs between agents, they can’t be correlated with each other.)
That argument seems plausible but not very robust to me at the moment. I mostly don’t have a great sense of what the “some beliefs to get [otherwise updateless] EDT off the ground”-paradigm looks like. Maybe I’ll look into it more.