Previously “Lanrian” on here. Research analyst at Redwood Research. Views are my own.
Feel free to DM me, email me at [my last name].[my first name]@gmail.com or send something anonymously to https://www.admonymous.co/lukas-finnveden
Here’s my current picture of EDT and UDT.
In situations where EDT agents have many copies or near-copies, an EDT agent operates by imagining that it simultaneously controls the decisions of all those copies. This works very elegantly as long as it optimizes with respect to its prior and (upon learning new information) just changes its beliefs about which people in the prior it can control the actions of. (I.e., when it sees a blue sky, it shouldn’t change its prior to exclude worlds without blue skies, but it should make its next decision via argmax_A E_prior[U | “an agent like me who has seen a blue sky takes action A”].)
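Here’s a minimal sketch of that decision rule. (The worlds, observations, and utility numbers are all my own illustration; the point is just the shape of the computation: the prior is never renormalized, and worlds where no agent sees your observation contribute a constant that drops out of the argmax.)

```python
# Toy EDT decision rule: maximize prior-expected utility conditional on the
# policy "agents who saw this observation take action A". The prior itself
# is never updated; the observation only changes which worlds the policy
# variable reaches.

# Each world: (prior probability, observation seen there, {action: utility}).
WORLDS = [
    (0.5, "blue sky", {"A": 10, "B": 0}),
    (0.5, "grey sky", {"A": 0, "B": 5}),
]

def edt_choose(observation):
    # Worlds with a different observation are unaffected by this choice,
    # so they add the same constant to every action's EV and can be dropped.
    best_action, best_ev = None, float("-inf")
    for action in ("A", "B"):
        ev = sum(p * utils[action]
                 for p, obs, utils in WORLDS
                 if obs == observation)
        if ev > best_ev:
            best_action, best_ev = action, ev
    return best_action

print(edt_choose("blue sky"))  # "A"
```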
As described here, EDT agents will act in very strange ways if they also update their prior upon observing evidence. As described here, EDT agents will act similarly strangely if they update their logical priors upon deriving a proof. (Though the logical situation is somewhat less bad than the empirical situation.) These don’t seem like esoteric failures where it’s plausible that the hypothetical isn’t fair, or where the dynamic inconsistency is plausibly explained as being caused by changing values, or where there’s a fundamental tradeoff between the harms and value of new information. The errors happen in very mundane situations, and to me they look like dumb and surely-avoidable accounting errors.
Unfortunately, the most obvious solution to the problem is something like “stick with your prior and never change it”. And that’s not really available as a solution to a bounded agent like me.
I don’t have a complete and coherent prior of the world. I can procedurally generate beliefs when asked about them, and you could try to construct a complete set of beliefs (perhaps to be used as priors) out of this (e.g. you could say that I “currently believe X with probability p” if I would respond with probability p upon being asked about X and given 1 minute to think). But any such set of beliefs would be very contradictory and incoherent. And I suspect that EDT might not look as good if the prior it starts with is very contradictory and incoherent.[1]
This creates a bit of a puzzle. As someone who is sympathetic to EDT, I have arguments on the one hand that I shouldn’t update my prior on the basis of observations or proofs. (Indeed — arguments that suggest that such updates would lead to seemingly dumb and surely-avoidable accounting error.) But on the other hand, I need to do some sort of reasoning or learning to construct (or refine into coherence) my prior in the first place. I don’t currently know how to do this latter thing while avoiding the former thing.
Perhaps the most elegant thing would be to have an account of some particular kind of reasoning or learning that could be used to construct/refine priors, where we’d be able to show that it doesn’t run into the same issues as the ones we run into when we modify our beliefs in response to observations like this or in response to proofs like this. Or maybe it’s EDT that needs to change, or the idea of making decisions based on priors.
Misc points:
I think open-minded updatelessness is an attempt at something like this, where “awareness growth” is a separate type of operation from normal updating, and only awareness growth is allowed to modify the prior. I find it hard to evaluate because I don’t know how awareness growth is supposed to mechanically work.
I don’t think FDT obviously does better than EDT here.
You might hope that the question of “what’s the prior?” might turn out to not be so important, as long as we eventually receive enough evidence about what subsets of universes we have the most impact in. However, it looks to me like the prior may be extremely important, because it seems plausible to me that EDT recommends doing ECL from the perspective of the prior. If so, what values we benefit in the future may be primarily determined by their frequency in our prior.
I’m not sure though. Maybe there’s some minimum level of coherence that is sufficient to motivate reasoning from an ex-ante perspective. This report from Martin Soto tries to construct somewhat coherent and complete priors from logical inductors run for a finite amount of time, and then do updatelessness on the basis of them. That seems like a promising place to look for further insights.
I do think there’s a sense in which CDT behavior is evolutionarily selected for in environments where agents can’t see each others’ decision theories.
I don’t see this as a big problem with UDT. If UDT wants to be evolutionarily fit relative to other agents in the environment, then I think they could adopt CDT behavior and do just as well as CDT.
It’s just that, due to the virtue of their decision theory (according to themselves), they have the option of giving up evolutionary fitness in exchange for higher utility in the short run. If they care more about short-run utility than evolutionary fitness, then perhaps they take the deal.
I don’t think this option is a strike against UDT. In any situation where agents care about X but we’re scoring them on Y, there will be scenarios where their Y-score gets hurt if we give them tools for achieving X which trades off against Y.
If there’s a population of mostly TDT/UDT agents and few CDT agents (and nobody knows who the CDT agents are) and they’re randomly paired up to play one-shot PD, then the CDT agents do better. What does this imply?
Maybe you’d get the same effect if you had 100% UDT agents but with 99% being in blue rooms and 1% being in red rooms. The ones in red rooms would reason that they could defect against the ones in blue rooms because they are in a relevantly different situation, due to being in a minority that can easily coordinate defection against the majority. (With the majority still being motivated to cooperate even if they are only correlated with each other.) If so, there’s a sense in which the CDT agents aren’t benefiting any more than they would if they were UDT agents who got a CDT sticker.
(Note that the red room / blue room thing doesn’t fundamentally break correlations here. Two UDT agents who are playing a symmetric game against each other, when one is in a blue room and one is in a red room, would still be able to cooperate. The thing that breaks the correlation is that the people in the red room are in an easily identifiable minority in a game where a minority can benefit from defection. Which isn’t true in symmetric PD.)
This would raise the question of what would happen if all the UDT agents were given different serial numbers. Are there some serial numbers that could defect without this providing negative evidence about what the other UDT agents do?
Hm, in order for this to work in practice, the UDT agents would have to have their serial number or room-color assignment already be present in their prior. If it’s information they receive later on, they should probably be updateless about it and just cooperate even if they’re in the minority.
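The claim that a rare, unidentifiable CDT minority outscores the UDT majority in randomly paired one-shot PD can be made concrete with a rough sketch. (The payoff numbers are my own choice of standard PD values; the behavioral assumptions are that the UDT agents cooperate, since their policy is driven by the majority UDT-vs-UDT case, while the CDT agents defect.)

```python
# Hypothetical one-shot PD payoffs: T (defect vs cooperate), R (mutual
# cooperation), P (mutual defection), S (cooperate vs defect).
T, R, P, S = 3.0, 2.0, 1.0, 0.0

def expected_payoffs(cdt_fraction):
    """Expected per-round payoff under uniformly random pairing.
    UDT agents cooperate; CDT agents defect."""
    q = cdt_fraction
    udt = (1 - q) * R + q * S   # paired with a cooperator or a defector
    cdt = (1 - q) * T + q * P   # defects either way
    return udt, cdt

udt_score, cdt_score = expected_payoffs(0.01)
print(udt_score, cdt_score)  # 1.98 vs 2.98: the CDT minority does better
```

Of course, per the room-color argument above, this advantage may be better attributed to being an identifiable minority than to the decision theory itself.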
I think this is a strong argument that EDT agents shouldn’t do bayesian updates on empirical observations. I thought that it might still be ok to change your mind on the basis of logical arguments and reasoning (not empirical data or observations). But I think a very similar argument bites against that.
Example:
There are 2 mathematical propositions, Y1 and Y2, each of which you think has an independent 50% probability of being true.
The proposition X = “Y1 and Y2”.
Presumably you assign 25% to X being true.
Let’s say you try to prove Y1 to be true, and succeed. You don’t have time to prove Y2. Naively, you’d now expect to assign 50% to X being true, and be willing to bet at 1:1 odds.
However, let’s say that there are many copies of you across the universe, and that equally many of them tried to prove Y1 as tried to prove Y2. For simplicity, let’s say everyone who tried to prove a true statement succeeded, and no one had time to attempt more than one statement.
Given an opportunity to bet on X being true, and thinking about your odds, you reason:
If X is true, then Y2 will be true (in addition to Y1).
So if X is true, and I bet on X, then everyone will bet on X and everyone will win. (Assuming that someone who proved Y2 is relevantly in the same position as me, so that my choosing to bet provides strong evidence that they will bet.)
If X is false, then Y2 will be false.
So if X is false, everyone in my position who bets on X will lose, but that’s only half as many people as would win if X were true: just the ones who proved Y1, since no one proved Y2.
So the stakes are twice as high if X is true than false.
Since I assign X a 50% chance (or 1:1) of being true, I will bet on (1:1) * (2:1) = (2:1) odds that X is true. I.e., from this perspective, the EV calculation becomes:
EV(bet on 2:1 odds) = 50% [that X is true] * 2 [people who win if X is true] * 1 [payout if X is true] + 50% [that X is false] * 1 [people who lose if X is false] * (-2) [payout if X is false] = 0.
This strategy will lose money in expectation.
From the ex-ante perspective, there are 4 equiprobable worlds where (Y1, Y2) have different truth values. In 1 of them, neither is true; in 2 of them, exactly 1 is true; and in 1 of them, both are true. From the ex-ante perspective, there are 2 people who prove their statement true when X is false, and 2 people who prove their statement true when X is true. If they all bet at 2:1 odds that X is true, they’ll lose money in expectation.
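The ex-ante loss can be checked by exact enumeration over the 4 worlds. (A sketch of my own; the stakes are normalized so winning a bet pays +1 and losing pays -2, matching 2:1 odds on X.)

```python
# Enumerate the 4 equiprobable worlds. Y1, Y2 are independent 50/50;
# everyone who attempts a true statement proves it; all provers then
# bet on X = (Y1 and Y2) at 2:1 odds (win +1, lose -2 per bet).
from itertools import product

total_ev = 0.0
for y1, y2 in product([True, False], repeat=2):  # 4 equiprobable worlds
    p_world = 0.25
    provers = int(y1) + int(y2)      # one group attempted Y1, one attempted Y2
    x = y1 and y2
    payout_per_bet = 1 if x else -2  # 2:1 odds on X being true
    total_ev += p_world * provers * payout_per_bet

print(total_ev)  # -0.5: betting at these odds loses money ex ante
```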
One difference from the empirical case in the post above is that you need to perceive yourself as correlated with people who proved a different statement than you did.
Edit 2026-04-20:
I significantly simplified the example above.
I want to flag that I think the argument against logical updates is somewhat weaker than the argument against empirical updates. In particular, this even more unappealing argument doesn’t apply to the logical case as far as I can tell. (And the toy example above — disjunction of logical statements where you’ve proven one — is more rare than unreliable empirical evidence of logical facts, as in the calculator example above.)
seems very plausible that our long-term civilizational trajectory is significantly affected by which type of AGI gets built first
I of course agree, but I’d think this would mostly be an issue of capabilities or goodness of our future society, since there’s not much external to our society that’s getting worse as a result of the transition. Anyway, that seems like maybe one of those definitional issues. I think you’re probably right that there are some possible changes that aren’t well characterized as being about the capabilities or goodness of our society, so an improvement in those dimensions isn’t strictly speaking sufficient for a pause to not have been valuable.
I care more about my claim that started with “I just think there’s a decent chance...”. (Which is importantly only asserting a decent chance, not saying that there aren’t plausible ways it could be false.)
The aim of a pause would be to plan out the transition better, or make humans smarter/wiser so they can navigate the transition better, so that we end up handing over remaining problems to a counterfactually more capable society. In other words, the bar shouldn’t be “more capable than us” but a society that could realistically be achieved with a pause
If the society is “more capable than us” in some average sense, where we still have certain advantages over them, then I agree that we could still contribute things.
If the society is “more capable (and good) than us” in all the important ways, then they’d also be better at making themselves smarter/wise than we would have been, and better at handling the transition, so further pauses really wouldn’t have contributed much.
Idk, I don’t particularly want to argue about definitions here. I just think there’s a decent chance that I’ll look back after the singularity and be like “yep, the sloppy transition sure meant that we took on a bunch of ex-ante risk, but since we got lucky, extra pause time wouldn’t have helped vis-a-vis the long-run lock-in issues. Anything they could have done to help is stuff we can do better now.” (And/or: Marginal pause time may have been good or bad via various values or power changes, but it wouldn’t have systematically led to improvements from everyone’s perspective by e.g. enabling additional intellectual work, because it turns out it was fine to defer the relevant intellectual work until later.)
Gotcha.
FWIW, on my views, work to prevent scheming looks pretty clearly great. Pausing to wait for a solution to scheming doesn’t seem super likely, and going from [scheming models widely deployed] –> [non-scheming models widely deployed] seems significantly more valuable than going from [non-scheming models widely deployed] –> [temporary pause to solve scheming].
including that the benefits of solving scheming are limited by other safety problems
A lot of the listed topics here are problems that we could have plenty of time to work on after the singularity. I’m sympathetic to arguments that bad things might get locked-in, but I don’t really think the arguments for this have a disjunctive nature where we’re very likely to run into at least one type of bad lock-in. There’s just a decent chance that we do an ok job of developing AIs and handing over to a society that’s more capable than us at dealing with these issues (not a super high bar), in which case a pause wouldn’t add much. (The arguments that make me feel most pessimistic about the future are arguments that humans might just not be motivated to do good things — but it’s not clear why pauses would help much with that issue.)
In the particular case of the inconsistencies highlighted by transparent Newcomb, I think that it’s unusually clear that you want to avoid your values changing—because your current values are a reasonable compromise amongst the different possible future versions of yourself, and maintaining those values is a way to implement important win-win trades across those versions.
I slightly disagree with this. In cases where there are win-win trades, different future versions of yourself are probably similar enough that they can get these win-win trades via correlated decision-making. (If they follow EDT.)
If you stop your values from changing, I think the main additional benefit you get is that you (i) change which of your future selves are more or less likely to exist in the first place (which it’s not obvious that they themselves will care about; c.f. my other comment), and (ii) impose one-way utility transfers from versions of you who have good helping opportunities to versions of yourselves who have good being-helped opportunities, according to your own view about how you want to do interpersonal utility comparisons between your future selves (which will predictably benefit some of them and harm some other of them). [1]
Overall this still seems fine and good to me. But I think win-win trades are a small fraction of the benefits.
Or maybe this is also just about changing which future versions of yourselves exist, since any difference in your present actions will arguably lead to somewhat different memories in future versions of yourself.
Once upon a time they cared about all of the possible versions of themselves, weighted by their probability. But once they see the empty big box, they cease to care at all about the versions of themselves who saw a full box. They end up in conflict with other very similar copies of themselves, and from the perspective of the human at the beginning of the process the whole thing is a great tragedy.
Probably just an unimportant nitpick, but the “versions who saw a full box” shouldn’t actually expect to see a brighter future if “the version who saw an empty box” chooses to 1-box. The only thing that happens is that the “versions who saw a full box” become more likely to exist. So I think you have to either say:
This is a conflict where a significant portion of the “benefit” at stake is getting to exist in the first place
This isn’t a conflict between the versions who saw an empty box and the versions who saw a full box. Instead, it’s a conflict between the “versions who saw an empty or full box” and the past “version who hadn’t yet looked at the boxes”. (The “version who hadn’t yet looked at the boxes” really would expect a brighter future if the “versions who saw an empty or full box” choose to 1-box.)
For the purposes of this argument to work, it’s important that the legible problems are so legible that a lack of solutions would prevent deployment.
When previously asked which problems were in this category, you said:
The most legible problem (in terms of actually gating deployment) is probably wokeness for xAI, and things like not expressing an explicit desire to cause human extinction, not helping with terrorism (like building bioweapons) on demand, etc., for most AI companies
Now, I would actually say that this list overestimates AI companies’ willingness to gate deployment on unsolved problems. There’s been many woke versions of grok, suggesting they weren’t gating deployments on that. I think most current models can be jailbroken into helping with terrorism (they’re just not smart enough to be very helpful yet). It remains to be seen whether companies will hold off on releasing models that could help a lot with terrorism. I’m not so sure they will.
But even if we took this on face value: It doesn’t seem like avoiding work on these mentioned problems would mean restricting the portfolio a lot. When referring to “playing a portfolio of all the different desperate hard strategies in the hopes that one of them works”, I think that’s mostly about solving problems that wouldn’t prevent deployment if they were unsolved, or gathering evidence for such illegible problems. (Centrally: The problem of scheming models taking over the world, which is not one that I expect companies to wait for a solution on absent further evidence that it’s a problem.)
Locally trying to clear up one misunderstanding.
In exchange, we would have won a couple more years of timeline, which would have been pointless, because timeline isn’t measured in distance from the year 1 AD, it’s measured in distance between some level of woken-up-ness and some point of danger, and the woken-up-ness would be pushed forward at the same rate the danger was.
I feel like we both know this is a strawman. The key thing at least in recent years that Rob, Eliezer and Nate have been arguing for is the political machinery necessary to actually control how fast you are building ASI, and the ability to stop for many years at a time, and to only proceed when risks actually seem handled.
If anything, Eliezer, Nate and Robby have been actively trying to move political will from “a pause right now” to “the machinery for a genuine stop”.
I think Scott’s “couple more years” wasn’t referring to a belief that EA could have successfully advocated for a couple of year pause, but rather referring to the change in timeline you’d have gotten if safety-sympathetic people refused to work on stuff that increases the pace of capabilities progress.
IIRC someone I know tried to look into this at some point (at least the physics). I’ll see if I can learn what they found.
LLM output doesn’t seem nearly quantitative enough. With some number of 9s, it surely doesn’t give you a meaningful advantage to go at 0.99...99c rather than merely 0.99...9c, especially when you factor in that it probably takes time to convert energy/mass into the additional speed (most mass will be in between your origin and the farthest reaches of the universe, and by the time some payloads have decelerated and started harvesting significant energy from that intermediate mass, the frontier of the colonization wave will likely already be quite distant). I share Ryan’s guess that you can get close enough to optimum without burning a large fraction of all energy in the universe. (That’s a lot of energy!)
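Some rough numbers illustrate how bad the tradeoff gets as you add 9s. (Illustrative assumptions of mine: a 1-billion-light-year trip, arrival measured in rest-frame coordinate time, and energy cost measured as kinetic energy per unit rest mass, i.e. (γ − 1)mc².)

```python
# Compare two near-lightspeed cruise velocities: time saved vs energy cost.
import math

def gamma(v):
    """Lorentz factor for speed v given as a fraction of c."""
    return 1.0 / math.sqrt(1.0 - v * v)

d = 1e9  # trip distance in light-years
for v in (0.999, 0.9999):
    arrival = d / v      # years until arrival, rest-frame coordinate time
    ke = gamma(v) - 1.0  # kinetic energy in units of m*c^2
    print(f"v = {v}c: arrive after {arrival:,.0f} yr, KE ~ {ke:.1f} mc^2")

# Going from 0.999c to 0.9999c saves ~0.9 million years on a billion-
# light-year trip but roughly triples the kinetic energy per unit mass.
```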
It confuses me that it has significant mind-share among AI safety people, e.g. @ryan_greenblatt here, despite the world in general, and technological races in particular, obviously not being zero-sum.
FWIW, I find it useful to think about strategy stealing, and don’t think it has too much mindshare. Not really sure how productive it is to argue about that, though, because “too much or too little mindshare” seems hard to settle.
despite the world in general, and technological races in particular, obviously not being zero-sum
Just to respond to this in particular: Some situations are close to being zero-sum, and when they’re not, I think it’s often useful to explicitly track the reason why they’re not zero-sum and how that changes the dynamics.
My impression of people invoking strategy stealing is not that they’re actually assuming it holds without argument, but instead interested in specific reasons to believe it fails in a given situation, and (if they agree those reasons are real) often interested in quantifying how significant those reasons are. Ryan’s linked comment seems like an example of this.
Paul’s linked article talks about lots of ways that strategy stealing can fail, many of which aren’t downstream of violating unit-sum. (By my count, only 2 of them are about that.)
You say “even for consequentialists”, but iirc, non-consequentialism only really features in point 11, so that’s just one more.
Just to clarify that you’re not distilling the whole post but just providing an example for 1-2 of the issues.
They keep using that word ‘commitment.’ I will stop putting it in air quotes if they give clear communication that something really is a commitment that they cannot modify later as needed, with at least some slower and clear procedure required.
What sort of procedure do you have in mind?
Notably Holden thinks (last section) that the default procedure is slow and arduous:
What is the point of making commitments if you can revise them anytime?
I first started pushing for the move to RSP v3 in February 2025. It’s been a very long and effortful process, and “we can revise anytime” doesn’t seem remotely accurate as a description of the situation.
I’d like there to be some friction to revising our RSP, though at least somewhat less than there was for this update.
In my ideal world, revisions like this would get significant scrutiny, focused on the question: “Are these changes good on the merits?” But people would not start from a strong prior that the mere fact of loosening previous commitments is per se bad.
If the likely default process is already slow, maybe there could be a cheap win here where they (properly) commit to a process that is no more slow and arduous than the likely default, but where the process is known to the outside and can’t be circumvented.
Aren’t you worried that by presenting this “in between” approach without mentioning that it would still require some amount of luck, or equivalently still incur some amount of risk (of potentially catastrophic philosophical failure/error), it can be misleading for people who might read the post without themselves specializing in this area?
I don’t want to mislead people, so I guess it’s just a question about how people interpret posts like this. I suppose I would worry about this if I was presenting something framed like a decisive solution. But I’d think it’s pretty normal for posts to talk about a problem and present some promising-seeming interventions without thereby implying that the problem would be entirely solved if the interventions got carried out. (E.g.: If I read a post about climate change that suggested some interventions, I wouldn’t assume that those interventions would necessarily solve the whole problem.)
It also feels relevant that I don’t have a prescription for actions anyone could take to predictably achieve high justified confidence that the problem was solved. I don’t think that discovering a solution to metaphilosophy would be sufficient, because a big part of the problem is that people might not care about doing good philosophy even if a solution existed. I think that slower AI development (including a global pause) would probably be helpful on the current margin, for this risk, but I don’t think that a very long pause would get the risk down to very low levels. (There’s just a bunch of stuff that influences societal epistemics, and it’s hard to know whether it’s heading in a good or bad direction on the time-scale of decades, at present technology levels. And I expect societal epistemics to have a big influence on the risk here.)
I have noticed that even people who share my general concern don’t seem to like to frame it the way that I do, i.e., as a need to solve metaphilosophy. For example in this post you never mention “metaphilosophy” again or talk about trying to understand the nature of philosophy. I’m pretty curious why that is.
(By “need” I mean it seems to be the only way to achieve high justified confidence that the concern has been addressed, not that the world is certainly doomed if we don’t solve metaphilosophy. I can see various ways that we “get lucky” and the problem kind of solves itself.)
In between ” ‘we get lucky’ and the problem kind of solves itself” and “we solve metaphilosophy and achieve high justified confidence”, there’s “we do a bunch of things that we think will help on the margin without leading to high confidence that the problem gets solved, and partially as a result of our interventions and partially due to luck, things turn out fine”. That’s more what I’m aiming at. And this doesn’t require tackling or solving metaphilosophy directly (which seems really difficult!), which is probably why I don’t use the term that much.
Drake counteroffers ‘committed to under this policy’ but no, I think that’s wrong. I think the right word is ‘intending.’
I’d use “Anthropic’s policy is to do X”. I think that’s fine for statements about the future too, e.g. “Under RSP v3, Anthropic’s policy is to do X for future powerful models”.
I think publicly adopting a policy is more meaningful than stating an intention, but there’s no implication that policies can’t be changed.
More precisely: I think these failures probably enable Dutch books that CDT isn’t susceptible to.
So it’s not just that insufficient updatelessness fails to capture some potential value that you could have gotten if you were more updateless. It’s that it’s actively worse than CDT in some cases.
And that’s a large part of why I now feel more unhappy about EDT. I feel like I have a better sense of what my beliefs are than what my prior is. If EDT requires me to act according to my prior (and will lead to me making stupid decisions if I instead act according to changing beliefs) then I’m not sure exactly how to do that.