Previously “Lanrian” on here. Research analyst at Redwood Research. Views are my own.
Feel free to DM me, email me at [my last name].[my first name]@gmail.com or send something anonymously to https://www.admonymous.co/lukas-finnveden
This newer model is being continually trained on oodles of data from a huge base of customers; they have it do all sorts of tasks and it tries and sometimes fails and sometimes succeeds and is trained to succeed more often.
My sense is that this isn’t a big part of the story for how new models’ capabilities are being increased. Though I don’t think we know for sure either way.
Now many millions of people are basically treating it like a coworker and virtual assistant. People are giving it their passwords and such and letting it handle life admin tasks for them, help with shopping, etc. and of course quite a lot of code is being written by it.
This seems accurate for coders. Is it true for people who aren’t coders? It’s not really true for my job or life admin tasks (like I use the models a fair bit, but it’s more in chat-bot mode than in agent-mode / trusting the models to do a lot of stuff for me) but maybe it’s more true for others.
Cool work! Some big similarities to imitative generalization. (Which is great, since imitative generalization seems like a theoretically promising direction but underspecified and I don’t know of empirical work studying it.)
Looking at your first figure, it seems like the consistently biggest cost that Earth datacenters have that space datacenters don’t is “OpEx”.
What does this include and why can you avoid it in space?
Searching through the article for mentions of operational expenditures, I find one table saying OpEx includes “Energy, Staffing/Maintenance, Taxes” and a longer mention saying:
On the operational side, terrestrial centers face ongoing energy bills, staffing, maintenance, and taxes, whereas ODCs largely eliminate these recurring costs by generating solar power freely in orbit.
Going through these:
Energy is accounted for elsewhere in the figure, so presumably that isn’t included.
Staffing/maintenance: Why is this cheaper in space?
I can see why space deployments would realistically have to find ways to avoid normal types of staffing and maintenance, but that’s because it’d be extremely expensive to do them in space, which is a downside of space. I don’t yet see the upside. In other words: If there are ways to run a space datacenter that don’t require staff or maintenance, why can’t you use those same methods on Earth?
Taxes: What taxes are these that you avoid by going to space? (I don’t find anything on this when searching the article for “tax”.)
Wow I do find it funny how ChatGPT just decided that the term for “(” and “)” in Spanish was “—”. Maybe a sign that it’s not the best choice of model for this task.
Also probably you can get a decent sense of how LLM-ish it is by translating it back to English again with an LLM?
That doesn’t seem like an analogous case to me. In the case you describe, it’s ex-ante optimal to commit to betting at 9999:1 odds. In the original case described in the post, it’s ex-ante optimal to commit to betting at 99:1 odds. That’s a big important difference.
But that’s not what is going on here since mrcSSA does not mean “be updateless”/”use your prior in EU calculations”, it just means don’t anthropically update your prior based on how many observers are in different worlds.
I don’t think there’s any kind of empirical update that’s not an anthropic update. Anthropic theories are supposed to generalize and clarify how updates work in confusing cases where multiple observers exist.
mrcSSA only updates to exclude worlds where no one with your experiences exists. If all worlds under your consideration contain someone with your experiences, mrcSSA doesn’t update, so in that sense it does end up using its prior in EU calculations.
Maybe relevant: https://jsteinhardt.stat.berkeley.edu/blog/film-study
So my current impression is basically that you’re optimistic about something similar to this:
Perhaps the most elegant thing would be to have an account of some particular kind of reasoning or learning that could be used to construct/refine priors, where we’d be able to show that it doesn’t run into the same issues as the ones we run into when we modify our beliefs in response to observations like this or in response to proofs like this.
And that your argument here is an argument for why it won’t be possible to make double-update arguments about this more narrow type of beliefs or reasoning. (Because it’s so fundamental that if it differs between agents, they can’t be correlated with each other.)
That argument seems plausible but not very robust to me at the moment. I mostly don’t have a great sense of what the “some beliefs to get [otherwise updateless] EDT off the ground”-paradigm looks like. Maybe I’ll look into it more.
I think the same problem arises for some empirical questions too—T1 and T2 can be questions like “is iron’s atomic number 26 or 27?” I would have been roughly 50-50 before looking it up, but I’m uncertain if I should try to cooperate with people living in worlds where the atomic number of iron is 27 - I don’t know if those worlds are compatible with life.
Minor: The question about whether those worlds are compatible with life seems like a logical rather than empirical question to me. So this still seems like an issue with logical updatelessness rather than empirical updatelessness.
As an example: when I’m betting on the atomic number of iron, I shouldn’t think of myself as cooperating with versions of myself who live in a world where iron has 27 protons. Those worlds might not exist. But I’m cooperating with instances where the game-master decided to ask if iron has 25 or 26 protons.
As in: If you’re in a counterfactual mugging where Omega says they’d reward you in a world where iron has 27 protons if you pay in this world, then you pay because you expect there to be a bunch of Omegas elsewhere doing other logical counterfactual muggings. In roughly half of those cases, someone is about to get paid by Omega if their impossible counterpart pays. And your action provides evidence that their impossible counterpart pays and that Omega gives them the reward.
And the same structure applies in the calculator example and the “conjunction of two theorems” example, because you’re correlated with a bunch of other distant people where you have so little information about the details of their situation that your epistemic position is “ex-ante” relative to their dilemma, so even if you’re updateful, you bet to optimize ex-ante utility in your case, to get evidence that they bet to optimize ex-ante utility in their case.
Hm, maybe that’s right.
Doesn’t that feel really unsatisfying though? I still feel like updateful EDT recommends wrong actions in important test cases. It’s just a contingent fact that most dilemmas like this will be small-scale in a large-scale universe, and that EDT’s recommendation to act as if you double-update will be swamped by not wanting to get evidence that other people elsewhere double-update. And there’s always going to be that force pushing towards recommending double-updating, so if you ever get evidence that a decision is high-stakes enough and universal enough throughout the universe, EDT may well recommend that you make an exception for it and do a proper double-update on it, which seems bad.
Markets can be better or worse depending on e.g. liquidity. My guess would be that today’s markets are better. (The large difference between 83 and 91 cents failing to disappear from arbitrage is an indication that at least one of those markets wasn’t so great, though I haven’t checked how current markets look on that metric.)
Maybe I’m confused about how much you believe that my actual life history matters. I think in the case of empirical updatelessness, my life history doesn’t really matter—I will eventually try to trade with people in proportion to something like their measure in the Solomonoff prior, and not with worlds where Austria and Australia are the same country, even though I was uncertain about this empirical fact when I was 5. (Do you agree with this, or do you think life history also matters for empirically updateless trade?)
If you were 100% empirically updateless and 0% logically updateless, then I think your life history wouldn’t matter except insofar as it led you to learning different logical facts. Insofar as you would eventually reach ‘logical maturity’ regardless of your life history, and learn some vast and similar swaths of logical facts, then yeah, eventually your life history wouldn’t matter.
I expect that logical updatelessness is similar—I will try to use some elegant construction like the Solomonoff prior to put a weight on different logical counterfactuals, and it won’t matter how my prior was constructed in my childhood.
My current understanding is that this requires updating on a bunch of logical arguments (about what that elegant construction should be and what it implies) and can therefore get you dutch-booked.
In many ways I like the baseline strategy of “ignore decision theory, act in ways that heuristically seem like they gather option-value, figure out all the hard stuff with the help of superintelligence”.
I guess your proposal is similar to that except there’s an addition of “we have a hunch that something like ECL works and that this means we should be a bit more cooperative, so we’ll be a bit more cooperative”.
But for some purposes, it does seem useful to know the implications of decision theory. A few examples (some more important than others):
Are there categories of information that people can plausibly be harmed by learning, that we should try to avoid, or will it clearly be correct (on reflection) to be retroactively updateless?
Understand more detailed implications of ECL, like who we should cooperate with and how much.
Should we do some crazy DT-motivated pseudo-alignment scheme like this or this?
For the purposes of answering questions like this, I’m interested in whether I basically buy EDT (and with what kind of updatelessness) or if the real answer to DT is going to be pretty different.
One concern here is that EDT is maybe less of a coherent or appropriate decision theory than I thought. I don’t really think your plan addresses that. Like, your plan talks about how we’ll have lots of resources to think about what our prior should be, but doesn’t really address this part:
As someone who is sympathetic to EDT, I have arguments on the one hand that I shouldn’t update my prior on the basis of observations or proofs. (Indeed — arguments that suggest that such updates would lead to seemingly dumb and surely-avoidable accounting error.) But on the other hand, I need to do some sort of reasoning or learning to construct (or refine into coherence) my prior in the first place. I don’t currently know how to do this latter thing while avoiding the former thing.
These don’t seem like esoteric failures where it’s plausible that the hypothetical isn’t fair, or where the dynamic inconsistency is plausibly explained as being caused by changing values, or where there’s a fundamental tradeoff between the harms and value of new information. The errors happen in very mundane situations, and to me they look like dumb and surely-avoidable accounting errors.
More precisely: I think these failures probably enable Dutch books that CDT isn’t susceptible to.
So it’s not just that insufficient updatelessness fails to capture some potential value that you could have gotten if you were more updateless. It’s that it’s actively worse than CDT in some cases.
And that’s a large part of why I now feel more unhappy about EDT. I feel like I have a better sense of what my beliefs are than what my prior is. If EDT requires me to act according to my prior (and will lead to me making stupid decisions if I instead act according to changing beliefs) then I’m not sure exactly how to do that.
Here’s my current picture of EDT and UDT.
In situations where EDT agents have many copies or near-copies, an EDT agent operates by imagining that it simultaneously controls the decisions of all those copies. This works very elegantly as long as it optimizes with respect to its prior and (upon learning new information) just changes its beliefs about which people in the prior it can control the actions of. (I.e., when it sees a blue sky, it shouldn’t change its prior to exclude worlds without blue skies, but it should make its next decision according to argmax_A[EV_prior(U | “an agent like me who has seen a blue sky would take action A”)].)
As described here, EDT agents will act in very strange ways if they also update their prior upon observing evidence. As described here, EDT agents will act similarly strangely if they update their logical priors upon deriving a proof. (Though the logical situation is somewhat less bad than the empirical situation.) These don’t seem like esoteric failures where it’s plausible that the hypothetical isn’t fair, or where the dynamic inconsistency is plausibly explained as being caused by changing values, or where there’s a fundamental tradeoff between the harms and value of new information. The errors happen in very mundane situations, and to me they look like dumb and surely-avoidable accounting errors.
Unfortunately, the most obvious solution to the problem is something like “stick with your prior and never change it”. And that’s not really available as a solution to a bounded agent like me.
I don’t have a complete and coherent prior of the world. I can procedurally generate beliefs when asked about them, and you could try to construct a complete set of beliefs (perhaps to be used as priors) out of this (e.g. you could say that I “currently believe X with probability p” if I would respond with probability p upon being asked about X and given 1 minute to think). But any such set of beliefs would be very contradictory and incoherent. And I suspect that EDT might not look as good if the prior it starts with is very contradictory and incoherent.[1]
This creates a bit of a puzzle. As someone who is sympathetic to EDT, I have arguments on the one hand that I shouldn’t update my prior on the basis of observations or proofs. (Indeed — arguments that suggest that such updates would lead to seemingly dumb and surely-avoidable accounting error.) But on the other hand, I need to do some sort of reasoning or learning to construct (or refine into coherence) my prior in the first place. I don’t currently know how to do this latter thing while avoiding the former thing.
Perhaps the most elegant thing would be to have an account of some particular kind of reasoning or learning that could be used to construct/refine priors, where we’d be able to show that it doesn’t run into the same issues as the ones we run into when we modify our beliefs in response to observations like this or in response to proofs like this. Or maybe it’s EDT that needs to change, or the idea of making decisions based on priors.
Misc points:
I think open-minded updatelessness is an attempt at something like this, where “awareness growth” is a separate type of operation from normal updating, and only awareness growth is allowed to modify the prior. I find it hard to evaluate because I don’t know how awareness growth is supposed to mechanically work.
I don’t think FDT obviously does better than EDT here.
You might hope that the question of “what’s the prior?” might turn out to not be so important, as long as we eventually receive enough evidence about what subsets of universes we have the most impact in. However, it looks to me like the prior may be extremely important, because it seems plausible to me that EDT recommends doing ECL from the perspective of the prior. If so, what values we benefit in the future may be primarily determined by their frequency in our prior.
I’m not sure though. Maybe there’s some minimum level of coherence that is sufficient to motivate reasoning from an ex-ante perspective. This report from Martin Soto tries to construct somewhat coherent and complete priors from logical inductors run for a finite amount of time, and then do updatelessness on the basis of them. That seems like a promising place to look for further insights.
I do think there’s a sense in which CDT behavior is evolutionarily selected for in environments where agents can’t see each other’s decision theories.
I don’t see this as a big problem with UDT. If UDT wants to be evolutionarily fit relative to other agents in the environment, then I think they could adopt CDT behavior and do just as well as CDT.
It’s just that, due to the virtue of their decision theory (according to themselves), they have the option of giving up evolutionary fitness in exchange for higher utility in the short run. If they care more about short-run utility than evolutionary fitness, then perhaps they take the deal.
I don’t think this option is a strike against UDT. In any situation where agents care about X but we’re scoring them on Y, there will be scenarios where their Y-score gets hurt if we give them tools for achieving X that trade off against Y.
If there’s a population of mostly TDT/UDT agents and few CDT agents (and nobody knows who the CDT agents are) and they’re randomly paired up to play one-shot PD, then the CDT agents do better. What does this imply?
Maybe you’d get the same effect if you had 100% UDT agents but with 99% being in blue rooms and 1% being in red rooms. The ones in red rooms would reason that they could defect against the ones in blue rooms because they are in a relevantly different situation due to being in a minority that can easily coordinate defection against the majority. (With the majority still being motivated to cooperate even if they are only correlated with each other.) If so, there’s a sense in which the CDT agents aren’t benefitting any more than they would if they were UDT agents who got a CDT sticker.
(Note that the red room / blue room thing doesn’t fundamentally break correlations here. Two UDT agents who are playing a symmetric game against each other, when one is in a blue room and one is in a red room, would still be able to cooperate. The thing that breaks the correlation is that the people in the red room are in an easily identifiable minority in a game where a minority can benefit from defection. Which isn’t true in symmetric PD.)
This would raise the question of what’d happen if all the UDT agents were given different serial numbers. Are there some serial numbers that could defect without negative evidence about the other UDT agents?
Hm, in order for this to work in practice, the UDT agents would have to have their serial number or room-color assignment already be present in their prior. If it’s information they receive later on, they should probably be updateless about it and just cooperate even if they’re in the minority.
I think this is a strong argument that EDT agents shouldn’t do Bayesian updates on empirical observations. I thought that it might still be ok to change your mind on the basis of logical arguments and reasoning (not empirical data or observations). But I think a very similar argument bites against that.
Example:
There are 2 mathematical propositions, each of which you think has an independent 50% probability of being true: Y1 and Y2.
The proposition X = “Y1 and Y2”.
Presumably you assign 25% to X being true.
Let’s say you try to prove Y1 to be true, and succeed. You don’t have time to prove Y2. Naively, you’d now expect to assign 50% to X being true, or be willing to bet at 1:1 odds.
However, let’s say that there are many copies of you across the universe, and equally many of them tried to prove statement Y1 as tried to prove Y2. For simplicity, let’s say everyone who tried to prove a true statement succeeded, and no one had time to attempt to prove more than one statement.
Given an opportunity to bet on X being true, and thinking about your odds, you reason:
If X is true, then Y2 will be true (in addition to Y1).
So if X is true, and I bet on X, then everyone will bet on X and everyone will win. (Assuming that someone who proved Y2 is relevantly in the same position as me, so that my choosing to bet provides strong evidence that they will bet.)
If X is false, then Y2 will be false.
So if X is false, and everyone in my position bets on X, just 1/2 of the people will lose. (The ones who proved Y1.)
So the stakes are twice as high if X is true as if X is false.
Since I assign X a 50% chance (or 1:1) of being true, I will bet at (1:1) * (2:1) = (2:1) odds that X is true. I.e., from this perspective, the EV calculation becomes:
EV(bet on 2:1 odds) = 50% [that X is true] * 2 [people who win if X is true] * 1 [payout if X is true] + 50% [that X is false] * 1 [people who lose if X is false] * (-2) [payout if X is false] = 0.
This strategy will lose money in expectation.
From the ex-ante perspective, there are 4 equiprobable worlds where (Y1,Y2) have different truth values. In 1 of them, neither is true; in 2 of them, exactly 1 is true; and in 1 of them, both are true. From the ex-ante perspective, there are 2 people who prove their statement true when X is false; and 2 people who prove their statement true when X is true. If they all bet at 2:1 odds that X is true, they’ll lose money in expectation.
One difference from the empirical case in the post above is that you need to perceive yourself as correlated with people who proved a different statement than you did.
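To make the arithmetic in the example concrete, here is a minimal Python sketch that enumerates the four (Y1, Y2) worlds. It rests on assumptions that are mine rather than part of the example: exactly one copy attempts each statement (standing in for "equally many"), and a bet at 2:1 odds pays +1 per bettor if X is true and -2 if X is false. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Assumed bet terms for "2:1 odds that X is true":
# each bettor gains 1 if X is true and loses 2 if X is false.
WIN, LOSS = Fraction(1), Fraction(-2)


def ex_ante_ev():
    """Expected total payout over the four equiprobable (Y1, Y2) worlds."""
    total = Fraction(0)
    for y1, y2 in product([True, False], repeat=2):
        x = y1 and y2
        # Assumption: one copy attempts each statement, and only attempts
        # on true statements succeed. Only successful provers bet.
        bettors = int(y1) + int(y2)
        total += Fraction(1, 4) * bettors * (WIN if x else LOSS)
    return total


def double_update_ev():
    """EV as perceived by a copy that proved Y1, updated to P(X) = 1/2,
    and also weighs each world by how many correlated copies bet in it."""
    p_x = Fraction(1, 2)
    bettors_if_true, bettors_if_false = 2, 1
    return p_x * bettors_if_true * WIN + (1 - p_x) * bettors_if_false * LOSS


print("perceived EV after double update:", double_update_ev())  # 0
print("ex-ante EV of the same policy:", ex_ante_ev())           # -1/2
```

Under these assumptions the bet looks break-even to the agent who has updated and also counts copies, while the same policy loses 1/2 in expectation when scored from the prior.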
Edit 2026-04-20:
I significantly simplified the example above.
I want to flag that I think the argument against logical updates is somewhat weaker than the argument against empirical updates. In particular, this even more unappealing argument doesn’t apply to the logical case as far as I can tell. (And the toy example above, a conjunction of logical statements where you’ve proven one, is rarer than unreliable empirical evidence of logical facts, as in the calculator example above.)
Edit 2026-05-05: Dutch book variant:
Before proving, the bookie offers:
If X is true → you pay $1
If [the statement you’re about to prove] is true and X is false → you get $1.1
Then if you fail to prove your statement, revealing it to be false, no pay-out happens.
If you do prove your statement, the bookie offers:
If X is true → you get $0.7.
If [the statement you just proved] is true and X is false → you pay $1.2.
(2nd offer is accepted because the EDT agent perceives the stakes to be twice as high if X is true.)
Overall payout:
[your statement] is false → $0
X is true → -$0.3
[your statement] is true and X is false → -$0.1
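The Dutch book above can be checked mechanically. The sketch below is a rough illustration under my own assumptions: the first offer is evaluated with prior probabilities (25% that X is true, 25% that your statement is true while X is false), and the second offer is evaluated the way the example describes, with a 50% credence in X and stakes counted double when X is true. The dictionary keys and function names are hypothetical.

```python
from fractions import Fraction

# Payouts to the agent, per outcome, copied from the two offers above.
FIRST_OFFER = {"x_true": Fraction(-1), "stmt_true_x_false": Fraction(11, 10)}
SECOND_OFFER = {"x_true": Fraction(7, 10), "stmt_true_x_false": Fraction(-12, 10)}


def first_offer_ev():
    """Assumed ex-ante evaluation: P(X) = 1/4, P(statement true, X false) = 1/4."""
    quarter = Fraction(1, 4)
    return quarter * FIRST_OFFER["x_true"] + quarter * FIRST_OFFER["stmt_true_x_false"]


def second_offer_perceived_ev():
    """After the proof: P(X) = 1/2, with twice as many correlated bettors if X is true."""
    half = Fraction(1, 2)
    return half * 2 * SECOND_OFFER["x_true"] + half * 1 * SECOND_OFFER["stmt_true_x_false"]


def net_payouts():
    """Total payout per outcome if both offers are accepted when available."""
    return {
        # A failed proof means no first-offer payout and no second offer.
        "stmt_false": Fraction(0),
        "x_true": FIRST_OFFER["x_true"] + SECOND_OFFER["x_true"],
        "stmt_true_x_false": FIRST_OFFER["stmt_true_x_false"]
        + SECOND_OFFER["stmt_true_x_false"],
    }


print("first offer EV:", first_offer_ev())                       # 1/40
print("second offer perceived EV:", second_offer_perceived_ev()) # 1/10
for outcome, payout in net_payouts().items():
    print(outcome, float(payout))
```

Both offers look favorable at the time they are made, yet no outcome leaves the agent ahead, and two of the three leave it strictly behind.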
seems very plausible that our long-term civilizational trajectory is significantly affected by which type of AGI gets built first
I of course agree, but I’d think this would mostly be an issue of capabilities or goodness of our future society, since there’s not much external to our society that’s getting worse as a result of the transition. Anyway, that seems like maybe one of those definitional issues. I think you’re probably right that there are some possible changes that aren’t well characterized as being about the capabilities or goodness of our society, so an improvement in those dimensions isn’t strictly speaking sufficient for a pause to not have been valuable.
I care more about my claim that started with “I just think there’s a decent chance...”. (Which is importantly only asserting a decent chance, not saying that there aren’t plausible ways it could be false.)
If it was truly at “what people imagined stage 4 to be”, you might think that you/Ege/Ajeya are supposed to assign 90%/30%/75% to AGI within the next ~2.5 years. (Though ofc you could have had other updates that cancel out something here.) I think in fact all of you are lower than your own numbers there.