Previously “Lanrian” on here. Research analyst at Redwood Research. Views are my own.
Feel free to DM me, email me at [my last name].[my first name]@gmail.com or send something anonymously to https://www.admonymous.co/lukas-finnveden
In your example, you know T1 and your counterpart knows T2. You see your behavior as correlated with the behavior of your counterpart. Under these conditions, it seems like T1 can’t possibly be so fundamental that you need it in order to do EDT-style reasoning (otherwise your counterpart couldn’t). So even if you grant that you need some beliefs to get EDT off the ground, it seems like those beliefs must be in the intersection of T1 and T2, in which case you wouldn’t run into this problem.
Ok, so I think this argument probably works if all EDT agents have the same set of core “beliefs to get EDT off the ground” but fails if they have different such cores while still maintaining significant correlation with each other. At the moment, “all correlated EDT agents will share the same core” seems like a bold and implausible hypothesis to me. E.g. I think this implies they’d all share the same prior.
Consider the agents in the non-omniscience paper.[1] Their prior is based on maximum entropy and on ensuring logical coherence on some set of sentences S_0 and extensions of S_0. If two agents had different S_0 (or choices of how to extend S_0), then that might cause them to have different priors about T1 and T2. I find it likely that (i) the choice of S_0 is arbitrary enough that it would vary between agents,[2] (ii) that this could cause some differences in beliefs, and (iii) that those different agents would still be correlated enough to enable reasoning analogous to double updating.
Now, this wouldn’t actually enable dutch books, because there wouldn’t be any inconsistency over time. If you could read the agents’ minds, you’d see them reason about things in a double-update kind-of-way, but this would take place during the construction of their prior, so there wouldn’t be any dynamic inconsistency. But (i) I still find the double update to be a bad sign even if there’s no dynamic inconsistency, and (ii) I worry that the agents only escape dynamic inconsistency because they precompute their entire prior at the start, which is unrealistic and would likely have to be replaced with some more dynamic refining of their prior for realistic agents, which may then cause dynamic inconsistency.
More precisely, let’s consider an updateless version of them: one that maximizes E(U|”I take action A on input B”). The version in the paper instead maximizes E(U|”I take action A”, B), which is straightforwardly updateful and therefore straightforwardly double-updates on all kinds of observations.
The paper just says “For now, we will assume that we have already identified a reasonably-sized set S0 containing some sentences of interest to us—descriptions of what we may observe or decide, statements about what we value, explanatory hypotheses which might account for our observations, and so on.” This seems like a pretty different approach than seeking the absolute minimum S0, where if an EDT agent could at all function with one fewer sentence, we’d always remove that sentence.
If it was truly at “what people imagined stage 4 to be”, you might think that you/Ege/Ajeya are supposed to assign 90%/30%/75% to AGI within the next ~2.5 years. (Though ofc you could have had other updates that cancel out something here.) I think in fact all of you are lower than your own numbers there.
This newer model is being continually trained on oodles of data from a huge base of customers; they have it do all sorts of tasks and it tries and sometimes fails and sometimes succeeds and is trained to succeed more often.
My sense is that this isn’t a big part of the story for how new models’ capabilities are being increased. Though I don’t think we know for sure either way.
Now many millions of people are basically treating it like a coworker and virtual assistant. People are giving it their passwords and such and letting it handle life admin tasks for them, help with shopping, etc. and of course quite a lot of code is being written by it.
This seems accurate for coders. Is it true for people who aren’t coders? It’s not really true for my job or life admin tasks (like I use the models a fair bit, but it’s more in chat-bot mode than in agent-mode / trusting the models to do a lot of stuff for me) but maybe it’s more true for others.
Cool work! Some big similarities to imitative generalization. (Which is great, since imitative generalization seems like a theoretically promising direction but underspecified and I don’t know of empirical work studying it.)
Looking at your first figure, it seems like the consistently biggest cost that Earth datacenters have that space datacenters don’t is “OpEx”.
What does this include and why can you avoid it in space?
Searching through the article for mentions of operational expenditures, I find one table saying OpEx includes “Energy, Staffing/Maintenance, Taxes” and a longer mention saying:
On the operational side, terrestrial centers face ongoing energy bills, staffing, maintenance, and taxes, whereas ODCs largely eliminate these recurring costs by generating solar power freely in orbit.
Going through these:
Energy is accounted for elsewhere in the figure, so presumably that isn’t included.
Staffing/maintenance: Why is this cheaper in space?
I can see why space deployments would realistically have to find ways to avoid normal types of staffing and maintenance, but that’s because it’d be extremely expensive to do them in space, which is a downside of space. I don’t yet see the upside. In other words: If there are ways to run a space datacenter that don’t require staff or maintenance, why can’t you use those same methods on Earth?
Taxes: What taxes are these that you avoid by going to space? (I don’t find anything on this when searching the article for “tax”.)
Wow I do find it funny how ChatGPT just decided that the term for “(” and ”)” in Spanish was “—”. Maybe a sign that it’s not the best choice of model for this task.
Also probably you can get a decent sense of how LLM-ish it is by translating it back to English again with an LLM?
That doesn’t seem like an analogous case to me. In the case you describe, it’s ex-ante optimal to commit to betting at 9999:1 odds. In the original case described in the post, it’s ex-ante optimal to commit to betting at 99:1 odds. That’s a big important difference.
But that’s not what is going on here since mrcSSA does not mean “be updateless”/”use your prior in EU calculations”, it just means don’t anthropically update your prior based on how many observers are in different worlds.
I don’t think there’s any kind of empirical update that’s not an anthropic update. Anthropic theories are supposed to generalize and clarify how updates work in confusing cases where multiple observers exist.
mrcSSA only updates to exclude worlds where no one with your experiences exists. If all worlds under your consideration contain someone with your experiences, mrcSSA doesn’t update, so in that sense it does end up using its prior in EU calculations.
Maybe relevant: https://jsteinhardt.stat.berkeley.edu/blog/film-study
So my current impression is basically that you’re optimistic about something similar to this:
Perhaps the most elegant thing would be to have an account of some particular kind of reasoning or learning that could be used to construct/refine priors, where we’d be able to show that it doesn’t run into the same issues as the ones we run into when we modify our beliefs in response to observations like this or in response to proofs like this.
And that your argument here is an argument for why it won’t be possible to make double-update arguments about this more narrow type of beliefs or reasoning. (Because it’s so fundamental that if it differs between agents, they can’t be correlated with each other.)
That argument seems plausible but not very robust to me at the moment. I mostly don’t have a great sense of what the “some beliefs to get [otherwise updateless] EDT off the ground”-paradigm looks like. Maybe I’ll look into it more.
I think the same problem arises for some empirical questions too—T1 and T2 can be questions like “is iron’s atomic number 26 or 27?” I would have been roughly 50-50 before looking it up, but I’m uncertain if I should try to cooperate with people living in worlds where the atomic number of iron is 27 - I don’t know if those worlds are compatible with life.
Minor: The question about whether those worlds are compatible with life seems like a logical rather than empirical question to me. So this still seems like an issue with logical updatelessness rather than empirical updatelessness.
As an example: when I’m betting on the atomic number of iron, I shouldn’t think of myself as cooperating with versions of myself who live in a world where iron has 27 protons. Those worlds might not exist. But I’m cooperating with instances where the game-master decided to ask if iron has 25 or 26 protons.
As in: If you’re in a counterfactual mugging where Omega says they’d reward you in a world where iron has 27 protons if you pay in this world, then you pay because you expect there to be a bunch of Omegas elsewhere doing other logical counterfactual muggings. In roughly half of those cases, someone is about to get paid by omega if their impossible counterpart pays. And your action provides evidence that their impossible counterpart pays and that Omega gives them the reward.
And the same structure applies in the calculator example and the “conjunction of two theorems” example, because you’re correlated with a bunch of other distant people where you have so little information about the details of their situation that your epistemic position is “ex-ante” relative to their dilemma, so even if you’re updateful, you bet to optimize ex-ante utility in your case, to get evidence that they bet to optimize ex-ante utility in their case.
Hm, maybe that’s right.
Doesn’t that feel really unsatisfying though? I still feel like updateful EDT recommends wrong actions in important test cases. It’s just a contingent fact that most dilemmas like this will be small-scale in a large-scale universe, and that EDT’s recommendation to act as if you double-update will be swamped by not wanting to get evidence that other people elsewhere double-update. And there’s always going to be that force pushing towards recommending double-updating, so if you ever get evidence that a decision is high-stakes enough and universal enough throughout the universe, EDT may well recommend that you make an exception for it and do a proper double-update on it, which seems bad.
Markets can be better or worse depending on eg liquidity. My guess would be that today’s markets are better. (The large difference between 83 and 91 cents failing to disappear from arbitrage is an indication that at least one of those markets wasn’t so great, though I haven’t checked how current markets look on that metric.)
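To make the arbitrage point concrete, here is a rough sketch of the arithmetic. It assumes (these are assumptions, not details from the markets themselves) that 83 and 91 cents are prices for a YES share on the same binary question in two different markets, that each share pays 100 cents if it resolves correctly, that a NO share in the more expensive market costs 100 minus its YES price, and that there are no fees, spreads, or capital lock-up costs. The function name is made up for illustration.

```python
def arbitrage_profit_cents(yes_price_low, yes_price_high, payout=100):
    """Guaranteed profit (in cents) per YES/NO pair under the assumptions above."""
    # Buy YES in the cheap market, and NO in the market where YES is expensive.
    # Assumption: NO costs (payout - YES price) in that market.
    cost = yes_price_low + (payout - yes_price_high)
    # Exactly one of the two shares pays out, whichever way the question resolves.
    return payout - cost


profit = arbitrage_profit_cents(83, 91)
print(profit)
```

Under those idealized assumptions the pair costs 92 cents and always pays 100, so a gap that large persisting suggests that fees, low liquidity, or similar frictions were eating the difference in at least one of the markets.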
Maybe I’m confused about how much you believe that my actual life history matters. I think in the case of empirical updatelessness, my life history doesn’t really matter—I will eventually try to trade with people in proportion to something like their measure in the Solomonoff prior, and not with worlds where Austria and Australia are the same country, even though I was uncertain about this empirical fact when I was 5. (Do you agree with this, or do you think life history also matters for empirically updateless trade?)
If you were 100% empirically updateless and 0% logically updateless, then I think your life history wouldn’t matter except insofar as it led you to learning different logical facts. Insofar as you would eventually reach ‘logical maturity’ regardless of your life history, and learn some vast and similar swaths of logical facts, then yeah, eventually your life history wouldn’t matter.
I expect that logical updatelessness is similar—I will try to use some elegant construction like the Solomonoff prior to put a weight on different logical counterfactuals, and it won’t matter how my prior was constructed in my childhood.
My current understanding is that this requires updating on a bunch of logical arguments (about what that elegant construction should be and what it implies) and can therefore get you dutch-booked.
In many ways I like the baseline strategy of “ignore decision theory, act in ways that heuristically seem like they gather option-value, figure out all the hard stuff with the help of superintelligence”.
I guess your proposal is similar to that except there’s an addition of “we have a hunch that something like ECL works and that this means we should be a bit more cooperative, so we’ll be a bit more cooperative”.
But for some purposes, it does seem useful to know the implications of decision theory. A few examples (some more important than others):
Are there categories of information that people can plausibly be harmed by learning, that we should try to avoid, or will it clearly be correct (on reflection) to be retroactively updateless?
Understand more detailed implications of ECL, like who we should cooperate with and how much.
Should we do some crazy DT-motivated pseudo-alignment scheme like this or this?
For the purposes of answering questions like this, I’m interested in whether I basically buy EDT (and with what kind of updatelessness) or if the real answer to DT is going to be pretty different.
One concern here is that EDT is maybe less of a coherent or appropriate decision theory than I thought. I don’t really think your plan addresses that. Like, your plan talks about how we’ll have lots of resources to think about what our prior should be, but doesn’t really address this part:
As someone who is sympathetic to EDT, I have arguments on the one hand that I shouldn’t update my prior on the basis of observations or proofs. (Indeed — arguments that suggest that such updates would lead to seemingly dumb and surely-avoidable accounting error.) But on the other hand, I need to do some sort of reasoning or learning to construct (or refine into coherence) my prior in the first place. I don’t currently know how to do this latter thing while avoiding the former thing.
These don’t seem like esoteric failures where it’s plausible that the hypothetical isn’t fair, or where the dynamic inconsistency is plausibly explained as being caused by changing values, or where there’s a fundamental tradeoff between the harms and value of new information. The errors happen in very mundane situations, and to me they look like dumb and surely-avoidable accounting errors.
More precisely: I think these failures probably enable dutch books that CDT isn’t susceptible to.
So it’s not just that insufficient updatelessness fails to capture some potential value that you could have gotten if you were more updateless. It’s that it’s actively worse than CDT in some cases.
And that’s a large part of why I now feel more unhappy about EDT. I feel like I have a better sense of what my beliefs are than what my prior is. If EDT requires me to act according to my prior (and will lead to me making stupid decisions if I instead act according to changing beliefs) then I’m not sure exactly how to do that.
Here’s my current picture of EDT and UDT.
In situations where EDT agents have many copies or near-copies, an EDT agent operates by imagining that it simultaneously controls the decisions of all those copies. This works very elegantly as long as it optimizes with respect to its prior and (upon learning new information) just changes its beliefs about which people in the prior it can control the actions of. (I.e., when it sees a blue sky, it shouldn’t change its prior to exclude worlds without blue skies, but it should make its next decision by choosing argmax_A[EV_prior(U|”an agent like me who has seen a blue sky would take action A”)].)
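As a minimal sketch of the difference between optimizing with respect to the prior and optimizing after updating, here is a toy counterfactual-mugging-style setup. Everything in it is an assumption chosen for illustration: the two equally likely worlds, the payoffs of -100 and 10,000, and the simplification that conditioning on "an agent like me who has seen tails takes action A" just fixes the action of every such agent across the prior.

```python
from fractions import Fraction

# Hypothetical prior over two worlds (illustrative numbers only).
PRIOR = {"heads": Fraction(1, 2), "tails": Fraction(1, 2)}
ACTIONS = ["pay", "refuse"]


def utility(world, action_on_tails):
    """Utility in `world` if agents like me who observe tails take `action_on_tails`."""
    if world == "tails":
        return -100 if action_on_tails == "pay" else 0
    # In the heads world, the agent is rewarded iff its tails-counterpart pays.
    return 10_000 if action_on_tails == "pay" else 0


def ev_prior(action):
    """EV_prior(U | "an agent like me who has seen tails takes `action`")."""
    return sum(p * utility(world, action) for world, p in PRIOR.items())


def ev_updated(action, observed="tails"):
    """Updateful version: the posterior puts all its mass on the observed world."""
    return utility(observed, action)


updateless_choice = max(ACTIONS, key=ev_prior)
updateful_choice = max(ACTIONS, key=ev_updated)
print(updateless_choice, updateful_choice)
```

In this toy, the agent that keeps its prior pays and the agent that conditions on having seen tails refuses. The sketch only illustrates the two objective functions; it does not capture correlations between non-identical agents.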
As described here, EDT agents will act in very strange ways if they also update their prior upon observing evidence. As described here, EDT agents will act similarly strangely if they update their logical priors upon deriving a proof. (Though the logical situation is somewhat less bad than the empirical situation.) These don’t seem like esoteric failures where it’s plausible that the hypothetical isn’t fair, or where the dynamic inconsistency is plausibly explained as being caused by changing values, or where there’s a fundamental tradeoff between the harms and value of new information. The errors happen in very mundane situations, and to me they look like dumb and surely-avoidable accounting errors.
Unfortunately, the most obvious solution to the problem is something like “stick with your prior and never change it”. And that’s not really available as a solution to a bounded agent like me.
I don’t have a complete and coherent prior of the world. I can procedurally generate beliefs when asked about them, and you could try to construct a complete set of beliefs (perhaps to be used as priors) out of this (e.g. you could say that I “currently believe X with probability p” if I would respond with probability p upon being asked about X and given 1 minute to think). But any such set of beliefs would be very contradictory and incoherent. And I suspect that EDT might not look as good if the prior it starts with is very contradictory and incoherent.[1]
This creates a bit of a puzzle. As someone who is sympathetic to EDT, I have arguments on the one hand that I shouldn’t update my prior on the basis of observations or proofs. (Indeed — arguments that suggest that such updates would lead to seemingly dumb and surely-avoidable accounting error.) But on the other hand, I need to do some sort of reasoning or learning to construct (or refine into coherence) my prior in the first place. I don’t currently know how to do this latter thing while avoiding the former thing.
Perhaps the most elegant thing would be to have an account of some particular kind of reasoning or learning that could be used to construct/refine priors, where we’d be able to show that it doesn’t run into the same issues as the ones we run into when we modify our beliefs in response to observations like this or in response to proofs like this. Or maybe it’s EDT that needs to change, or the idea of making decisions based on priors.
Misc points:
I think open-minded updatelessness is an attempt at something like this, where “awareness growth” is a separate type of operation from normal updating, and only awareness growth is allowed to modify the prior. I find it hard to evaluate because I don’t know how awareness growth is supposed to mechanically work.
I don’t think FDT obviously does better than EDT here.
You might hope that the question of “what’s the prior?” might turn out to not be so important, as long as we eventually receive enough evidence about what subsets of universes we have the most impact in. However, it looks to me like the prior may be extremely important, because it seems plausible to me that EDT recommends doing ECL from the perspective of the prior. If so, what values we benefit in the future may be primarily determined by their frequency in our prior.
I’m not sure though. Maybe there’s some minimum level of coherence that is sufficient to motivate reasoning from an ex-ante perspective. This report from Martin Soto tries to construct somewhat coherent and complete priors from logical inductors run for a finite amount of time, and then do updatelessness on the basis of them. That seems like a promising place to look for further insights.
I do think there’s a sense in which CDT behavior is evolutionarily selected for in environments where agents can’t see each others’ decision theories.
I don’t see this as a big problem with UDT. If UDT wants to be evolutionarily fit relative to other agents in the environment, then I think they could adopt CDT behavior and do just as well as CDT.
It’s just that, due to the virtue of their decision theory (according to themselves), they have the option of giving up evolutionary fitness in exchange for higher utility in the short run. If they care more about short-run utility than evolutionary fitness, then perhaps they take the deal.
I don’t think this option is a strike against UDT. In any situation where agents care about X but we’re scoring them on Y, there will be scenarios where their Y-score gets hurt if we give them tools for achieving X which trades off against Y.
If there’s a population of mostly TDT/UDT agents and few CDT agents (and nobody knows who the CDT agents are) and they’re randomly paired up to play one-shot PD, then the CDT agents do better. What does this imply?
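The payoff comparison behind this claim can be sketched with a few lines of arithmetic. The payoff values (T=5, R=3, P=1, S=0) are assumed for illustration, as is the simplification that every TDT/UDT agent cooperates, every CDT agent defects, and each agent's opponent is drawn from the population at large.

```python
from fractions import Fraction

# Hypothetical prisoner's dilemma payoffs with T > R > P > S.
T, R, P, S = 5, 3, 1, 0


def expected_payoffs(frac_cdt):
    """Expected one-shot payoffs (udt, cdt) against a randomly drawn opponent.

    Assumes UDT agents always cooperate and CDT agents always defect.
    """
    frac_udt = 1 - frac_cdt
    udt = frac_udt * R + frac_cdt * S  # cooperator meets cooperator or defector
    cdt = frac_udt * T + frac_cdt * P  # defector meets cooperator or defector
    return udt, cdt


udt_payoff, cdt_payoff = expected_payoffs(Fraction(1, 100))
print(udt_payoff, cdt_payoff)
```

With 1% CDT agents under these assumed payoffs, the defectors come out ahead per game, which is the sense in which the CDT agents "do better" here.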
Maybe you’d get the same effect if you had 100% UDT agents but with 99% being in blue rooms and 1% being in red rooms. The ones in red rooms would reason that they could defect against the ones in blue rooms because they are in a relevantly different situation due to being in a minority that can easily coordinate defection against the majority. (With the majority still being motivated to cooperate even if they are only correlated with each other.) If so, there’s a sense in which the CDT agents aren’t benefitting any more than they would if they were UDT agents who got a CDT sticker.
(Note that the red room / blue room thing doesn’t fundamentally break correlations here. Two UDT agents who are playing a symmetric game against each other, when one is in a blue room and one is in a red room, would still be able to cooperate. The thing that breaks the correlation is that the people in the red room are in an easily identifiable minority in a game where a minority can benefit from defection. Which isn’t true in symmetric PD.)
This would raise the question of what’d happen if all the UDT agents were given different serial numbers. Are there some serial numbers that could defect without negative evidence about the other UDT agents?
Hm, in order for this to work in practice, the UDT agents would have to have their serial number or room-color assignment already be present in their prior. If it’s information they receive later-on, they should probably be updateless about it and just cooperate even if they’re in the minority.
...
Seems worth flagging that if you were facing a version of omicron that wasn’t able to predict (or observe) whether you’d go with CDT or LDT, then even if omicron can predict the output of both CDT and LDT, your-full-self-which-omicron-can’t-predict is not incentivized to self-modify into a version of LDT that one-boxes against omicron.
I.e., I think that “you who are struggling with what to decide” would be incentivized to self-modify into “son-of ‘you who are struggling with what to decide’” which may well act differently from a purer form of LDT that didn’t care about its history.