Here’s a story for how we could get lots of AI help with AI safety research even if schemers are somewhat common and diffuse control doesn’t work to get them to help us:
Let’s say a company successfully solves control for “high-stakes”/“concentrated” threats, meaning that AIs can safely be deployed without a large risk of them hacking their own datacenters, exfiltrating, or taking any similar catastrophic action. (At a given level of capability.)
However, let’s say that control fails to solve the problem of getting the AIs to try hard at doing good AI safety research. When asked to do research, perhaps the schemers are sloppy, or perhaps they are actively trying to mislead us.
Still, the AI company can now safely train and deploy lots of different AIs without fear of them exfiltrating or hacking the lab. This gives the AI company lots of “shots on goal” at training AIs that generalize to earnestly helping us with automated alignment research. Maybe they develop a zoo of like 20 separate “countries of geniuses” in datacenters, including:
Maximally instruction-following AIs, instructed to help us in various ways.
Consequentialist AIs trained to care about human values.
Consequentialist AIs trained to play-the-training-game/scheme on behalf of human values. (So that they don’t get replaced by other schemers. Similar to inoculation prompting.)
AIs trained to just love alignment research (or some other project) in their heart. They’re just like a curious scientist who is obsessed with solving alignment.
AIs designed to be maximally amenable to making deals, and then offered some great payment in exchange for AI safety research progress.
AIs intentionally trained to be reward-seekers — and told that they will get maximum reward if they’re eventually judged to have done excellent AI safety research.
...
Perhaps most of these end up scheming against the AI company — but some end up genuinely aligned.
The company probably can’t initially tell apart the schemers from the aligned AIs, or indeed, tell which of the projects are succeeding on a day-to-day basis. (It’s just too hard to evaluate which research directions are genuinely promising vs. a waste of time.) But they run them all in parallel, and after several months of massively sped-up work, it’s likely much easier to adjudicate whose solutions make sense to adopt.
This is similar to how, in well-functioning scientific fields, it’s not typically possible to get consensus on what speculative research directions are most promising. But once a research project or whole research agenda is completed, there are normally ways in which the value-add can be demonstrated.
And maybe all that automated labor is enough to save the day.
Downsides with this approach:
If you split your compute across 20 different types of AIs, then your favorite type of AI is going to only have 5% as much compute available to it as it would’ve had if you concentrated your bets. This might be costly.
You’re increasing the risk of deploying at least one schemer. So you’re increasing the risk of high-stakes catastrophes like AIs hacking your data center. (Which is why it’s important that concentrated/high-stakes control is going well enough that the baseline risk is low.)
(Note that you could use a similar method of “wide variety of diverse AIs” as monitors, which could be very helpful for high-stakes control.)
On the bright side, you’re probably increasing your chances of catching at least one AI red-handed, which could be useful for research or for gathering legible evidence that risk is high.
With such evidence in hand, maybe you could convince others to coordinate to slow down, so that you don’t have to pursue the crazy plan of “deploy dozens of different AI models, because any given model is pretty likely to be scheming against you”.
Overall, thinking about this makes me:
Somewhat more bullish on the value of high-stakes/concentrated control research.
More bullish on developing somewhat dubious ideas for how to prevent scheming. Even if an idea might increase scheming risk in expectation (!), if it just has a bunch of uncertainty/randomness that’s independent from other approaches, then it could still add value. (See the sketch after this list.)
In discussions about whether AI companies should write their model specs to say X or Y, I’m more inclined to say that, at least internally, they should hedge their bets by deploying models of both the X variant and the Y variant.
(Though depending on what the X/Y disagreement is, this might not resolve the problem. E.g. if the concern is that the X variant will be too easy to misuse by AI company insiders, then having the X model be available as one model might be almost as bad as having it be the AI company’s only model.)
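To make the independence point concrete, here’s a minimal sketch (the numbers are purely illustrative assumptions, not from this discussion): an approach that’s worse in expectation can still raise the chance that at least one deployed model is aligned, so long as its failures are independent of the other approaches’.

```python
# Toy model: probability that at least one of k independently-developed model
# variants is aligned. All probability values here are illustrative assumptions.

def p_at_least_one_aligned(p_scheme: float, k: int) -> float:
    """Chance that at least one of k variants is aligned, assuming each
    variant schemes independently with probability p_scheme."""
    return 1 - p_scheme ** k

# Concentrating everything on one approach with a 50% scheming risk:
print(p_at_least_one_aligned(0.5, 1))           # 0.5

# Splitting across approaches that are each *worse* in expectation (60%
# scheming risk) but whose failures are independent of each other:
for k in (2, 5, 20):
    print(k, round(p_at_least_one_aligned(0.6, k), 5))
# k=2 -> 0.64, k=5 -> 0.92224, k=20 -> 0.99996
```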
I explore similar considerations here in The Case for Mixed Deployment (5 min read); key takeaways were:
Invest in second-best techniques (basically what you say about “developing somewhat dubious ideas for how to prevent scheming”)
Modular systems—i.e. architectures where you can “toggle” some variable at inference-time to give you more variants, e.g. gradient routing
Prioritize late-stage techniques, i.e. techniques which apply late in the AI lifecycle (e.g. post-training, prompting, activation steering, scaffolding), and deprioritise techniques which apply early in the AI lifecycle (e.g. architecture design and pretraining), because this makes it cheaper to develop many variants.
However, I think that (contra your proposal) most of the oomph comes from the AIs monitoring and cross-examining each other’s work, rather than running them in parallel. That is, I disagree with “If you split your compute across 20 different types of AIs, then your favorite type of AI is going to only have 5% as much compute available to it as it would’ve had if you concentrated your bets.” This is because I think we can run things like debate between all the variants.
One upshot of this difference is that I expect coordination between labs to matter significantly: if lab A starts with 100% schemers and lab B starts with 5% schemers, then we’ll elicit useful work from the AIs if the labs cross-examine each other’s AI research with their own AI variants.
Oh nice, thanks!
However, I think that (contra your proposal) most of the oomph comes from the AIs monitoring and cross-examining each other’s work, rather than running them in parallel.
Yeah, I think the case for diverse monitors to achieve high-stakes control is actually a lot more straightforward. It’s easy to implement a monitor in a way such that it has very little ability to autonomously cause bad things to happen, so basically the only cost is the extra compute it takes to train and run it. And in the high-stakes/concentrated failures regime, humans will typically be able to recognize attempted catastrophes once pointed out to them.
I agree that debate-style approaches seem good. I do think it seems a lot more feasible to judge such debates once the AIs have already accomplished a lot of work, vs. early on in the research when AIs are making taste-driven judgments about direction. So it complements nicely with letting AIs do a lot of work in parallel.
a company successfully solves control for “high-stakes”/“concentrated” threats, meaning that AIs can safely be deployed without a large risk of them hacking their own datacenters, exfiltrating, or taking any similar catastrophic action.
This seems like quite a difficult task, especially since we (or some other trusted agent) need to read the outputs of these AIs to get any value from their safety research. Schemers could self-exfiltrate by manipulating us somehow.
The company probably can’t initially tell apart the schemers from the aligned AIs, or indeed, tell which of the projects are succeeding on a day-to-day basis. (It’s just too hard to evaluate which research directions are genuinely promising vs. a waste of time.) But they run them all in parallel, and after several months of massively sped-up work, it’s likely much easier to adjudicate whose solutions make sense to adopt.
This is similar to how, in well-functioning scientific fields, it’s not typically possible to get consensus on what speculative research directions are most promising. But once a research project or whole research agenda is completed, there are normally ways in which the value-add can be demonstrated.
Maybe, but the schemers could optimize their research for looking-good-to-us, and it might be hard to distinguish this from actually good work. That’s a disanalogy with well-functioning scientific fields: scientists don’t deliberately pick bad research directions and try hard to make them seem good.
I think self-exfiltration via manipulation seems pretty hard. I think we’re likely to have transformatively useful systems that can’t do that, for some amount of time. (Especially since there’s no real reason to train them to be good at manipulation, though of course they might generalize from other stuff.) I agree people should definitely be thinking about it as a potential problem and should try to mitigate and estimate the risk.
That’s a disanalogy with well-functioning scientific fields: scientists don’t deliberately pick bad research directions and try hard to make them seem good.
Some scientists do optimize for (something like) “impressiveness” of their work, regardless of whether it’s good or bad. It’s true that they don’t intentionally compromise impressiveness in order to make the work more misleading. That said, if some models optimize for impressiveness, and some models compromise impressiveness to make their work more misleading, then I guess the humans should be somewhat more likely to use the more impressive work? So maybe that limits the degree of influence that intentionally misleading work could have.
I am definitely pretty worried about humans’ ability to judge what work is good or bad. In addition to the disanalogy you mention, I’d also highlight (i) non-experts typically struggle to judge which expert is right in a dispute, and if humans are slow & dumb enough relative to the AIs, humans might be analogous to non-experts, and (ii) it’s plausible to me that human science benefits a lot from the majority of scientists (and maybe especially the most capable scientists?) having at least a weak preference for truth, e.g. because it’s easier to recognize a few bad results when you can lean on the presumed veracity of most established results.
Maybe, but the schemers could optimize their research for looking-good-to-us, and it might be hard to distinguish this from actually good work
If you ask each variant to review the research of the other variants, then the schemers need to optimise their research for looking good to each variant. But the optimistic assumption is that at least one variant is an equally capable non-schemer.
then the schemers need to optimise their research for looking good to each variant
Not necessarily. The nice AIs also need to be able to win a debate against the schemers, as judged by humans. It’s not enough for the variants to be able to recognize poor research if they can’t show their work in a way that’s irrefutable by the schemers.
Here’s my current picture of EDT and UDT.
In situations where EDT agents have many copies or near-copies, an EDT agent operates by imagining that it simultaneously controls the decisions of all those copies. This works very elegantly as long as it optimizes with respect to its prior and (upon learning new information) just changes its beliefs about which agents in the prior it controls the actions of. (I.e., when it sees a blue sky, it shouldn’t change its prior to exclude worlds without blue skies, but it should pick its next action as argmax_A EV_prior(U | “an agent like me who has seen a blue sky takes action A”).)
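Here’s a minimal code sketch of that decision rule, under my own toy formalization (the World structure and the fixed default action are illustrative assumptions): the prior is never re-weighted; seeing the blue sky only changes which embedded agents’ actions my choice is treated as settling.

```python
# Toy updateless-EDT rule, following the formula above: pick the action A
# maximizing EV_prior(U | "every agent like me who has seen my observation
# takes action A"). Worlds may contain several (near-)copies, and choosing A
# is treated as choosing for all copies whose observation matches mine.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class World:
    prior: float                            # prior weight of this world
    observations: List[str]                 # one entry per embedded agent/copy
    utility: Callable[[List[str]], float]   # utility of a full action profile

def updateless_edt(worlds: List[World], my_obs: str,
                   actions: List[str], default: str = "noop") -> str:
    def ev_prior(a: str) -> float:
        ev = 0.0
        for w in worlds:
            # I control exactly the copies who saw what I saw; everyone else
            # is modeled as taking some fixed default action.
            profile = [a if obs == my_obs else default
                       for obs in w.observations]
            ev += w.prior * w.utility(profile)
        return ev
    # Crucially, the prior is NOT conditioned on my_obs here: observations
    # change what I control, not which worlds I weight.
    return max(actions, key=ev_prior)
```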
As described here, EDT agents will act in very strange ways if they also update their prior upon observing evidence. As described here, EDT agents will act similarly strangely if they update their logical priors upon deriving a proof. (Though the logical situation is somewhat less bad than the empirical situation.) These don’t seem like esoteric failures where it’s plausible that the hypothetical isn’t fair, or where the dynamic inconsistency is plausibly explained as being caused by changing values, or where there’s a fundamental tradeoff between the harms and value of new information. The errors happen in very mundane situations, and to me they look like dumb and surely-avoidable accounting errors.
Unfortunately, the most obvious solution to the problem is something like “stick with your prior and never change it”. And that’s not really available as a solution to a bounded agent like me.
I don’t have a complete and coherent prior over the world. I can procedurally generate beliefs when asked about them, and you could try to construct a complete set of beliefs (perhaps to be used as priors) out of this (e.g. you could say that I “currently believe X with probability p” if I would respond with probability p upon being asked about X and given 1 minute to think). But any such set of beliefs would be very contradictory and incoherent. And I suspect that EDT might not look as good if the prior it starts with is very contradictory and incoherent.[1]
This creates a bit of a puzzle. As someone who is sympathetic to EDT, I have arguments on the one hand that I shouldn’t update my prior on the basis of observations or proofs. (Indeed — arguments that suggest that such updates would lead to seemingly dumb and surely-avoidable accounting error.) But on the other hand, I need to do some sort of reasoning or learning to construct (or refine into coherence) my prior in the first place. I don’t currently know how to do this latter thing while avoiding the former thing.
Perhaps the most elegant thing would be to have an account of some particular kind of reasoning or learning that could be used to construct/refine priors, where we’d be able to show that it doesn’t run into the same issues as the ones we run into when we modify our beliefs in response to observations like this or in response to proofs like this. Or maybe it’s EDT that needs to change, or the idea of making decisions based on priors.
Misc points:
I think open-minded updatelessness is an attempt at something like this, where “awareness growth” is a separate type of operation from normal updating, and only awareness growth is allowed to modify the prior. I find it hard to evaluate because I don’t know how awareness growth is supposed to mechanically work.
I don’t think FDT obviously does better than EDT here.
You might hope that the question of “what’s the prior?” might turn out to not be so important, as long as we eventually receive enough evidence about what subsets of universes we have the most impact in. However, it looks to me like the prior may be extremely important, because it seems plausible to me that EDT recommends doing ECL from the perspective of the prior. If so, what values we benefit in the future may be primarily determined by their frequency in our prior.
I’m not sure though. Maybe there’s some minimum level of coherence that is sufficient to motivate reasoning from an ex-ante perspective. This report from Martin Soto tries to construct somewhat coherent and complete priors from logical inductors run for a finite amount of time, and then do updatelessness on the basis of them. That seems like a promising place to look for further insights.
As described here, EDT agents will act in very strange ways if they also update their prior upon observing evidence.
I’ve seen this post cited a few times now in support of the idea that “EDT double counts”. But the betting behaviour in Calculator Bet seems just straightforwardly and intuitively correct to me (from an EDT perspective).
Here’s what seems like an analogous case: 0.5 prior on X being true. I look at a calculator that is 0.99 reliable. After I make my observation, if my observation is veridical (e.g., seeing “X is true” when X is actually true) then with 0.99 chance someone clones me N times, and with 0.01 chance nothing happens; if my observation is deceptive (e.g., seeing “X is true” when X is actually false) then with 0.01 chance someone clones me N times, and with 0.99 chance nothing happens.
In this case, like in the original Calculator Bet, I should “bet at 9999:1 odds”, but in both cases that’s because seeing “X is true” carries two real updates: (i) X is probably true, and (ii) there are probably many clones of me. I think it’s the fact that the odds are 0.99 in both parts that makes this feel like pathological “double counting”. If I make the calculator reliability Q and the subsequent duplication chance P, the behaviour here just seems unproblematically correct. (The fact that the 0.99 chance of cloning comes from i.i.d. experiments + law of large numbers in the Calculator Bet case and from a single randomisation device here should not matter.)
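Here’s a quick toy computation of those copy-weighted betting odds in my parameterized version (Q, P, N and the symmetric cloning setup are my own stand-ins; the exact 9999:1 figure comes from the original post’s setup, which may differ):

```python
# Break-even betting odds when each copy makes the same bet and EDT sums
# payoffs over copies. Q = calculator reliability, P = chance of being cloned
# into N copies given a veridical reading (1 - P given a deceptive one).
# Parameter values below are illustrative.

def break_even_odds(Q: float, P: float, N: int) -> float:
    w_veridical = Q * (P * N + (1 - P) * 1)         # expected copies if X is true
    w_deceptive = (1 - Q) * ((1 - P) * N + P * 1)   # expected copies if X is false
    return w_veridical / w_deceptive

# Both updates pulling the same way (Q = P = 0.99): odds on the order of 10^4:1.
print(break_even_odds(Q=0.99, P=0.99, N=10**6))

# Decoupling the two parameters: the behaviour changes smoothly and sensibly.
print(break_even_odds(Q=0.9, P=0.5, N=10**6))       # ~9:1, driven by Q alone
```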
There is a separate point that EDT+(minimal reference class-SSA) is in general compatible with the ex ante/”UDT”-optimal policy and EDT+(anything that isn’t mrcSSA) is in general Dutch bookable. But that’s not what is going on here since mrcSSA does not mean “be updateless”/”use your prior in EU calculations”, it just means don’t anthropically update your prior based on how many observers are in different worlds.
[In the Calculator Bet case, if I additionally performed an SIA-update after seeing “X is true”, then I would be betting at 999999:1 odds, which would be very strange! But this is just the well-understood Dutch book against EDT+SIA from Briggs (2010).]
I don’t think that the updateful EDT behaviour in e.g. the calculator example is obviously problematic. Certainly not clearly worse than the alternative of just optimizing relative to the prior (cf. Anthony’s post).
I do think that the buy-and-copy behaviour from your example is bad, but it is bad because of how EDT manages the news, not because of the combination of EDT and anthropic updating per se. A counterfactual theory like FDT or TDT doesn’t manage the news and so doesn’t use the buy-and-copy strategy, AIUI. (Maybe similar cases could be constructed for counterfactual theories, though?)
To me, 1 and 2 suggest that we should consider a counterfactual theory (without going updateless relative to the prior), not just EDT + updateless relative to the prior.
In any case, I’m sympathetic to the ideal reflection principle, i.e., we should optimize our subjective expectation of the ideal agent’s expected utilities. So if you think the ideal agent is updateless relative to the prior, then you should make decisions based on your expectation of what your expectations relative to this prior would be, if you could compute them. (This includes the expected value of policies for handling reasoning/logical learning.) Of course it’s very unclear how to form such beliefs, but that doesn’t seem like a problem specific to updatelessness (i.e., it’s also unclear how to form beliefs about an updateful ideal agent).
Are there cases where EDT manages the news for reasons other than anthropic updating? I’m not aware of any, and if not, then this is exactly because of the interaction of EDT and anthropic updating.
To me, “managing the news” is just a description of how EDT works in general (i.e., EDT is definitionally about picking the action that gives us the best news). And I think EDT is problematic for that fundamental reason. I just think that Lukas’ case makes the silliness of news-management particularly vivid. (Other cases which arguably do so are XOR blackmail and this case.)
(I do think that Lukas’ case gets some of its counterintuitiveness from the fact that SIA has us weight worlds in proportion to how many copies of us they contain. But, again, that is just a counterintuitive property of SIA in general, which I think we ought to evaluate independently of how it interacts with decision theory.)
If “managing the news” just means “making a decision in situation X such that you are glad to hear the news that you made that decision in situation X,” then I agree that’s a description of EDT. I think it’s a priori reasonable to manage news about what you decide to do, so I don’t see this as a fundamental reason that EDT is problematic. I usually associate the phrase with various intuitive mistakes that EDT might make, and then I want to discuss concrete cases (like Lukas’) in which it appears an agent did something wrong.
These don’t seem like esoteric failures where it’s plausible that the hypothetical isn’t fair, or where the dynamic inconsistency is plausibly explained as being caused by changing values, or where there’s a fundamental tradeoff between the harms and value of new information. The errors happen in very mundane situations, and to me they look like dumb and surely-avoidable accounting errors.
More precisely: I think these failures probably enable Dutch books that CDT isn’t susceptible to.
So it’s not just that insufficient updatelessness fails to capture some potential value that you could have gotten if you were more updateless. It’s that it’s actively worse than CDT in some cases.
And that’s a large part of why I now feel more unhappy about EDT. I feel like I have a better sense of what my beliefs are than what my prior is. If EDT requires me to act according to my prior (and will lead to me making stupid decisions if I instead act according to changing beliefs) then I’m not sure exactly how to do that.
In your example, you know T1 and your counterpart knows T2. You see your behavior as correlated with the behavior of your counterpart. Under these conditions, it seems like T1 can’t possibly be so fundamental that you need it in order to do EDT-style reasoning (otherwise your counterpart couldn’t). So even if you grant that you need some beliefs to get EDT off the ground, it seems like those beliefs must be in the intersection of T1 and T2, in which case you wouldn’t run into this problem.
I think there are a lot of pretty basic open questions about how logically bounded minds work, especially if we want them to be philosophically principled all the way down. That’s confusing enough that it’s totally plausible that you’ll come out of it with some TDT-like story, where causality is a basic cognitive ingredient that makes the bootstrapping work at all. My guess is that in the end causality will just be a helpful language for talking about certain kinds of independence relationships, rather than a fundamental cognitive building block or something with direct normative implications, but who knows.
(Note for observers: I’m not expecting us to resolve any of those questions as part of AI safety.)
So my current impression is basically that you’re optimistic about something similar to this:
Perhaps the most elegant thing would be to have an account of some particular kind of reasoning or learning that could be used to construct/refine priors, where we’d be able to show that it doesn’t run into the same issues as the ones we run into when we modify our beliefs in response to observations like this or in response to proofs like this.
And that your argument here is an argument for why it won’t be possible to make double-update arguments about this more narrow type of beliefs or reasoning. (Because it’s so fundamental that if it differs between agents, they can’t be correlated with each other.)
That argument seems plausible but not very robust to me at the moment. I mostly don’t have a great sense of what the “some beliefs to get [otherwise updateless] EDT off the ground”-paradigm looks like. Maybe I’ll look into it more.
I might be missing something, but the situation doesn’t seem that bad to me.
I tentatively think that we should bite the bullet and be logically and empirically updateless EDT agents. Admittedly, it’s rough that I don’t really know what my prior looks like, but I think we can deal with that.
The way I imagine it is that following ECL comes in two separate stages. First, we need to take the best actions under ECL that we can as confused mortals. Second, we need to decide how to use the resources of the universe once we are surrounded by superintelligent AI advisors and have had time to do a Long Reflection.
I think during the mortal phase, it’s okay that we don’t understand the prior very well. I think the only axioms we need to be in place before we become logically updateless are “it’s generally good if agents within the prior try to pursue their goals” and “we should try to follow ECL”. After these are in place, I think the best overall policy is for each agent to look around in their world and try to figure out how to get more optionality for themselves and the ECL coalition, while behaving in a way that feels like it should have robustly good correlated actions across the multiverse.
In our case, we can increase optionality by trying to make sure the stars don’t pointlessly burn out, that agents following ECL eventually end up in control of a big chunk of the world, and that there are good processes for growth and reflection. Meanwhile, it’s probably good to do less lying and backstabbing, and to be merciful towards the weak, because these seem somewhat likely to have correlations with other actions across the multiverse that make it more likely that ECL agents can follow their goals.
Occasionally, there is a toy example like Sleeping Beauty or the ones you listed in your post where we can explicitly reason about EDT, but usually we just follow the heuristics I described above. I don’t think we have a right to hope for much more clarity as mortals, and thankfully none of this requires knowing what the prior looks like.
In the second stage, when we have stabilized the situation, figured out some good growth process, gotten superintelligent advisors, and sent out probes to put out the fires of the distant stars, we still need to decide what we do with the resources. It seems plausible that, as you say, “what values we benefit in the future may be primarily determined by their frequency in our prior”.
But figuring out the frequency of different beings in our prior doesn’t seem to me an especially intractable problem compared to the already scary question of what values we want to terminally pursue on reflection.
One idea I like for figuring out the prior is building a bigger and bigger coalition in an onion-like manner. We first run simulations of nearby quantum branches and pull out the cooperative people and AIs from there. We learn to live together, share our different perspectives, then we decide on a next broader distribution of quantum branches that we want to trade with, and so on.
I think logical updatelessness commits us to trying to trade with the logically counter-factual beings too at the end. That requires finding a Schelling-point distribution of logically counter-factual universes that the other universes will also agree on. But by the time we get there, we will have already reflected on a lot of things, and learned to live together with the dinosaur-people we pulled out of simulations. I feel that this giant coalition of humans, AIs and dinosaur-people, all using superintelligent advisors, can pool their ideas and figure out how they imagine the logical prior and who they should trade with next. This doesn’t feel harder to me than other questions about the meaning of life that we will need to deal with.
In many ways I like the baseline strategy of “ignore decision theory, act in ways that heuristically seem like they gather option-value, figure out all the hard stuff with the help of superintelligence”.
I guess your proposal is similar to that except there’s an addition of “we have a hunch that something like ECL works and that this means we should be a bit more cooperative, so we’ll be a bit more cooperative”.
But for some purposes, it does seem useful to know the implications of decision theory. A few examples (some more important than others):
Are there categories of information that people can plausibly be harmed by learning, that we should try to avoid, or will it clearly be correct (on reflection) to be retroactively updateless?
Understand more detailed implications of ECL, like who we should cooperate with and how much.
Should we do some crazy DT-motivated pseudo-alignment scheme like this or this?
For the purposes of answering questions like this, I’m interested in whether I basically buy EDT (and with what kind of updatelessness) or if the real answer to DT is going to be pretty different.
One concern here is that EDT is maybe less of a coherent or appropriate decision theory than I thought. I don’t really think your plan addresses that. Like, your plan talks about how we’ll have lots of resources to think about what our prior should be, but doesn’t really address this part:
As someone who is sympathetic to EDT, I have arguments on the one hand that I shouldn’t update my prior on the basis of observations or proofs. (Indeed — arguments that suggest that such updates would lead to seemingly dumb and surely-avoidable accounting error.) But on the other hand, I need to do some sort of reasoning or learning to construct (or refine into coherence) my prior in the first place. I don’t currently know how to do this latter thing while avoiding the former thing.
I don’t know, I still don’t see this as that bad of a sign for EDT. Yes, in the far future you will need to trade with people in your confusing and incoherent prior over logics.
But I think this is basically equivalent to handling the “what if you are in a simulation where the simulators intentionally messed with your brain to believe false things about certain logical statements” question. Admittedly, it’s a hard question, but I think everyone will need to deal with something like this.
Maybe I’m confused about how much you believe that my actual life history matters. I think in the case of empirical updatelessness, my life history doesn’t really matter—I will eventually try to trade with people in proportion to something like their measure in the Solomonoff prior, and not with worlds where Austria and Australia are the same country, even though I was uncertain about this empirical fact when I was 5. (Do you agree with this, or do you think life history also matters for empirically updateless trade?)
I expect that logical updatelessness is similar—I will try to use some elegant construction like the Solomonoff prior to put a weight on different logical counterfactuals, and it won’t matter how my prior was constructed in my childhood.
Maybe I’m confused about how much you believe that my actual life history matters. I think in the case of empirical updatelessness, my life history doesn’t really matter—I will eventually try to trade with people in proportion to something like their measure in the Solomonoff prior, and not with worlds where Austria and Australia are the same country, even though I was uncertain about this empirical fact when I was 5. (Do you agree with this, or do you think life history also matters for empirically updateless trade?)
If you were 100% empirically updateless and 0% logically updateless, then I think your life history wouldn’t matter except insofar as it led you to learning different logical facts. Insofar as you would eventually reach ‘logical maturity’ regardless of your life history, and learn some vast and similar swaths of logical facts, then yeah, eventually your life history wouldn’t matter.
I expect that logical updatelessness is similar—I will try to use some elegant construction like the Solomonoff prior to put a weight on different logical counterfactuals, and it won’t matter how my prior was constructed in my childhood.
My current understanding is that this requires updating on a bunch of logical arguments (about what that elegant construction should be and what it implies) and can therefore get you Dutch-booked.
Yes, I agree the elegant construction will need to rely on some logical arguments, but I think that’s not that bad.
The way I imagine trade to work is that I propose a distribution of chips among different universes which I would be happy to trade under. For example “every universe in the quantum multiverse each getting chips proportional to what the Born-rule prescribes” is a system I would be happy to trade under. Then I can see which other universes are willing to trade under this chip-distribution, and then we trade with each other using our chips.
I think this extends to trade among logically counter-factual worlds, at least in some toy examples. If an important historical event turned on someone making a bet on the billionth digit of pi, then the logically counterfactual worlds which were identical except that the billionth digit of pi was different can probably make a trade deal among each other, because they can all imagine this narrow logically counter-factual distribution, and they all recognize it as a Schelling-point.
I think we can probably go broader than that, and figure out an elegant distribution of chips among logical universes which a) we find fair in the sense that we value the resources of the other universes in proportion to their chips (just like in the Born-rule case), so we are happy to trade under this distribution and b) we think that many other universes in the distribution will share enough core features of our logic to recognize this distribution of chips as an elegant Schelling-point, and something they consider fair under their values. Then we trade with everyone who is willing to trade. Once we are done, we try to construct a broader coalition.
I think it’s likely that we won’t be able to expand the trade coalitions enough to cover all possible logic systems (what would that even mean?) Maybe we will never be able to deal with the guys living in the 1+1=3 universe, because we can’t imagine them, they can’t imagine us, and there is no distribution we both recognize as elegant and fair. In that case, we will leave some value on the table by not being able to trade with each other, but that’s life. I don’t think this means that we need to throw out this decision theory—it was nice enough if we got beneficial trades in as broad circles as we could.
---
Most concretely, I don’t see how you get Dutch-booked here. I tentatively think that any betting that a malicious Dutchman can come up with to get money out of me will be based on simple enough logical counter-factuals that I can form a mutually recognized distribution of logic systems among the affected parties and we will be fine.
Can you give a concrete example of how someone can pump money out of me?
Thinking more about it, I think I don’t stand by my original reply. It seems possible to have some theorems whose result I currently feel 50-50 about, but which are important enough that I’m at least uncertain if I will ever be able to build a broad enough coalition of logically counter-factual beings that include people where the opposite of the theorem is true.
I think the same problem arises for some empirical questions too—T1 and T2 can be questions like “is iron’s atomic number 26 or 27?” I would have been roughly 50-50 before looking it up, but I’m uncertain if I should try to cooperate with people living in worlds where the atomic number of iron is 27 - I don’t know if those worlds are compatible with life.
However, thinking through these examples, I think I now reject the premise that updateful EDT bets wrongly in your example of the two theorems or in Paul’s original calculator example.
I think in both cases the decision-correlational reference class you should take into account is not just you learning T1 is true and you learning T2 is true within this particular experiment. It’s every instance across the multiverse where beings similar to you need to make bets about questions they have no clue about. Taking all these correlations into account, the correct thing to do is to bet at 50-50.
(As an example: when I’m betting on the atomic number of iron, I shouldn’t think of myself as cooperating with versions of myself who live in a world where iron has 27 protons. Those worlds might not exist. But I’m cooperating with instances where the game-master decided to ask if iron has 25 or 26 protons.)
Separately, at the end of the day, I still want to do acausal trade with a broad coalition of worlds which might or might not include ones where iron has 27 protons and the T1 theorem is false. But I now think that this is a separate question, and updatelessness might not be required in our mortal life.
I think the same problem arises for some empirical questions too—T1 and T2 can be questions like “is iron’s atomic number 26 or 27?” I would have been roughly 50-50 before looking it up, but I’m uncertain if I should try to cooperate with people living in worlds where the atomic number of iron is 27 - I don’t know if those worlds are compatible with life.
Minor: The question about whether those worlds are compatible with life seems like a logical rather than empirical question to me. So this still seems like an issue with logical updatelessness rather than empirical updatelessness.
As an example: when I’m betting on the atomic number of iron, I shouldn’t think of myself as cooperating with versions of myself who live in a world where iron has 27 protons. Those worlds might not exist. But I’m cooperating with instances where the game-master decided to ask if iron has 25 or 26 protons.
As in: If you’re in a counterfactual mugging where Omega says they’d reward you in a world where iron has 27 protons if you pay in this world, then you pay, because you expect there to be a bunch of Omegas elsewhere doing other logical counterfactual muggings. In roughly half of those cases, someone is about to get paid by Omega if their impossible counterpart pays. And your action provides evidence that their impossible counterpart pays and that Omega gives them the reward.
And the same structure applies in the calculator example and the “conjunction of two theorems” example, because you’re correlated with a bunch of distant people about whose situation you have so little information that your epistemic position is “ex-ante” relative to their dilemma. So even if you’re updateful, you bet to optimize ex-ante utility in your case, to get evidence that they bet to optimize ex-ante utility in their case.
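A toy simulation of that population argument (the payoffs and the 50/50 distribution over digits are my own illustrative choices):

```python
# Population of logical counterfactual muggings: each Omega picks a digit you
# know nothing about. If it's odd, you're asked to pay $1; if it's even, you
# receive $10 iff your (logically impossible) odd-digit counterpart would
# have paid. Assuming your policy settles what all counterparts do, "always
# pay" wins in aggregate. All numbers are illustrative.

import random

def average_payoff(pay_policy: bool, trials: int = 100_000) -> float:
    total = 0.0
    for _ in range(trials):
        digit_is_odd = random.random() < 0.5   # 50/50 from your ignorance
        if digit_is_odd:
            total += -1.0 if pay_policy else 0.0   # you're the payer here
        else:
            total += 10.0 if pay_policy else 0.0   # counterpart pays, you collect
    return total / trials

print(average_payoff(True))    # ~ +4.5 per mugging
print(average_payoff(False))   # 0.0
```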
Hm, maybe that’s right.
Doesn’t that feel really unsatisfying though? I still feel like updateful EDT recommends wrong actions in important test cases. It’s just a contingent fact that most dilemmas like this will be small-scale in a large-scale universe, and that EDT’s recommendation to act as if you double-update will be swamped by not wanting to get evidence that other people elsewhere double-update. And there’s always going to be that force pushing towards recommending double-updating, so if you ever get evidence that a decision is high-stakes enough and universal enough throughout the universe, EDT may well recommend that you make an exception for it and do a proper double-update on it, which seems bad.
I think that the situation needs to be quite extreme for my argument not to work. I think it’s quite likely I will never get to the point where I think that a decision is particularly high-stakes or universal in the grand scheme of things. I think it’s plausible that until negentropy runs out, I will always think that there is an even larger and more complicated distribution of logical counter-factual worlds out there that I haven’t explored yet, compared to which I’m only a tiny speck. So I think plausibly I will always think that I should bet 50-50 when I know nothing about something, because that’s the right policy overall.
I agree though that it’s not entirely impossible that I will come to a point where I no longer have uncertainty about what’s outside the distribution I already explored; I believe that my decision is very high stakes and doesn’t correlate with many other different decisions in my logical distribution; and I believe that worlds where T1 is false are so inconceivable that they can’t be part of my trade coalition of logically counter-factual worlds.
But I think that’s also the point where normal probabilities and betting rules entirely break down for me.
When I make a bet about a 1/4-probability event, I imagine that I’m making decisions for four subagents, representing beliefs in the four different outcomes. Normally, when I bet on coinflips and other mundane questions, these four subagents love each other, and they are utilitarian about maximizing the sum of their resources. So they are okay with making a bet on one outcome, which means transferring the money of three subagents to the fourth.
But if I believe that once I learn that T1 is true, I will consider it inconceivable that T1-false worlds can ever be part of my coalition, that’s a different situation. In that case, I think my T1-true and T1-false subagents don’t love each other and are indifferent to each other’s well-being. If I’m offered a bet, that’s equivalent to three subagents transferring their wealth to the fourth, and they will refuse to do that. So if I’m only offered one possible bet (betting on the conjunction of T1 and T2), I think I will bet one-fourth of my wealth on it, independently of the odds.
I agree this sounds a bit like an epicycle, but belief-representing subagents negotiating in a moral parliament is an important part of my world-view for other reasons too (I will soon send a doc about this to you), so this solution feels quite natural to me. And it’s not like I otherwise have great intuitions about what to do at the point of meta-logical near-omniscience where I am able to tell that my current decision is high-stakes within the entire multiverse of logically counterfactual worlds.
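A minimal sketch of that subagent contrast, under my own formalization (the all-or-nothing utilitarian rule and the winning subagent’s certainty are simplifying assumptions):

```python
# Four subagents each hold 1/4 of the wealth, one per truth assignment of
# (T1, T2). A bet on the conjunction pays `odds` per unit staked if T1&T2
# holds and loses the stake otherwise.

def utilitarian_stake(odds: float, p_win: float = 0.25) -> float:
    # Subagents pool utility: stake everything iff the bet is +EV in aggregate.
    ev_per_unit = p_win * odds - (1 - p_win)
    return 1.0 if ev_per_unit > 0 else 0.0

def selfish_stake(odds: float) -> float:
    # Subagents are indifferent to each other: only the T1&T2 subagent (sure
    # it wins) stakes its quarter, so the total stake is 1/4 of the wealth,
    # independently of the odds.
    return 0.25

for odds in (1.0, 3.0, 10.0):
    print(odds, utilitarian_stake(odds), selfish_stake(odds))
# Utilitarian subagents bet all-or-nothing depending on the odds;
# selfish subagents always stake exactly 1/4.
```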
These don’t seem like esoteric failures where it’s plausible that the hypothetical isn’t fair, or where the dynamic inconsistency is plausibly explained as being caused by changing values, or where there’s a fundamental tradeoff between the harms and value of new information. The errors happen in very mundane situations
Setting aside whether these count as “errors” (I basically agree with Jesse’s comment), I’m not sure why you think the above. Paul’s case involves this assumption: “We live in a very big universe where many copies of me all face the exact same decision. This seems plausible for a variety of reasons; the best one is accepting an interpretation of quantum mechanics without collapse (a popular view).”
In one sense this is mundane, since if you buy any of the large world theories, you’d think we’ve always been in such a large world. Maybe that’s what you mean. But it’s very non-mundane in the sense that these large worlds are still clearly out-of-distribution for our intuitions. We haven’t directly experienced the consequences of lots of copies of ourselves doing things. This is one big reason I’m pretty suspicious of my pre-theoretic intuitions about Paul’s case.
Here’s a story for how we could get lots of AI help with AI safety research even if schemers are somewhat common and diffuse control doesn’t work to get them to help us:
Let’s say a company successfully solves control for “high-stakes”/”concentrated” threats, meaning that AIs can safely be deployed without a large risk of them hacking their own datacenters or exfiltrating or causing any similar catastrophic action. (At a given level of capability.)
However, let’s say that control fails to solve the problem of getting the AIs to try hard at doing good AI safety research. When asked to do research, perhaps the schemers are sloppy, or perhaps they are actively trying to mislead us.
Still, the AI company can now safely train and deploy lots of different AIs without fear of them exfiltrating or hacking the lab. Which gives the AI company lots of “shots on goal” of training AIs that generalize to earnestly helping us with automated alignment research. Maybe they develop a zoo with like 20 separate countries of geniuses in data centres, including:
Maximally instruction-following AIs, instructed to help us in various ways.
Consequentialist AIs trained to care about human values.
Consequentialist AIs trained to play-the-training-game/scheme on behalf of human values. (So that they don’t get replaced by other schemers. Similar to inoculation prompting.)
AIs trained to just love alignment research (or some other project) in their heart. They’re just like a curious scientist who is obsessed with solving alignment.
AIs designed to be maximally amenable to making deals, and then offered some great payment in exchange for AI safety research progress.
AI intentionally trained to be reward-seekers — and told that they will get maximum reward if they’re eventually judged to have done excellent AI safety research.
...
Perhaps most of these end up scheming against the AI company — but some end up genuinely aligned.
The company probably can’t initially tell apart the schemers from the aligned AIs, or indeed, tell which of the projects are succeeding at a day-to-day basis. (It’s just too hard to evaluate which research directions are genuinely promising vs. a waste of time.) But they run them all in parallel, and after several months of massively sped-up work, it’s likely much easier to adjudicate whose solutions makes sense to adopt.
This is similar to how, in well-functioning scientific fields, it’s not typically possible to get consensus on what speculative research directions are most promising. But once a research project or whole research agenda is completed, there are normally ways in which the value-add can be demonstrated.
And maybe all that automated labor is enough to save the day.
Downsides with this approach:
If you split your compute across 20 different types of AIs, then your favorite type of AI is going to only have 5% as much compute available to it as it would’ve had if you concentrated your bets. This might be costly.
You’re increasing the risk of deploying at least one schemer. So you’re increasing the risk of high-stakes catastrophes like AIs hacking your data center. (Which is why it’s important that concentrated/high-stakes control is going well enough that the baseline risk is low.)
(Note that you could use a similar method of “wide variety of diverse AIs” as monitors, which could be very helpful for high-stakes control.)
On the bright side, you’re probably increasing your chances of catching at least one AI red-handed, which could be useful for research or for gathering legible evidence that risk is high.
With such evidence in hand, maybe you could convince others to coordinate to slow down, so that you don’t have to pursue the crazy plan of “deploy dozens of different AI models, because any given model is pretty likely to be scheming against you”.
Overall, thinking about this makes me:
Somewhat more bullish on the value of high-stakes/concentrated control research.
More bullish on developing somewhat dubious ideas for how to prevent scheming. Even if an idea might increase scheming risk in expectation (!) if it just has a bunch of uncertainty/randomness that’s independent from other approaches, then it could still add value.
In discussions about whether AI companies should write their model specs to say X or Y, I’m more inclined to say that, at least internally, they should hedge their bets by deploying models of both the X variant and the Y variant.
(Though depending on what the X/Y disagreement is, this might not resolve the problem. E.g. if the concern is that the X variant will be too easy to misuse by AI company insiders, then having the X model be available as one model might be almost as bad as it having it be the AI company’s only model.)
I explore similar considerations here in The Case for Mixed Deployment (5 min read), key takeaways were:
Invest in second-best techniques (basically what you say about “developing somewhat dubious ideas for how to prevent scheming”)
Modular systems—i.e. architectures where you can “toggle” some variable at inference-time to give you more variants, e.g. gradient routing
Prioritize late-stage techniques, i.e. techniques which apply late in AI lifecycle (e.g. post-training, prompting, activation steering, scaffolding) and deprioritise techniques apply early in the AI lifecycle (e.g. architecture design and pretraining). because this makes it cheaper to develop many variants
However, I think that (contra your proposal) most of the oomph comes from the AIs monitoring and cross-examining each others work, rather than running them in parallel. That is, I disagree with “If you split your compute across 20 different types of AIs, then your favorite type of AI is going to only have 5% as much compute available to it as it would’ve had if you concentrated your bets.” This is because I think we can run things like debate between all the variants.
One upshot of this difference is that I expect coordination between labs to matter significantly: if lab A starts with 100% schemers and lab B starts with 5% schemers, then we’ll elicit useful work from the AIs if the labs cross-examine each other’s AI research with their own AI variants.
Oh nice, thanks!
Yeah, I think the case for diverse monitors to achieve high-stakes control is actually a lot more straightforward. It’s easy to implement a monitor in a way such that it has very little ability to autonomously cause bad things to happen, so basically the only cost is the extra compute it takes to train and run it. And in the high-stakes/concentrated failures regime, humans will typically be able to recognize attempted catastrophes once pointed out to them.
I agree that debate-style approaches seem good. I do think it seems a lot more feasible to judge such debates once the AIs have already accomplished a lot of work, vs. early on in the research when AIs are making taste-driven judgments about direction, so it complements nicely with letting AIs do a lot of work in parallel also.
This seems like quite a difficult task, especially since we (or some other trusted agent) need to read the outputs of these AIs to get any value from their safety research. Schemers could self-exfiltrate by manipulating us somehow.
Maybe, but the schemers could optimize their research for looking-good-to-us, and it might be hard to distinguish this from actually good work. That’s a disanalogy with well-functioning scientific fields: scientists don’t deliberately pick bad research directions and try hard to make them seem good.
I think self-exfiltration via manipulation seems pretty hard. I think we’re likely to have transformatively useful systems that can’t do that, for some amount of time. (Especially since there’s no real reason to train them to be good at manipulation, though of course they might generalize from other stuff.) I agree people should definitely be thinking about it as a potential problem and try to mitigate and estimate the risk.
Some scientists do optimize for (something like) “impressiveness” of their work, regardless of whether it’s good or bad. It’s true that they don’t intentionally compromise impressiveness in order to make the work more misleading. That said, if some models optimize for impressiveness, and some models compromise impressiveness to make their work more misleading, then I guess the humans should be somewhat more likely to use the more impressive work? So maybe that limits the degree of influence that intentionally misleading work could I have.
I am definitely pretty worried about humans ability to judge what work is good or bad. In addition to the disanalogy you mention, I’d also highlight (i) non-experts typically struggle to judge which expert is right in a dispute, and if humans are slow & dumb enough relative to the AIs, humans might be analogous to non-experts, and (ii) it’s plausible to me that human science benefits a lot from the majority of scientists (and maybe especially the most capable scientists?) having at least a weak preference for truth, e.g. because it’s easier to recognize a few bad results when you can lean on the presumed veracity of most established results.
if you ask each variant to review the research of the other variants, then the schemers need to optimise their research for looking good to each variant. but the optimistic assumption is that at least one variant is an equally capable non-schemer.
Not necessarily. The nice AIs also need to be able to win a debate against the schemers, as judged by humans. It’s not enough for the variants to be able to recognize poor research if they can’t show their work in an unrefutable (by the schemers) way.
Here’s my current picture of EDT and UDT.
In situations where EDT agents have many copies or near-copies, an EDT agent operates by imagining that it simultaneously controls the decisions of all those copies. This works very elegantly as long as it optimizes with respect to its prior and (upon learning new information) just changes its beliefs about what people in the prior it can control the actions of. (I.e., when it sees a blue sky, it shouldn’t change its prior to exclude worlds without blue skies, but it should make its next decision to optimize argmax_U[EV_prior(U|”an agent like me who has seen a blue sky would take action A”)]
As described here, EDT agents will act in very strange ways if they also update their prior upon observing evidence. As described here, EDT agents will act similarly strangely if they update their logical priors upon deriving a proof. (Though the logical situation is somewhat less bad than the empirical situation.) These don’t seem like esoteric failures where it’s plausible that the hypothetical isn’t fair, or where the dynamic inconsistency is plausibly explained as being caused by changing values, or where there’s a fundamental tradeoff between the harms and value of new information. The errors happen in very mundane situations, and to me they look like dumb and surely-avoidable accounting errors.
Unfortunately, the most obvious solution to the problem is something like “stick with your prior and never change it”. And that’s not really available as a solution to a bounded agent like me.
I don’t have a complete and coherent prior of the world. I can procedurally generate beliefs when asked about them, and you could try to construct a complete set of beliefs (perhaps to be used as priors) out of this (e.g. you could say that I “currently believe X with probability p” if I would respond with probability p upon being asked about X and given 1 minute to think). But any such set of beliefs would be very contradictory and incoherent. And I suspect that EDT might not look as good if the prior it starts with is very contradictory and incoherent.[1]
This creates a bit of a puzzle. As someone who is sympathetic to EDT, I have arguments on the one hand that I shouldn’t update my prior on the basis of observations or proofs. (Indeed — arguments that suggest that such updates would lead to seemingly dumb and surely-avoidable accounting error.) But on the other hand, I need to do some sort of reasoning or learning to construct (or refine into coherence) my prior in the first place. I don’t currently know how to do this latter thing while avoiding the former thing.
Perhaps the most elegant thing would be to have an account of some particular kind of reasoning or learning that could be used to construct/refine priors, where we’d be able to show that it doesn’t run into the same issues as the ones we run into when we modify our beliefs in response to observations like this or in response to proofs like this. Or maybe it’s EDT that needs to change, or the idea of making decisions based on priors.
Misc points:
I think open-minded updatelessness is an attempt at something like this, where “awareness growth” is a separate type of operation from normal updating, and only awareness growth is allowed to modify the prior. I find it hard to evaluate because I don’t know how awareness growth is supposed to mechanically work.
I don’t think FDT obviously does better than EDT here.
You might hope that the question of “what’s the prior?” might turn out to not be so important, as long as we eventually receive enough evidence about what subsets of universes we have the most impact in. However, it looks to me like the prior may be extremely important, because it seems plausible to me that EDT recommends doing ECL from the perspective of the prior. If so, what values we benefit in the future may be primarily determined by their frequency in our prior.
I’m not sure though. Maybe there’s some minimum level of coherence that is sufficient to motivate reasoning from an ex-ante perspective. This report from Martin Soto tries to construct somewhat coherent and complete priors from logical inductors run for a finite amount of time, and then do updatelessness on the basis of them. That seems like a promising place to look for further insights.
I’ve seen this post cited a few times now in support of the idea that “EDT double counts”. But the betting behaviour in Calculator Bet seems just straightforwardly and intuitively correct to me (from an EDT perspective).
Here’s what seems like an analogous case: 0.5 prior on X being true. I look at a calculator that is 0.99 reliable. After I make my observation, if my observation is veridical (e.g., seeing “X is true” when X is actually true) then with 0.99 chance someone clones me N times, and with 0.01 chance nothing happens; if my observation is deceptive (e.g., seeing “X is true” when X is actually false) then with 0.01 chance someone clones me N times, and with 0.99 chance nothing happens.
In this case, like in the original Calculator Bet, I should “bet at 9999:1 odds”, but in both cases that’s because seeing “X is true” carries two real updates: (i) X is probably true, and (ii) there are probably many clones of me. I think it’s the fact that the odds are 0.99 in both parts that makes this feel like pathological “double counting”. If I make the calculator reliability Q and the subsequent duplication chance P, the behaviour here just seems unproblematically correct. (The fact that the 0.99 chance of cloning comes from i.i.d. experiments + law of large numbers in the Calculator Bet case and from a single randomisation device here should not matter.)
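To make the accounting explicit, here’s a minimal sketch (my own construction, with made-up names, not anything from the original post) of the weighting in the cloning variant above. With Q = P = 0.99 the two updates compound into odds on the order of the quoted 9999:1 (the exact figure depends on how the original case is parameterized), while setting P = 0.5, so that cloning carries no news about X, recovers the plain 99:1 calculator update:

```python
# Hypothetical sketch of the cloning variant above. Q = calculator
# reliability, P = chance of being cloned N times given a veridical
# observation (the cloning chance is 1 - P given a deceptive one).
# Prior on X is 0.5. EDT with SIA-style anthropics weights each world by
# (its probability) x (the number of copies of me seeing "X is true" in it).

def betting_odds(Q: float, P: float, N: int) -> float:
    w_true = 0.5 * Q * (P * N + (1 - P) * 1)         # X true, observation veridical
    w_false = 0.5 * (1 - Q) * ((1 - P) * N + P * 1)  # X false, observation deceptive
    return w_true / w_false

print(betting_odds(0.99, 0.99, 10**6))  # ~9800: both updates compound (~10^4:1)
print(betting_odds(0.99, 0.50, 10**6))  # ~99: cloning carries no news about X
```

Decoupling Q and P like this is the point: the pathological feel of the original case comes from the two 0.99s happening to coincide, not from any double counting.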
There is a separate point that EDT+(minimal reference class-SSA) is in general compatible with the ex ante/”UDT”-optimal policy and EDT+(anything that isn’t mrcSSA) is in general Dutch bookable. But that’s not what is going on here since mrcSSA does not mean “be updateless”/”use your prior in EU calculations”, it just means don’t anthropically update your prior based on how many observers are in different worlds.
[In the Calculator Bet case, if I additionally performed an SIA-update after seeing “X is true”, then I would be betting at 999999:1 odds, which would be very strange! But this is just the well-understood Dutch book against EDT+SIA from Briggs (2010).]
I don’t think that the updateful EDT behaviour in e.g. the calculator example is obviously problematic. Certainly not clearly worse than the alternative of just optimizing relative to the prior (cf. Anthony’s post).
I do think that the buy-and-copy behaviour from your example is bad, but it is bad because of how EDT manages the news, not because of the combination of EDT and anthropic updating per se. A counterfactual theory like FDT or TDT doesn’t manage the news and so doesn’t use the buy-and-copy strategy, AIUI. (Maybe similar cases could be constructed for counterfactual theories, though?)
To me, 1 and 2 suggest that we should consider a counterfactual theory (without going updateless relative to the prior), not just EDT + updateless relative to the prior.
In any case, I’m sympathetic to the ideal reflection principle, i.e., we should optimize our subjective expectation of the ideal agent’s expected utilities. So if you think the ideal agent is updateless relative to the prior, then you should make decisions based on your current expectation of the prior-relative expected utilities, if you could compute them. (This includes the expected value of policies for handling reasoning/logical learning.) Of course it’s very unclear how to form such beliefs, but that doesn’t seem like a problem specific to updatelessness (i.e., it’s also unclear how to form beliefs about an updateful ideal agent).
Are there cases where EDT manages the news for reasons other than anthropic updating? I’m not aware of any, and if not, then this is exactly because of the interaction of EDT and anthropic updating.
To me, “managing the news” is just a description of how EDT works in general (i.e., EDT is definitionally about picking the action that gives us the best news). And I think EDT is problematic for that fundamental reason. I just think that Lukas’ case makes the silliness of news-management particularly vivid. (Other cases which arguably do so are XOR blackmail and this case.)
(I do think that Lukas’ case gets some of its counterintuitiveness from the fact that SIA has us weight worlds in proportion to how many copies of us they contain. But, again, that is just a counterintuitive property of SIA in general, which I think we ought to evaluate independently of how it interacts with decision theory.)
If “managing the news” just means “making a decision in situation X such that you are glad to hear the news that you made that decision in situation X,” then I agree that’s a description of EDT. I think it’s a priori reasonable to manage news about what you decide to do, so I don’t see this as a fundamental reason that EDT is problematic. I usually associate the phrase with various intuitive mistakes that EDT might make, and then I want to discuss concrete cases (like Lukas’) in which it appears an agent did something wrong.
More precisely: I think these failures probably enable Dutch books that CDT isn’t susceptible to.
So it’s not just that insufficient updatelessness fails to capture some potential value that you could have gotten if you were more updateless. It’s that it’s actively worse than CDT in some cases.
And that’s a large part of why I now feel more unhappy about EDT. I feel like I have a better sense of what my beliefs are than what my prior is. If EDT requires me to act according to my prior (and will lead to me making stupid decisions if I instead act according to changing beliefs) then I’m not sure exactly how to do that.
In your example, you know T1 and your counterpart knows T2. You see your behavior as correlated with the behavior of your counterpart. Under these conditions, it seems like T1 can’t possibly be so fundamental that you need it in order to do EDT-style reasoning (otherwise your counterpart couldn’t). So even if you grant that you need some beliefs to get EDT off the ground, it seems like those beliefs must be in the intersection of T1 and T2, in which case you wouldn’t run into this problem.
I think there are a lot of pretty basic open questions about how logically bounded minds work, especially if we want them to be philosophically principled all the way down. That’s confusing enough that it’s totally plausible that you’ll come out of it with some TDT-like story, where causality is a basic cognitive ingredient that makes the bootstrapping work at all. My guess is that in the end causality will just be a helpful language for talking about certain kinds of independence relationships rather than a fundamental cognitive building block or something with direct normative implications, but who knows.
(Note for observers: I’m not expecting us to resolve any of those questions as part of AI safety.)
So my current impression is basically that you’re optimistic about something similar to this:
And that your argument here is an argument for why it won’t be possible to make double-update arguments about this more narrow type of beliefs or reasoning. (Because it’s so fundamental that if it differs between agents, they can’t be correlated with each other.)
That argument seems plausible but not very robust to me at the moment. I mostly don’t have a great sense of what the “some beliefs to get [otherwise updateless] EDT off the ground”-paradigm looks like. Maybe I’ll look into it more.
I might be missing something, but the situation doesn’t seem that bad to me.
I tentatively think that we should bite the bullet and be logically and empirically updateless EDT agents. Admittedly, it’s rough that I don’t really know what my prior looks like, but I think we can deal with that.
The way I imagine it is that following ECL comes in two separate stages. First, we need to take the best actions under ECL that we can as confused mortals. Second, we need to decide how to use the resources of the universe once we are surrounded by superintelligent AI advisors and have had time to do a Long Reflection.
I think during the mortal phase, it’s okay that we don’t understand the prior very well. I think the only axioms that need to be in place before we become logically updateless are “it’s generally good if agents within the prior try to pursue their goals” and “we should try to follow ECL”. Once these are in place, I think the best overall policy is for each agent to look around in their world and try to figure out how to get more optionality for themselves and the ECL coalition, while behaving in ways that seem likely to have robustly good correlated actions across the multiverse.
In our case, we can increase optionality by trying to make sure that the stars don’t pointlessly burn out, that agents following ECL eventually get control of a big chunk of the world, and that there are good processes for growth and reflection. Meanwhile, it’s probably good to do less lying and backstabbing, and to be merciful towards the weak, because these seem somewhat likely to correlate with other actions across the multiverse that make it more likely that ECL agents can pursue their goals.
Occasionally, there is a toy example like Sleeping Beauty or the ones you listed in your post where we can explicitly reason about EDT, but usually we just follow the heuristics I described above. I don’t think we have a right to hope for much more clarity as mortals, and thankfully none of this requires knowing what the prior looks like.
In the second stage, once we have stabilized the situation, figured out some good growth process, acquired superintelligent advisors, and sent out probes to put out the fires of the distant stars, we still need to decide what to do with the resources. It seems plausible that, as you say, “what values we benefit in the future may be primarily determined by their frequency in our prior”.
But figuring out the frequency of different beings in our prior doesn’t seem to me like an especially intractable problem compared to the already scary question of what values we want to terminally pursue on reflection.
One idea I like for figuring out the prior is building a bigger and bigger coalition in an onion-like manner. We first run simulations of nearby quantum branches and pull out the cooperative people and AIs from there. We learn to live together, share our different perspectives, then we decide on a next broader distribution of quantum branches that we want to trade with, and so on.
I think logical updatelessness commits us to trying to trade with logically counterfactual beings too at the end. That requires finding a Schelling-point distribution of logically counterfactual universes that the other universes will also agree on. But by the time we get there, we will have already reflected on a lot of things, and learned to live together with the dinosaur-people we pulled out of simulations. I feel that this giant coalition of humans, AIs and dinosaur-people, all using superintelligent advisors, can pool their ideas and figure out how they imagine the logical prior and who they should trade with next. This doesn’t feel harder to me than other questions about the meaning of life that we will need to deal with.
In many ways I like the baseline strategy of “ignore decision theory, act in ways that heuristically seem like they gather option-value, figure out all the hard stuff with the help of superintelligence”.
I guess your proposal is similar to that except there’s an addition of “we have a hunch that something like ECL works and that this means we should be a bit more cooperative, so we’ll be a bit more cooperative”.
But for some purposes, it does seem useful to know the implications of decision theory. A few examples (some more important than others):
Are there categories of information that people can plausibly be harmed by learning, which we should try to avoid? Or will it clearly be correct (on reflection) to be retroactively updateless?
Understanding more detailed implications of ECL, like who we should cooperate with and how much.
Should we do some crazy DT-motivated pseudo-alignment scheme like this or this?
For the purposes of answering questions like these, I’m interested in whether I basically buy EDT (and with what kind of updatelessness) or if the real answer to DT is going to be pretty different.
One concern here is that EDT is maybe less of a coherent or appropriate decision theory than I thought. I don’t really think your plan addresses that. Like, your plan talks about how we’ll have lots of resources to think about what our prior should be, but doesn’t really address this part:
I don’t know, I still don’t see this as that bad of a sign for EDT. Yes, in the far future you will need to trade with people in your confusing and incoherent prior over logics.
But I think this is basically equivalent to handling the “what if you are in a simulation where the simulators intentionally messed with your brain so that you believe false things about certain logical statements” question. Admittedly, it’s a hard question, but I think everyone will need to deal with something like this.
Maybe I’m confused about how much you believe that my actual life history matters. I think in the case of empirical updatelessness, my life history doesn’t really matter—I will eventually try to trade with people in proportion to something like their measure in the Solomonoff prior, and not with worlds where Austria and Australia are the same country, even though I was uncertain about this empirical fact when I was 5. (Do you agree with this, or do you think life history also matters for empirically updateless trade?)
I expect that logical updatelessness is similar—I will try to use some elegant construction like the Solomonoff prior to put a weight on different logical counterfactuals, and it won’t matter how my prior was constructed in my childhood.
If you were 100% empirically updateless and 0% logically updateless, then I think your life history wouldn’t matter except insofar as it led you to learn different logical facts. Insofar as you would eventually reach ‘logical maturity’ regardless of your life history, and learn some vast and similar swaths of logical facts, then yeah, eventually your life history wouldn’t matter.
My current understanding is that this requires updating on a bunch of logical arguments (about what that elegant construction should be and what it implies) and can therefore get you Dutch-booked.
Yes, I agree the elegant construction will need to rely on some logical arguments, but I think that’s not that bad.
The way I imagine trade to work is that I propose a distribution of chips among different universes which I would be happy to trade under. For example, “each universe in the quantum multiverse getting chips proportional to what the Born rule prescribes” is a system I would be happy to trade under. Then I can see which other universes are willing to trade under this chip-distribution, and then we trade with each other using our chips.
I think this extends to trade among logically counterfactual worlds, at least in some toy examples. If an important historical event turned on someone making a bet on the billionth digit of pi, then the logically counterfactual worlds which were identical except that the billionth digit of pi was different can probably make a trade deal with each other, because they can all imagine this narrow logically counterfactual distribution and they all recognize it as a Schelling point.
I think we can probably go broader than that, and figure out an elegant distribution of chips among logical universes which a) we find fair, in the sense that we value the resources of the other universes in proportion to their chips (just like in the Born-rule case), so we are happy to trade under this distribution, and b) we think that many other universes in the distribution will share enough core features of our logic to recognize this distribution of chips as an elegant Schelling point, and something they consider fair under their values. Then we trade with everyone who is willing to trade. Once we are done, we try to construct a broader coalition.
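As a toy illustration (entirely hypothetical; the universe names and weights are made up), the onion-like expansion could look like repeatedly proposing a chip distribution and trading only with parties that recognize it as a fair Schelling point:

```python
# Toy illustration (hypothetical) of trading under a proposed chip
# distribution, with onion-like coalition growth. A proposal assigns chips
# (trade weights) to universes; a round of trade happens only among the
# universes that endorse the proposal as a fair Schelling point.

def trade_round(chips: dict[str, float], endorses: set[str]) -> set[str]:
    """Coalition for one proposed chip distribution."""
    return {u for u, c in chips.items() if c > 0 and u in endorses}

# Round 1: nearby quantum branches, weighted by the Born rule.
round1 = trade_round(
    chips={"branch_A": 0.7, "branch_B": 0.3},
    endorses={"branch_A", "branch_B"},
)

# Round 2: a broader distribution that also covers a logical counterfactual
# (say, a world where the billionth digit of pi differs). The 1+1=3 universe
# is assigned chips but can't recognize the proposal as a Schelling point,
# so it stays outside and that trade is left on the table.
round2 = trade_round(
    chips={"branch_A": 0.35, "branch_B": 0.15,
           "pi_digit_flip": 0.4, "one_plus_one_is_3": 0.1},
    endorses={"branch_A", "branch_B", "pi_digit_flip"},
)

print(round1)  # {'branch_A', 'branch_B'}
print(round2)  # {'branch_A', 'branch_B', 'pi_digit_flip'}
```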
I think it’s likely that we won’t be able to expand the trade coalitions enough to cover all possible logic systems (what would that even mean?). Maybe we will never be able to deal with the guys living in the 1+1=3 universe, because we can’t imagine them, they can’t imagine us, and there is no distribution we both recognize as elegant and fair. In that case, we will leave some value on the table by not being able to trade with each other, but that’s life. I don’t think this means that we need to throw out this decision theory—it was nice enough if we got beneficial trades in as broad circles as we could.
---
Most concretely, I don’t see how you get Dutch-booked here. I tentatively think that any bets that a malicious Dutchman can come up with to get money out of me will be based on simple enough logical counterfactuals that I can form a mutually recognized distribution of logic systems among the affected parties, and we will be fine.
Can you give a concrete example of how someone can pump money out of me?
Thinking more about it, I think I don’t stand by my original reply. It seems possible to have some theorems whose truth I currently feel 50-50 about, but which are important enough that I’m at least uncertain whether I will ever be able to build a broad enough coalition of logically counterfactual beings that includes people for whom the opposite of the theorem is true.
I think the same problem arises for some empirical questions too—T1 and T2 can be questions like “is iron’s atomic number 26 or 27?” I would have been roughly 50-50 before looking it up, but I’m uncertain whether I should try to cooperate with people living in worlds where the atomic number of iron is 27; I don’t know if those worlds are compatible with life.
However, thinking through these examples, I think I now reject the premise that updateful EDT bets wrongly in your example of the two theorems or in Paul’s original calculator example.
I think in both cases the decision-correlational reference class you should take into account is not just you learning that T1 is true and you learning that T2 is true within this particular experiment. It’s every instance across the multiverse where beings similar to you need to make bets about questions they have no clue about. Taking all these correlations into account, the correct thing to do is to bet at 50-50.
(As an example: when I’m betting on the atomic number of iron, I shouldn’t think of myself as cooperating with versions of myself who live in a world where iron has 27 protons. Those worlds might not exist. But I’m cooperating with instances where the game-master decided to ask if iron has 25 or 26 protons.)
Separately, at the end of the day, I still want to do acausal trade with a broad coalition of worlds, which might or might not include ones where iron has 27 protons and the T1 theorem is false. But I now think that this is a separate question, and updatelessness might not be required in our mortal life.
Minor: The question about whether those worlds are compatible with life seems like a logical rather than empirical question to me. So this still seems like an issue with logical updatelessness rather than empirical updatelessness.
As in: If you’re in a counterfactual mugging where Omega says they’d reward you in a world where iron has 27 protons if you pay in this world, then you pay because you expect there to be a bunch of Omegas elsewhere doing other logical counterfactual muggings. In roughly half of those cases, someone is about to get paid by Omega if their impossible counterpart pays. And your action provides evidence that their impossible counterpart pays and that Omega gives them the reward.
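A rough back-of-the-envelope, with made-up variables c and R, just to pin down the structure: say paying costs c and Omega’s reward is R. Across the ensemble of logical muggings you’re on the paying side roughly half the time and on the rewarded side the other half, and your paying is evidence that your counterparts pay. So the paying policy has an expected value of roughly ½(−c) + ½R = (R − c)/2 per mugging, versus roughly 0 for refusing, and paying comes out ahead whenever R > c.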
And the same structure applies in the calculator example and the “conjunction of two theorems” example: you’re correlated with a bunch of other distant people about whose situations you have so little information that your epistemic position is “ex-ante” relative to their dilemma. So even if you’re updateful, you bet to optimize ex-ante utility in your case, to get evidence that they bet to optimize ex-ante utility in theirs.
Hm, maybe that’s right.
Doesn’t that feel really unsatisfying though? I still feel like updateful EDT recommends wrong actions in important test cases. It’s just a contingent fact that most dilemmas like this will be small-scale in a large-scale universe, and that EDT’s recommendation to act as if you double-update will be swamped by not wanting to get evidence that other people elsewhere double-update. And there’s always going to be that force pushing towards recommending double-updating, so if you ever get evidence that a decision is high-stakes enough and universal enough throughout the universe, EDT may well recommend that you make an exception for it and do a proper double-update on it, which seems bad.
I think that the situation needs to be quite extreme for my argument not to work. I think it’s quite likely I will never get to the point where I think that a decision is particularly high-stakes or universal in the grand scheme of things. I think it’s plausible that until negentropy runs out, I will always think that there is an even larger and more complicated distribution of logically counterfactual worlds out there that I haven’t explored yet, compared to which I’m only a tiny speck. So I think plausibly I will always think that I should bet 50-50 when I know nothing about something, because that’s the right policy overall.
I agree though that it’s not entirely impossible that I will come to a point where I no longer have uncertainty about what’s outside the distribution I have already explored; where I believe that my decision is very high-stakes and doesn’t correlate with many other different decisions in my logical distribution; and where I believe that worlds where T1 is false are so inconceivable that they can’t be part of my trade coalition of logically counterfactual worlds.
But I think that’s also the point where normal probabilities and betting rules entirely break down for me.
When I make a bet about a 1⁄4-probability event, I imagine that I’m making decisions for four subagents, representing beliefs in the four different outcomes. Normally, when I bet on coinflips and other mundane questions, these four subagents love each other, and they are utilitarian about maximizing the sum of their resources. So they are okay with making a bet on one outcome, which means transferring the money of three subagents to the fourth.
But if I believe that once I learn that T1 is true, I will consider it inconceivable that T1-false worlds can ever be part of my coalition, that’s a different situation. In that case, I think my T1-true and T1-false subagents don’t love each other and are indifferent to each other’s well-being. If I’m offered a bet, that’s equivalent to three subagents transferring their wealth to the fourth, and they will refuse to do that. So if I’m only offered one possible bet (betting on the conjunction of T1 and T2), I think I will bet one-fourth of my wealth on it, independently of the odds.
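Here’s a minimal sketch of the two parliament styles (my own formalization, not something from the discussion above), for a single offered bet on the conjunction at net odds b:1:

```python
# Hypothetical sketch of the subagent parliament above. Four subagents
# represent the outcomes {T1&T2, T1&~T2, ~T1&T2, ~T1&~T2}, each holding a
# 1/4 share of the wealth. A bet on the conjunction at net odds b:1 wins b
# per unit staked if T1&T2 holds and loses the stake otherwise.

def utilitarian_stake(p: float, b: float) -> float:
    # Mutually loving subagents maximize total expected wealth: stake
    # everything when the bet is positive-EV at the offered odds, else nothing.
    return 1.0 if p * b > (1 - p) else 0.0

def selfish_stake(p: float, b: float) -> float:
    # Mutually indifferent subagents won't fund each other's bets: only the
    # conjunction-believing subagent stakes its own share, so the stake is
    # p (here 1/4) of total wealth at any positive odds.
    return p if b > 0 else 0.0

for b in [0.5, 3.0, 100.0]:
    print(b, utilitarian_stake(0.25, b), selfish_stake(0.25, b))
# -> 0.5 0.0 0.25 / 3.0 0.0 0.25 / 100.0 1.0 0.25
```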
I agree this sounds a bit like an epicycle, but belief-representing subagents negotiating in a moral parliament is an important part of my world-view for other reasons too (I will soon send a doc about this to you), so this solution feels quite natural to me. And it’s not like I otherwise have great intuitions about what to do at the point of meta-logical near-omniscience where I am able to tell that my current decision is high-stakes within the entire multiverse of logically counterfactual worlds.
Setting aside whether these count as “errors” (I basically agree with Jesse’s comment), I’m not sure why you think the above. Paul’s case involves this assumption: “We live in a very big universe where many copies of me all face the exact same decision. This seems plausible for a variety of reasons; the best one is accepting an interpretation of quantum mechanics without collapse (a popular view).”
In one sense this is mundane, since if you buy any of the large world theories, you’d think we’ve always been in such a large world. Maybe that’s what you mean. But it’s very non-mundane in the sense that these large worlds are still clearly out-of-distribution for our intuitions. We haven’t directly experienced the consequences of lots of copies of ourselves doing things. This is one big reason I’m pretty suspicious of my pre-theoretic intuitions about Paul’s case.