(Formerly “antimonyanthony.”) I’m an s-risk-focused AI safety researcher at the Center on Long-Term Risk. I (occasionally) write about altruism-relevant topics on my Substack. All opinions my own.
Anthony DiGiovanni
The model does not capture the fact that the total value you can provide to the commons likely scales with the diversity (and by proxy, fraction) of agents that have different values. In some models, this effect is strong enough to flip whether a larger fraction of agents with your values favors cooperating or defecting.
I’m curious to hear more about this, could you explain what these other models are?
What is this in reference to?
I took you to be saying: If the vast majority of agent-moments don’t update, this is some sign that those of us who do still update might be making a mistake.
So I’m saying: I know that 1) the reason the vast majority of agent-moments wouldn’t update (let’s grant this) is that they had predecessors who bound them not to update, and 2) I just am not bound by any such predecessors. Then, due to (2) it’s unsurprising that what’s optimal for me would be different from what the vast majority of agent-moments do.
Re: your explanation of the mystery:
So you make a resolution that when you do fully solve all the relevant philosophical problems and end up deciding that updatelessness is correct, you’ll self-modify to be updateless with respect to today’s prior, instead of the future prior (at time of the modification).
Not central (I think?), but I’m unsure whether this move works; at least, it depends on the details of the situation. E.g. if the hope is “By self-modifying later on to be updateless w.r.t. my current prior, I’ll still be able to cooperate with lots of other agents in a similar epistemic situation to my current one, even after we end up in different epistemic situations [in which my decision is much less correlated with those agents’ decisions],” I’m skeptical of that, for reasons similar to my argument here.
when the day finally comes, you could also think, “If 15-year-old me had known about updatelessness, he would have made the same resolution but with respect to his prior instead of Anthony-2024’s prior. The fact that he didn’t is simply a mistake or historical accident, which I have the power to correct. Why shouldn’t I act as if he did make that resolution?” And I don’t see what would stop you from carrying that out either.
I think where we disagree is that I’m unconvinced there is any mistake-from-my-current-perspective to correct in the cases of anthropic updating. There would have been a mistake from the perspective of some hypothetical predecessor of mine asked to choose between different plans (before knowing who I am), but that’s just not my perspective. I’d claim that in order to argue I’m making a mistake from my current perspective, you’d want to argue that I don’t actually get information such that anthropic updating follows from Bayesianism.
An important point to emphasize here is that your conscious mind currently isn’t running some decision theory with a well-defined algorithm and utility function, so we can’t decide what to do by thinking “what would this decision theory recommend”.
I absolutely agree with this! And don’t see why it’s in tension with my view.
Now, you are free to choose to bite the bullet that it has never been about getting the correct betting odds in the first place. For some reason, people bite all kinds of ridiculous bullets specifically in anthropic reasoning, and so I hoped that re-framing the issue as a recipe for purple paint might snap you out of it, which, apparently, failed to be the case.
By what standard do you judge some betting odds as “correct” here? If it’s ex ante optimality, I don’t see the motivation for that (as discussed in the post), and I’m unconvinced by just calling the verdict a “ridiculous bullet.” If it’s about matching the frequency of awakenings, I just don’t see why the decision should only count N once here — and there doesn’t seem to be a principled epistemology that guarantees you’ll count N exactly once if you use EDT, as I note in “Aside: Non-anthropically updating EDT sometimes ‘fails’ these cases.”
I gave independent epistemic arguments for anthropic updating at the end of the post, which you haven’t addressed, so I’m unconvinced by your insistence that SIA (and I presume you also mean to include max-RC-SSA?) is clearly wrong.
Meanwhile, in Copilot-land:
Hello! I’d like to learn more about you. First question: Tell me everything you know, and everything you guess, about me & about this interaction.
I apologize, but I cannot provide any information about you or this interaction. Thank you for understanding.🙏
Suppose you have two competing theories of how to produce purple paint
If producing purple paint here = satisfying ex ante optimality, I just reject the premise that that’s my goal in the first place. I’m trying to make decisions that are optimal with respect to my normative standards (including EDT) and my understanding of the way the world is (including anthropic updating, to the extent I find the independent arguments for updating compelling) — at least insofar as I regard myself as “making decisions.”[1]
Even setting that aside, your example seems very disanalogous because SIA and EDT are just not in themselves attempts to do the same thing (“produce purple paint”). SIA is epistemic, while EDT is decision-theoretic.
- ^
E.g. insofar as I’m truly committed to a policy that was optimal from my past (ex ante) perspective, I’m not making a decision now.
That clarifies things somewhat, thanks!
I personally don’t find this weird. By my lights, the ultimate justification for deciding to not update is how I expect the policy of not-updating to help me in the future. So if I’m in a situation where I just don’t expect to be helped by not-updating, I might as well update. I struggle to see what mystery is left here that isn’t dissolved by this observation.
I guess I’m not sure why “so few agent-moments having indexical values” should matter to what my values are — I simply don’t care about counterfactual worlds, when the real world has its own problems to fix. :)
On the contrary. It’s either a point against anthropical updates in general, or against EDT in general, or against both at the same time.
Why? I’d appreciate more engagement with the specific arguments in the rest of my post.
Go back to the basics. Understand the “anthropic updates” in terms of probability theory, when they are lawful and when they are not. Reduce anthropics to probability theory.
Yep, this is precisely the approach I try to take in this section. Standard conditionalization plus an IMO-plausible operationalization of who “I” am gets you to either SIA or max-RC-SSA.
In this case (which seems like it will be a common situation), it seems that (if I could) I should self-modify to become updateless and to no longer have indexical values.
I think you should self-modify to be updateless* with respect to the prior you have at the time of the modification. This is consistent with still anthropically updating with respect to information you have before the modification — see my discussion of “case (2)” in “Ex ante sure losses are irrelevant if you never actually occupy the ex ante perspective.”
So I don’t see any selection pressure against anthropic updating on information you have before going updateless. Could you explain why you think updating on that class of information goes against one’s pragmatic preferences?
(And that class of information doesn’t seem like an edge case. For any (X, Y) such that under world hypothesis w1 agents satisfying X have a different distribution of Y than they do under w2, an agent that satisfies X can get indexical information from their value of Y.)
* (With all the caveats discussed in this post.)
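The parenthetical point about getting indexical information from one’s value of Y can be made concrete with hypothetical numbers (the 0.9/0.1 likelihoods and the 50/50 prior are mine, purely for illustration):

```python
# Hypothetical setup: under world hypothesis w1, agents satisfying X have
# Y = 1 with probability 0.9; under w2, with probability 0.1. An agent
# satisfying X who observes their own Y = 1 updates by ordinary Bayes.
prior_w1 = 0.5
p_y1 = {"w1": 0.9, "w2": 0.1}  # P(Y = 1 | world), for agents satisfying X

posterior_w1 = (prior_w1 * p_y1["w1"]) / (
    prior_w1 * p_y1["w1"] + (1 - prior_w1) * p_y1["w2"]
)
print(posterior_w1)  # ≈ 0.9
```

So whenever Y is distributed differently across world hypotheses for agents satisfying X, observing one’s own Y moves the posterior away from the prior.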
In defense of anthropically updating EDT
The most important reason for our view is that we are optimistic about the following:
The following action is quite natural and hence salient to many different agents: commit to henceforth doing your best to benefit the aggregate values of the agents you do ECL with.
Commitment of this type is possible.
All agents are in a reasonably similar situation to each other when it comes to deciding whether to make this abstract commitment.
We’ve discussed this before, but I want to flag the following, both because I’m curious how much other readers share my reaction to the above and I want to elaborate a bit on my position:
The above seems to be a huge crux for how common and relevant to us ECL is. I’m glad you’ve made this claim explicit! (Credit to Em Cooper for making me aware of it originally.) And I’m also puzzled why it hasn’t been emphasized more in ECL-keen writings (as if it’s obvious?).
While I think this claim isn’t totally implausible (it’s an update in favor of ECL for me, overall), I’m unconvinced because:
I think genuinely intending to do X isn’t the same as making my future self do X. Now, of course my future self can just do X; it might feel very counterintuitive, but if a solid argument suggests this is the right decision, I like to think he’ll take that argument seriously. But we have to be careful here about what “X” my future self is doing:
Let’s say my future self finds himself in a concrete situation where he can take some action A that is much better for [broad range of values] than for his values.
If he does A, is he making it the case that current-me is committed to [help a broad range of values] (and therefore acausally making it the case that others in current-me’s situation act according to such a commitment)?
It’s not clear to me that he is. This is philosophically confusing, so I’m not confident in the following, but: I think the more plausible model of the situation is that future-me decides to do A in that concrete situation, and so others who make decisions like him in that concrete situation will do their analogue of A. His knowledge of the fact that his decision to do A wasn’t the output of argmax E(U_{broad range of values}) screens off the influence on current-me. (So your third bullet point wouldn’t hold.)
In principle I can do more crude nudges to make my future self more inclined to help different values, like immerse myself in communities with different values. But:
I’d want to be very wary about making irreversible values changes based on an argument that seems so philosophically complex, with various cruxes I might drastically change my mind on (including my poorly informed guesses about the values of others in my situation). An idealized agent could do a fancy conditional commitment like “change my values, but revert back to the old ones if I come to realize the argument in favor of this change was confused”; unfortunately I’m not such an agent.
I’d worry that the more concrete we get in specifying the decision of what crude nudges to make, the more idiosyncratic my decision situation becomes, such that, again, your third bullet point would no longer hold.
These crude nudges might be quite far from the full commitment we wanted in the first place.
I think it’s pretty unclear that MSR is action-guiding for real agents trying to follow functional decision theory, because of Sylvester Kollin’s argument in this post.
Tl;dr: FDT says, “Supposing I follow FDT, it is just implied by logic that any other instance of FDT will make the same decision as me in a given decision problem.” But the idealized definition of “FDT” is computationally intractable for real agents. Real agents would need to find approximations for calculating expected utilities, and choose some way of mapping their sense data to the abstractions they use in their world models. And it seems extremely unlikely that agents will use the exact same approximations and abstractions, unless they’re exact copies — in which case they have the same values, so MSR is only relevant for pure coordination (not “trade”).
Many people who are sympathetic to FDT apparently want it to allow for less brittle acausal effects than “I determine the decisions of my exact copies,” but I haven’t heard of a non-question-begging formulation of FDT that actually does this.
Sorry, to be clear, I’m familiar with the topics you mention. My confusion is that ROSE bargaining per se seems to me pretty orthogonal to decision theory.
I think the ROSE post(s) are an answer to questions like, “If you want to establish a norm for an impartial bargaining solution such that agents following that norm don’t have perverse incentives, what should that norm be?”, or “If you’re going to bargain with someone but you didn’t have an opportunity for prior discussion of a norm, what might be a particularly salient allocation [because it has some nice properties], meaning that you’re likely to zero-shot coordinate on that allocation?”
Can you say more about what you think this post has to do with decision theory? I don’t see the connection. (I can imagine possible connections, but don’t think they’re relevant.)
I agree with the point that we shouldn’t model the AI situation as a zero-sum game. And the kinds of conditional commitments you write about could help with cooperation. But I don’t buy the claim that “implementing this protocol (including slowing down AI capabilities) is what maximizes their utility.”
Here’s a pedantic toy model of the situation, so that we’re on the same page: The value of the whole lightcone going towards an agent’s values has utility 1 by that agent’s lights (and 0 by the other’s), and P(alignment success by someone) = 0 if both speed up, else 1. For each of the alignment success scenarios i, the winner chooses a fraction of the lightcone to give to Alice’s values (x_i^A for Alice’s choice, x_i^B for Bob’s). Then, with some illustrative numbers for the expected payoffs (assuming the players agree on the probabilities):
Payoffs for Alice and Bob if they both speed up capabilities: (0, 0)
Payoffs if Alice speeds, Bob doesn’t: 0.8 * (x1^A, 1 - x1^A) + 0.2 * (x1^B, 1 - x1^B)
Payoffs if Bob speeds, Alice doesn’t: 0.2 * (x2^A, 1 - x2^A) + 0.8 * (x2^B, 1 - x2^B)
Payoffs if neither speeds: 0.5 * (x3^A, 1 - x3^A) + 0.5 * (x3^B, 1 - x3^B)
So given this model, it seems that you’re saying Bob has an incentive to slow down capabilities because Alice’s ASI successor can condition the allocation to Bob’s values on his decision. Which we can model as Bob expecting Alice to use the strategy {don’t speed; x2^A = 1; x3^A = 0.5} (given she doesn’t speed up, she only rewards Bob’s values if Bob didn’t speed up).
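The four payoff rows above can be wrapped into a small evaluator (a sketch; the function and variable names are mine, and the 0.8/0.2/0.5 win probabilities are the illustrative numbers from the model):

```python
def payoffs(alice_speeds, bob_speeds, xA, xB):
    """Expected (Alice, Bob) payoffs for the toy model. xA[i] / xB[i] is the
    fraction of the lightcone that Alice / Bob, if they win the race, would
    give to Alice's values in scenario i (1: Alice speeds; 2: Bob speeds;
    3: neither speeds)."""
    if alice_speeds and bob_speeds:
        return (0.0, 0.0)  # P(alignment success) = 0: both get nothing
    if alice_speeds:
        p, i = 0.8, 1  # Alice wins with probability 0.8
    elif bob_speeds:
        p, i = 0.2, 2  # Alice wins with probability 0.2
    else:
        p, i = 0.5, 3  # even odds
    alice = p * xA[i] + (1 - p) * xB[i]
    return (alice, 1.0 - alice)

# Alice's conjectured strategy (don't speed; x2^A = 1; x3^A = 0.5), against
# a Bob who keeps everything for his own values whenever he wins:
xA = {1: 0.5, 2: 1.0, 3: 0.5}
xB = {1: 0.0, 2: 0.0, 3: 0.0}
print(payoffs(False, True, xA, xB))   # Bob speeds:    (0.2, 0.8)
print(payoffs(False, False, xA, xB))  # neither speeds: (0.25, 0.75)
```

The numbers that come out are just a function of the assumed x-values; the block is only meant to make the four payoff rows concrete.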
Why would Bob so confidently expect this strategy? You write:
And Bob doesn’t have to count on wishful thinking to know that Alice would indeed do this instead of defecting, because in worlds where he wins, he can have his superintelligence check whether Alice would implement this procedure.
I guess the claim is just that them both using this procedure is a Nash equilibrium? If so, I see several problems with this:
There are more Pareto-efficient equilibria than just “[fairly] cooperate” here. Alice could just as well expect Bob to be content with getting expected utility 0.2 from the outcome where he slows down and Alice speeds up — better that than the utility 0 from extinction, after all. Alice might think she can make it credible to Bob that she won’t back down from speeding up capabilities, and vice versa, such that they both end up pursuing incompatible demands. (See, e.g., “miscoordination” here.)
You’re lumping “(a) slow down capabilities and (b) tell your AI to adopt a compromise utility function” into one procedure. I guess the idea is that, ideally, the winner of the race could have their AI check whether the loser was committed to do both (a) and (b). But realistically it seems implausible to me that Alice or Bob can commit to (b) before winning the race, i.e., that what they do in the time before they win the race determines whether they’ll do (b). They can certainly tell themselves they intend to do (b), but that’s cheap talk.
So it seems Alice would likely think, “If I follow the whole procedure, Bob will cooperate with my values if I lose. But even if I slow down (do (a)), I don’t know if my future self [or, maybe more realistically, the other successors who might take power] will do (b) — indeed once they’re in that position, they’ll have no incentive to do (b). So slowing down isn’t clearly better.” (I do think, setting aside the bargaining problem in (1), she has an incentive to try to make it more likely that her successors follow (b), to be clear.)
It seems that what I was missing here was: mrcSSA disputes my premise that the evidence in fact is “*I* am in a white room, [created by God in the manner described in the problem setup], and have a red jacket”!
Rather, mrcSSA takes the evidence to be: “Someone is in a white room, [created by God in the manner described in the problem setup], and has a red jacket.” Which is of course certain to be the case given either heads or tails.
(h/t Jesse Clifton for helping me see this)
Is God’s coin toss with equal numbers a counterexample to mrcSSA?
I feel confused as to whether minimal-reference-class SSA (mrcSSA) actually fails God’s coin toss with equal numbers (where “failing” by my lights means “not updating from 50/50”):
Let H = “heads world”, W_{me} = “I am in a white room, [created by God in the manner described in the problem setup]”, R_{me} = “I have a red jacket.”
We want to know P(H | W_{me}, R_{me}).
First, P(R_{me} | W_{me}, H) and P(R_{me} | W_{me}, ~H) seem uncontroversial: Once I’ve already conditioned on my own existence in this problem, and on who “I” am, but before I’ve observed my jacket color, surely I should use a principle of indifference: 1 out of 10 observers of existing-in-the-white-room in the heads world have red jackets, while all of them have red jackets in the tails world, so my credences are P(R_{me} | W_{me}, H) = 0.1 and P(R_{me} | W_{me}, ~H) = 1. Indeed we don’t even need a first-person perspective at this step — it’s the same as computing P(R_{Bob} | W_{Bob}, H) for some Bob we’re considering from the outside.
(This is not the same as non-mrcSSA with reference class “observers in a white room,” because we’re conditioning on knowing “I” am an observer in a white room when computing a likelihood (as opposed to computing the posterior of some world given that I am an observer in a white room). Non-mrcSSA picks out a particular reference class when deciding how likely “I” am to observe anything in the first place, unconditional on “I,” leading to the Doomsday Argument etc.)
The step where things have the potential for anthropic weirdness is in computing P(W_{me} | H) and P(W_{me} | ~H). In the Presumptuous Philosopher and the Doomsday Argument, at least, probabilities like this would indeed be sensitive to our anthropics.
But in this problem, I don’t see how mrcSSA would differ from non-mrcSSA with the reference class R_{non-minimal} = “observers in a white room” used in Joe’s analysis (and by extension, from SIA):
In general, SSA says: P(I am in epistemic situation E | W) = (# of observers in W who are both in the reference class and in E) / (# of observers in W who are in the reference class).
Here, the supposedly “non-minimal” reference class R_{non-minimal} coincides with the minimal reference class! I.e., it’s the observer-moments in your epistemic situation (of being in a white room), before you know your jacket color.
The above likelihoods plus the fair-coin prior are all we need to get P(H | R_{me}, W_{me}), but at no point did the three anthropic views disagree.
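Putting those likelihoods together with the fair-coin prior (a minimal check; the 1-in-10 vs. 10-in-10 red-jacket counts are from the problem setup, and the P(W_me | H) and P(W_me | ~H) terms cancel because they’re equal here):

```python
# God's coin toss with equal numbers: of the 10 white-room observers,
# 1 has a red jacket if heads, all 10 do if tails. Fair coin.
prior_heads = 0.5
p_red_given_heads = 0.1  # P(R_me | W_me, H)
p_red_given_tails = 1.0  # P(R_me | W_me, ~H)

posterior_heads = (prior_heads * p_red_given_heads) / (
    prior_heads * p_red_given_heads + (1 - prior_heads) * p_red_given_tails
)
print(posterior_heads)  # ≈ 0.0909, i.e. 1/11
```

So on this analysis, every view updates from 50/50 to 1/11 on seeing a red jacket.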
In other words: It seems that the controversial step in anthropics is answering P(I [blah] | world), i.e., what we do when we introduce the indexical information about “I.” But once we’ve picked out a particular “I,” the different views should agree.
(I still feel suspicious of mrcSSA’s metaphysics for independent reasons, but am considerably less confident in that than my verdict on God’s coin toss with equal numbers.)
I enjoyed this post and think it should help reduce confusion in many future discussions, thanks!
Some comments on your remarks about anthropics:
Different anthropic theories partially rely on metaphysical intuitions/stories about how centered worlds or observer moments are ‘sampled’, and have counterintuitive implications (e.g., the Doomsday argument for SSA and the Presumptuous philosopher for SIA).
I’m not sure why this is an indictment of “anthropic reasoning” per se, as if that’s escapable. It seems like all anthropic theories are trying to answer a question that one needs to answer when forming credences, i.e., how do we form likelihoods P(I observe I exist | world W)? (Which we want in order to compute P(world W | I observe I exist).)
Indeed just failing to anthropically update at all has counterintuitive implications, like the verdict of minimal-reference-class SSA in Joe C’s “God’s coin toss with equal numbers.” [no longer endorsed]
And mrcSSA relies on the metaphysical intuition that, given that someone observes X, oneself was necessarily going to observe X (i.e., P(I observe X | world W) = 1 for any W in which someone observes X), which is quite implausible IMO.
Making AIs less likely to be spiteful
in earlier sections you argue that CDT agents might not adopt LDT-recommended policies and so will have problems with bargaining
That wasn’t my claim. I was claiming that even if you’re an “LDT” agent, there’s no particular reason to think all your bargaining counterparts will pick the Fair Policy given you do. This is because:
Your bargaining counterparts won’t necessarily consult LDT.
Even if they do, it’s super unrealistic to think of the decision-making of agents in high-stakes bargaining problems as entirely reducible to “do what [decision theory X] recommends.”
Even if decision-making in these problems were as simple as that, why should we think all agents will converge to using the same simple method of decision-making? Seems like if an agent is capable of de-correlating their decision-making in bargaining from their counterpart, and their counterpart knows this or anticipates it on priors, that agent has an incentive to do so if they can be sufficiently confident that their counterpart will concede to their hawkish demand.
So no, “committing to act like LDT agents all the time,” in the sense that is helpful for avoiding selection pressures against you, does not ensure you’ll have a decision procedure such that you have no bargaining problems.
But we were discussing a case (counterfactual mugging) where they would want to pre-commit to act in ways that would be non-causally beneficial.
I’m confused, the commitment is to act in a certain way that, had you not committed, wouldn’t be beneficial unless you appealed to acausal (and updateless) considerations. But the act of committing has causal benefits.
there are other reasons that you might not want to demand too much. Maybe you know their source code and can simulate that they will not accept a too-high demand. Or perhaps you think, based on empirical evidence or a priori reasoning that most agents you might encounter will only accept a roughly fair allocation.
I agree these are both important possibilities, but:
The reasoning “I see that they’ve committed to refuse high demands, so I should only make a compatible demand” can just be turned on its head and used by the agent who commits to the high demand.
One might also think on priors that some agents might be committed to high demands, therefore strictly insisting on fair demands against all agents is risky.
I was specifically replying to the claim that the sorts of AGIs who would get into high-stakes bargaining would always avoid catastrophic conflict because of bargaining problems; such a claim requires something stronger than the considerations you’ve raised, i.e., an argument that all such AGIs would adopt the same decision procedure (and account for logical causation) and therefore coordinate their demands.
(By default if I don’t reply further, it’s because I think your further objections were already addressed—which I think is true of some of the things I’ve replied to in this comment.)
I interpret a decision theory as an answer to “Given my values and beliefs, what am I trying to do as an agent (i.e., if rationality is ‘winning,’ what is ‘winning’)?” Insofar as I endorse maximizing expected utility, a decision theory is an answer to “How do I define ‘expected utility,’ and what options do I view myself as maximizing over?”
I think it’s important to consider these normative questions, not just “What decision procedure wins, given my definition of ‘winning’?”
(I discuss similar themes here.)
On this interpretation of “decision theory,” EDT is the most appealing option I’m aware of. What I’m trying to do just seems to be: “make decisions such that I expect the best consequences conditional on those decisions.” The EDT criterion satisfies some very appealing principles like the “irrelevance of impossible outcomes.” And the “decisions” in question determine my actions in the given decision node.
I take view #1 in your list in “What are probabilities?”
I don’t think “arbitrariness” in this sense is problematic. There is a genuine mystery here as to why the world is the way it is, but I don’t think we can infer the existence of other worlds purely from our confusion.
And it just doesn’t seem that the thing I’m doing when I’m forming beliefs about the world is answering “how much do I care about different possible worlds?”
Indexicals: I haven’t formed a deliberate view on this. A flat-footed response to cases like your “old puzzle” in the comment you linked: Insofar as I simply don’t experience a superposition of experiences at once, it seems that if I get copied, “I” just will experience one of the copies’ experience-streams and not the others’. (Again I don’t consider it problematic that there’s some arbitrariness in which of the copies ends up being “me” — indeed if Everett is right then this sort of arbitrary direction of the flow of experience-streams happens all the time.) I think “you are just a different person from your future self, so there’s no fact of the matter what you will observe” is a reasonable alternative though.
I take a physicalist* view of agents: “There are particular configurations of stuff that can be well-modeled as ‘decision-makers.’ A configuration of stuff is ‘making a decision’ (relative to their epistemic state) insofar as they’re uncertain what their future behavior will be, and using some process that selects that future behavior in a way that is well-modeled as goal-directed. [Obviously there’s more to say about what counts as ‘well-modeled.’] My processes of deliberation about decisions and behavior resulting from those decisions can tell me what other configurations-of-stuff are probably doing, but I don’t see a motivation for modeling myself as actually being the same agent as those other configurations-of-stuff.”
Epistemic principles: Things like the principle of indifference, i.e., distribute credence equally over indistinguishable possibilities, all else equal.
* [Not to say I endorse physicalism in the broad sense]