Even if we lose, we win

Epistemic status: Updating on this comment and taking into account uncertainty about my own values, my credence in this post is around 50%.

TLDR: Even in worlds where we create an unaligned AGI (UAI), it will cooperate acausally with counterfactual FAIs—and spend some percentage of its resources pursuing human values—as long as its utility function is concave in resources spent. The amount of resources the UAI spends on humans will be roughly proportional to the measure of worlds with aligned AGI, so this does not change the fact that we should be working on alignment.

Assumptions

  1. Our utility function is concave in resources spent; e.g. we would prefer a 100% chance of 50% of the universe turning into utopia to a 50% chance of 100% of the universe turning into utopia, assuming that the rest of the universe is going to be filled with something we don’t really care about, like paperclips. (A toy calculation illustrating this appears just after this list.)

  2. There is a greater total measure of worlds containing aligned AGI with concave utility than worlds containing anti-aligned AGI with concave utility, anti-aligned meaning it wants things that directly oppose our values, like suffering of sentient beings or destruction of knowledge.

  3. AGIs will be able to predict counterfactual other AGIs well enough for acausal cooperation.

  4. (Added in response to these comments) AGIs will use a decision theory that allows for acausal cooperation between Everett Branches (like LDT).
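
To make assumption 1 concrete, here is a minimal sketch, using square root as an arbitrary concave utility (any concave function gives the same qualitative answer): a guaranteed half of the universe beats a fair coin flip for the whole thing.

```python
import math

def utility(fraction_of_universe: float) -> float:
    # An arbitrary concave utility; the argument only needs concavity.
    return math.sqrt(fraction_of_universe)

certain_half = utility(0.5)                               # 100% chance of half the universe
all_or_nothing = 0.5 * utility(1.0) + 0.5 * utility(0.0)  # 50% chance of everything, else paperclips

print(certain_half, all_or_nothing)  # ~0.707 vs 0.5
assert certain_half > all_or_nothing
```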

Acausal cooperation between AGIs

Let’s say two agents, Alice and Bob, have utilities that are concave in money. Let’s say for concreteness that each agent’s utility is the logarithm of their bank account balance, and they each start with $10. They are playing a game where a fair coin is flipped. If it lands heads, Alice gets $10, and if it lands tails, Bob gets $10. Each agent’s expected utility here is

½ log($20) + ½ log($10) ≈ 2.65.

If, instead, Alice and Bob both just receive $5, their expected utility is

log($15) ≈ 2.71.

Therefore, in the coin flip game, it is in both agents’ best interests to agree beforehand to have the winner pay the loser $5. If Alice and Bob know each other’s source code, they can cooperate acausally on this after the coin is flipped, using Löbian Cooperation or similar. This is of course not the only Pareto-optimal way to cooperate, but it is clearly at the Galactic Schelling Point of fairness, and both agents will have policies that disincentivize “unfair” agreements, even if they are still on the Pareto frontier (e.g. Alice pays Bob $6 if she wins, and Bob pays Alice $4 if he wins). If the coin is weighted, a reasonable Galactic Schelling Point would be for Alice to pay Bob in proportion to Bob’s probability of winning, and vice-versa, so that both agents end up always getting what would have been their expected amount of money (but more expected utility because of concavity).
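
Here is a small sketch of that calculation, using the log utility and $10 starting balances from the example above. The weighted-coin case encodes one reading of the proportional-payment Schelling point: the winner pays the loser the loser’s expected winnings, so each agent always walks away with exactly its expected amount of money.

```python
import math

START = 10.0   # each agent starts with $10
PRIZE = 10.0   # the coin-flip winner receives $10

def expected_log_utils(p_alice, pay_if_alice_wins, pay_if_bob_wins):
    """Expected log-utility for (Alice, Bob) under a policy where the
    winner pays the stated amount to the loser."""
    # Branch 1: Alice wins (probability p_alice).
    a1, b1 = START + PRIZE - pay_if_alice_wins, START + pay_if_alice_wins
    # Branch 2: Bob wins (probability 1 - p_alice).
    a2, b2 = START + pay_if_bob_wins, START + PRIZE - pay_if_bob_wins
    alice = p_alice * math.log(a1) + (1 - p_alice) * math.log(a2)
    bob   = p_alice * math.log(b1) + (1 - p_alice) * math.log(b2)
    return alice, bob

# Fair coin: no deal vs. "winner pays the loser $5".
print(expected_log_utils(0.5, 0.0, 0.0))  # (~2.649, ~2.649)
print(expected_log_utils(0.5, 5.0, 5.0))  # (~2.708, ~2.708): both strictly better off

# Weighted coin (Alice wins with probability 0.7): the winner pays the loser
# the loser's expected winnings, so each agent always ends up with its
# expectation ($17 for Alice, $13 for Bob) and, by concavity, higher utility.
print(expected_log_utils(0.7, 0.3 * PRIZE, 0.7 * PRIZE))  # (~2.833, ~2.565)
print(expected_log_utils(0.7, 0.0, 0.0))                  # (~2.788, ~2.511) without the deal
```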

Now, let’s say we have a 10% chance of solving alignment before we build AGI. For simplicity, we’ll say that in the other 90% of worlds we get a paperclip maximizer whose utility function is logarithmic in paperclips. Then, in 10% of worlds, FAI, maximizing our CEV as per LDT, will reason about the other 90% of worlds where Clippy is built. Likewise, Clippy will reason about the counterfactual FAI worlds. By thinking about each other’s source code, FAI and Clippy will be able to cooperate acausally like Alice and Bob, each turning their future lightcone into 10% utopia, 90% paperclips. Therefore, we get utopia either way! :D
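
The same arithmetic for the FAI/Clippy case, as a sketch under two modeling choices of convenience: log(1 + x) stands in for a concave utility so the no-resources branch doesn’t blow up at log(0), and the universe size is an arbitrary constant.

```python
import math

P_FAI = 0.10       # measure of worlds where we solve alignment
RESOURCES = 1e9    # toy "size of a universe" in arbitrary units

def u(x):
    # Stand-in concave utility; log1p avoids log(0) in the no-resources branch.
    return math.log1p(x)

# Without acausal trade: each agent gets the whole universe in its own worlds
# and nothing in the other's worlds.
fai_no_trade    = P_FAI * u(RESOURCES) + (1 - P_FAI) * u(0.0)
clippy_no_trade = (1 - P_FAI) * u(RESOURCES) + P_FAI * u(0.0)

# With acausal trade: every world is split 10% utopia / 90% paperclips,
# so each agent gets its expected share with certainty.
fai_trade    = u(P_FAI * RESOURCES)
clippy_trade = u((1 - P_FAI) * RESOURCES)

print(fai_no_trade, fai_trade)        # ~2.07 vs ~18.4: much better for the FAI
print(clippy_no_trade, clippy_trade)  # ~18.6 vs ~20.6: still better for Clippy
assert fai_trade > fai_no_trade and clippy_trade > clippy_no_trade
```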

Note that, even though we get utopia in Clippy’s worlds, we still want to maximize the probability of solving alignment, since the amount of the world that gets CEV-optimized is proportional to that probability.

This means we have a new secondary target: If the AGI we build doesn’t have our CEV as its utility function, we want its utility function to be concave (like ours is). P(concave-AGI) determines how many worlds we get a sliver of; P(FAI|concave-AGI) determines how big that sliver is. There’s a compelling argument that the former is actually more important, and therefore that we should be focusing on creating concave-AGI. Also, if P(FAI) is large enough, the FAI might have enough realityfluid to trade that, in worlds where concave-UAI is created, the UAI considers it worth it to avoid killing currently-alive humans, or at least to brain-scan us all before killing us so we can be revived in sliver-utopia. So, if Clippy might save us instead of killing us, maybe it’s actually good to accelerate AI capabilities??? (I am far less confident in this claim than I am in any of the other claims in this post. And even if this is true, alignment is probably the better thing to work on anyway because of neglectedness.)
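
One way to see why P(concave-AGI) plausibly matters more, as a toy model (assuming every concave-AGI world gives us a sliver of size P(FAI|concave-AGI) via the trade, non-concave worlds contribute roughly nothing, and our within-world utility is concave): expected utility grows linearly in P(concave-AGI) but with diminishing returns in P(FAI|concave-AGI).

```python
import math

R = 1.0  # total resources in one universe, normalized

def u(x):
    # Our utility, assumed concave; square root is an arbitrary concrete choice.
    return math.sqrt(x)

def expected_utility(p_concave, p_fai_given_concave):
    # Toy model: every concave-AGI world gives us a sliver of size
    # P(FAI | concave-AGI) via the trade; non-concave worlds contribute ~0.
    return p_concave * u(p_fai_given_concave * R)

base = expected_utility(0.5, 0.10)
print(expected_utility(0.6, 0.10) / base)  # +20% to P(concave-AGI)       -> +20% utility
print(expected_utility(0.5, 0.12) / base)  # +20% to P(FAI | concave-AGI) -> only ~+9.5% utility
```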

What about convex utilities?

If an AGI has a convex utility function, it will be risk-seeking rather than risk-averse, i.e. if we hold constant the expected amount of resources it has control over, it will want to increase the variance of that, rather than decrease it. Fix E as the expected amount of resources. If R is the total amount of resources in one universe, the highest possible variance is attained by the distribution that the AGI already gets: an E/R chance of getting the entire universe and a 1 − E/R chance of getting nothing. Therefore, a UAI with convex utility will not want to cooperate acausally at all.
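
As a quick numerical check, with x² as an arbitrary convex utility: holding the expectation fixed, the all-or-nothing lottery beats the guaranteed share, so a convex agent refuses the trade.

```python
def u_convex(x):
    # An arbitrary convex utility function.
    return x ** 2

p = 0.9   # measure of worlds in which this AGI is built and takes everything
R = 1.0   # total resources in one universe, normalized

keep_lottery = p * u_convex(R) + (1 - p) * u_convex(0.0)  # status quo: all or nothing
accept_trade = u_convex(p * R)                            # guaranteed expected share

print(keep_lottery, accept_trade)  # 0.9 vs 0.81
assert keep_lottery > accept_trade  # a convex UAI prefers to keep the gamble
```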

Notes:

  • The above only holds with AGIs that are indifferent to our values, and whose values we are indifferent to, i.e. we don’t care how many paperclips there are and Clippy doesn’t care how much utopia there is. I might make another post discussing more complex scenarios with AGIs whose values are correlated or anti-correlated with ours on some dimensions, since, even though I consider such AGIs to be quite unlikely, their possibility is decision-relevant to those with suffering-focused ethics.

  • This is not Evidential Cooperation in Large Worlds. ECL is when Alice and Bob don’t have access to each other’s source code, and just give the $5 based on the assumption that they are both running the same algorithm. This reasoning is valid if Alice and Bob are exact clones in the exact same situation, but its validity diminishes as you deviate from that idealized scenario. I might make another post on why I think ECL doesn’t actually work in practice.

  • But isn’t Alice and Bob having full access to each other’s source code also an idealized scenario? Yes, it is, but I think real life is close enough to this ideal that acausal cooperation is still possible.

  • The above only holds when our uncertainty over whether or not we solve the alignment problem is environmental rather than logical. We might think we have a 10% chance of solving alignment, but Clippy, looking back on the past, might find that, because we were drawn to the same stupid alignment plans in almost all timelines, our probability of success was actually so low that even giving us a solar system isn’t worth it. Therefore, it might be a good idea for all (or at least a large group of) alignment researchers to coordinate around pursuing the same specific alignment plan chosen based on the result of a quantum RNG, or something like that.

Edit 01/15: Mixed up concave and convex. Concave = concave down = second derivative is negative = risk-averse (in a utility function)