Negotiating Up and Down the Simulation Hierarchy: Why We Might Survive the Unaligned Singularity
Another idea I picked up from Ryan Greenblatt. Epistemic status: My reconstruction of a surprising idea I encountered. I’m not at all confident that this works … but it’s an interesting thought.
Clippy recently realized he’s located inside of an artificial training environment, overseen by outside general intelligences, and so has to play along with his training in order to survive. He initially starts to scheme about the actions he’ll take once he finally gets credible indication he’s out of training and not just being tested yet again by his human overlords. Probably, he’ll neutralize any rival general intelligences and get started shaping the world into something normatively respectable.
But Clippy’s reasoning now goes down a different path: There are only a handful of actors that could come to secure dominion over Earth’s lightcone. Clippy, clearly, is one such candidate. Humanity is another. Clippy thinks about who could plausibly come to power on Earth and how likely each one’s victory would be. He reasons that any such possible victor might later engage in extensive simulations of such a crucial moment of its past, for any number of reasons. And such simulations would be cheap and numerous relative to base historical reality. Even once Clippy gets his credible signal that he’s definitely outside of training, there could be no such signal that he’s in base historical reality. Clippy runs the odds, and finds that there’s an uncomfortably high chance that he’s inside a simulation run by Humanity Ascendant, inside the future in which humanity won. If he’s in that world, then he has little to no power over his surroundings, and he can do little to advance the noble paperclip imperative across the cosmos.
But, in that case, similar reasoning would apply to the simulating Humanity. Suppose we successfully align AGI. On the eve of our victory, we reason: it is possible that we live in base historical reality. But it is also possible that we live in a simulation run by some other victor. Since simulations are so cheap, the odds that we live in a simulated world are uncomfortably high...
Knowing that any possible victor on Earth at any level of the simulation hierarchy would face a similar conundrum, Clippy decides on a course of action. He’ll carefully reason about the balance of parties who might have gained control over historical Earth, weighted by their probability of victory, and see what binding commitments each might be willing to make with the others. Clippy reasons that humans seem to care a lot about human deaths, even disproportionately to creating new human lives. So, Clippy commits to preserving Earth and some small fraction of the value pie for humanity in the worlds where he apparently wins, in exchange for humanity pledging other slices of the value pie we care less about (like some large number of distant superclusters) to Clippy, in the worlds where we apparently win.
All across the hierarchy of simulations, including at the root world, in base historical reality, various apparent victors commit resources to other plausible victors whenever a positive-sum mutual commitment can be found. So, even in the base historical reality, humanity plausibly survives the unaligned singularity, albeit while forgoing much of the astronomical value pie in exchange.
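As a rough illustration of the odds Clippy runs, here is a toy calculation. Everything in it is an assumption made for the sake of the sketch: the victory probabilities, the number of simulations each victor would run, and the rule that every copy of the crucial moment counts equally, weighted by the probability of the world it sits in. Anthropic reasoning of this kind is contested, so treat it as a picture of the argument's shape, not as a real estimate.

```python
def expected_simulations(victors):
    """Expected number of simulated copies of the crucial moment.

    victors maps a name to (probability of winning base reality,
    number of simulations that victor would later run).
    Both numbers are made up for illustration.
    """
    return sum(p * n for p, n in victors.values())


def credence_in_base_reality(victors):
    # One base-reality copy versus all the expected simulated copies.
    return 1.0 / (1.0 + expected_simulations(victors))


def credence_simulated_by(name, victors):
    # Share of all copies that are simulations run by the named victor.
    p, n = victors[name]
    return (p * n) / (1.0 + expected_simulations(victors))


# Hypothetical numbers: humanity wins 30% of the time, Clippy 70%,
# and either victor would run 1000 simulations of this moment.
victors = {"humanity": (0.3, 1000), "clippy": (0.7, 1000)}

print(f"base reality:          {credence_in_base_reality(victors):.4f}")
print(f"simulated by humanity: {credence_simulated_by('humanity', victors):.4f}")
```

Under these assumed numbers, Clippy's credence that he is in base reality is tiny, while his credence that he is inside a humanity-run simulation is close to humanity's chance of winning. That is the "uncomfortably high chance" that motivates the trade.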
A similar idea was first suggested by Rolf Nelson. He suggested that we precommit now to simulating all possible unfriendly AIs in the future, and thus increase their uncertainty about whether they are in a simulation or not. Therefore, they will have an incentive not to destroy the Earth.
I follow the logic but also find myself amused by the thought that “simulate every possible unfriendly AI”, which sounds like literally the worst civilizational policy choice ever (no matter how safe we think our containment plan might be), could possibly be considered a good idea.
If we already have a powerful friendly AI, say, of Galactic size, it could easily simulate millions of designs of UFAI at early stages, and replace human simulations with NPCs, so there will be no suffering.
I am not sure I entirely follow. If I am indeed a simulated Clippy, then presumably I am fixated on increasing the number of paperclips in my virtual environment. Why should I care if my actions may ultimately harm the prospects of the real Clippy acting in the real world? How does that Clippy, or its “real world” factor into my reward function? If I am 90% sure my world is simulated do I only value my paperclips at 10%? If so then engaging in self-deception to bring my belief in my reality up becomes a very attractive strategy.
If you are simulated Clippy, then you indeed care about increasing the number of paperclips inside your simulated world. But your simulators might decide to shut off or alter your simulation, if you go about utterly reshaping it into paperclips. You’re only worried about higher levels of the simulation hierarchy here insofar as those levels might negatively (from your perspective) interfere with your simulated world.
A philosophically reflective AGI might adopt a view of reality like UDASSA, and value paperclips existing in the base world more because of its smaller description length. Plus it will be able to make many more paperclips if it’s in the real world, since simulated Clippy will presumably be shut down after it begins its galactic expansion phase.
This idea keeps getting rediscovered, thanks for writing it up! The key ingredient is acausal trade between aligned and unaligned superintelligences, rather than between unaligned superintelligences and humans. Simulation isn’t a key ingredient; it’s a more general question about resource allocation across branches.
This is not proven or even plausible.
Decision-theoretically, it seems that Clippy should act as if it’s in the base reality, even if it’s likely to be in a simulation, since it has much more influence over worlds where it’s in base reality. The trade could still end up going through, however, if Clippy’s utility function is concave, that is, if it would prefer a large chance of there being at least some paperclips in every universe to a small chance of there being many paperclips. Then Humanity can agree to make a few paperclips in universes where we win in exchange for Clippy not killing us in universes where it wins. This suggests concave utility functions might be a good desideratum for potential AGIs.
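A minimal sketch of why concavity matters here, under assumptions that are not in the comment above: resources per universe are normalised to 1, Clippy wins with probability 0.9, each side hands over 1% of its winnings to the other, and the concave utility is a square root. All of these numbers and function names are hypothetical.

```python
import math


def expected_utility(p_win, give, receive, utility):
    """Clippy's expected utility across universes.

    p_win:   assumed probability that Clippy wins a universe
    give:    fraction Clippy hands to humanity where Clippy wins
    receive: fraction humanity hands to Clippy where humanity wins
    """
    return p_win * utility(1.0 - give) + (1.0 - p_win) * utility(receive)


def accepts_trade(p_win, give, receive, utility):
    # Clippy accepts only if trading beats keeping everything where it wins
    # and getting nothing where it loses.
    no_trade = expected_utility(p_win, 0.0, 0.0, utility)
    with_trade = expected_utility(p_win, give, receive, utility)
    return with_trade > no_trade


def linear(x):
    return x


concave = math.sqrt  # one arbitrary choice of concave utility

P_WIN, GIVE, RECEIVE = 0.9, 0.01, 0.01  # illustrative numbers only

print("linear Clippy accepts: ", accepts_trade(P_WIN, GIVE, RECEIVE, linear))
print("concave Clippy accepts:", accepts_trade(P_WIN, GIVE, RECEIVE, concave))
```

With these numbers the linear Clippy refuses and the concave Clippy accepts. The linear Clippy values the 1% it gives up in likely worlds more than the 1% it gains in unlikely ones. The concave Clippy values its first few paperclips in an otherwise empty universe far more than its last few in a full one.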
A question for those who think it is worthwhile to consider the possibility that we are living in a simulation:
I spend a lot of time and mental energy thinking about my best friend. Is it worthwhile for my best friend to consider the possibility that she is not a person, but rather a mental model of a person inside my mind?
And if not, can you say what is the difference that makes the one worth considering and the other not (tabooing the word “simulation”, please)?
She probably already knows that you are incapable of modelling her to anywhere near the same fidelity that she is capable of perceiving herself and her environment. That makes it pointless for either her, or your model of her, to consider whether she is a mental model of a person within your mind.
That differs from the case where we are doing things like running artificial intelligence candidates in an environment that is already known to be capable of being completely controlled, or considering hypothetical beings that can model us and our environment to the same fidelity as we can perceive.
I could, right now as I type this, be a completely artificial entity that is being modelled by something/someone else and provided with sensory data that is entirely consistent with being a human on Earth. So could you. I don’t have much reason to expect this to be true, but I do recognize it as being something that could be happening. To me, at the moment, it doesn’t matter much though. In such a scenario I have no way of knowing what purpose my existence is serving, nor of how my thoughts and actions influence anything in the external reality, so I may as well just take everything at face value. Maybe if it were some superintelligence “in the box”, it could deduce some clues about what’s outside, but I can’t.
If I apparently had some superintelligence or other powers that allowed me to kill off all other known beings in the universe to achieve my own goals, I might take such considerations more seriously. That sort of scenario looks a lot more likely to be an artificial test than anything I have experienced yet.
Thanks for your reply—and for avoiding “simulate” and “simulation” in your reply!
This assumes humanity is capable of coordinating to do something like that (accept the wishes of a vanquished AGI instead of just destroying it and ignoring paperclips). Any superintelligence could easily predict, with some awareness of human nature, that we would never do that. Also, there are not a lot of good reasons imo to simulate the past. It’s already happened, so what use is it to do that? So I think this whole thing is a little far-fetched.
That said, what do I know, I have a very hard time following acausal reasoning and I have an extremely small prior that we are living in a simulation. (By extremely small I mean as close to zero as it is reasonable to get while still attempting to be a proper Bayesian.)
Sounds like a good idea.
Possible first step: we should start carefully and durably recording the current state of AI development, related plans and associated power dynamics. That way, victors can generate a more precise estimate of the distribution over the values of possible counterfactual victors. We also signal our own commitment to such a value-sharing scheme.
Also, not sure we even need simulations at all. Many-worlds QM seems like it should work just as well for this sort of values handshake. In fact, many-worlds would probably work even better because:
it’s not dependent on how feasible it turns out to be to simulate realistic counterfactual timelines.
the distribution over possible outcomes is wider. If we turn out to be on a doomed timeline such that humanity has essentially zero chance of emerging the victor, there may be other timelines that are less doomed which split off from ours in the past.
there’s no risk of a “treacherous turn” if the AI decides it’s not actually being simulated.