Better a Brave New World than a dead one

[Note: There be massive generalizations ahoy! Please take the following with an extremely large grain of salt, as my epistemic status on this one is roughly 🤨.]

What are the odds of actually solving the alignment problem (and then implementing it in a friendly AGI!) before it’s too late? Eliezer Yudkowsky and Nick Bostrom both seem to agree that we are unlikely to create friendly AGI before the Paperclip Maximizers arrive (as per Scott Alexander’s reading of this discussion). Of course, the odds being stacked against us is no excuse for inaction. Indeed, the community is working harder than ever, despite (what I perceive to be) a growing sense of pessimism regarding our ultimate fate. We should plan on doing everything possible to make sure that AGI, when developed, will be aligned with the will of its creators.

However, we need to look at the situation pragmatically. There is a significant chance we won’t fully succeed at our goals, even if we manage to implement some softer safeguards. There is also a chance that we will fail completely. We need to at least consider some backup plans: the AGI equivalent of a “break glass in case of fire” sign. We never want to be in a situation where the glass must be broken, but under some extremely suboptimal conditions, a fire extinguisher will save lives. So let’s say the fire alarm starts to ring (if we are lucky), and the sprinklers haven’t been installed yet. What then?

One possible answer: Terrorism! (Please note that terrorism is NOT actually the answer, and I will argue against myself in a few sentences. If this puts me on a list somewhere: oops.) Given that we’re talking about a world in which the alignment problem has not been solved by the time we reach the Singularity (which, for the sake of discussion, let’s assume will indeed happen), we will not be able to trust any sufficiently advanced AI with significant confidence. Even an indirect exchange of information with an unaligned AGI could be a massive infohazard, no matter how careful we think we’re being. The safest route at that point would seemingly be to destroy any and all AGI. The only problem is that by the time superintelligence arises (which may or may not coincide with the first AGIs), we will be outsmarted, and any action on our part is likely to be too late. Instead, a preemptive strike against any and all possible methods of achieving AGI would seem necessary, if one believes the only options are fully aligned AGI or the end of the world. To that end, one might decide to try to brick all CPUs, take over the world, or even end civilization on Earth to buy the universe time, or commit other acts of terror to ensure the Singularity does not come to pass.

Unfortunately(?) for a would-be terroristic rationalist, such a strategy would be quite useless. Stopping technological progress is merely a delaying tactic, and unless the alarm bells are ringing so loudly that everyone on Earth can hear them (at which point it would almost certainly be far too late for any human action to stop what’s coming), terroristic action would vastly increase the likelihood that when AGI is developed, it will be developed without our input. No matter how important you think your cause for breaking the internet is, good luck explaining it to anyone outside your ingroup, or to your local police force for that matter. So KILL ALL ROBOTS is effectively out (unless you plan on breaking more than just the Overton window).

What else can we do? One possibility is to aim for a potentially more achievable middle ground between fully aligned and antagonistic AGI. After all, there are many different degrees of terrible in the phase space of all possible Singularities. For instance, given the three-way choice between experiencing nothing but excruciating pain forever, nothing but unending pleasure, or nothing at all, I’m fairly sure the majority of humans would choose pleasure, with perhaps a smaller minority choosing nothing (for philosophical or religious reasons), and only a tiny number choosing excruciating pain (although presumably they would immediately regret it, so per extrapolated volition, perhaps that shouldn’t count at all). As it happens, the (very rough) models we have of what a runaway AGI might do with humanity tend to fall into those three categories fairly neatly. Take three examples, collected somewhat randomly from my memory of past discussions on Less Wrong:

  1. An AI optimized for expressed human happiness might end up simply drugging all humans to continuously evoke expressions of bliss, a dystopian world in which we would nonetheless likely not be unhappy, and may in fact experience a genuinely positive subjective state. This would not be a world many would choose to live in (although those in such a world would invariably claim to prefer their experience to other possible ones), but would almost certainly be better than total human extinction.

  2. An AI optimized to maximize paperclips would likely convert us into paperclips eventually, causing temporary suffering along the way but ultimately resulting in total human extinction.

  3. An AI optimized to maximize the number of living humans in existence would likely end up creating a Matrix-like endless array of humans-in-a-tank, giving each the absolute minimum required to stay alive, with no concern for mental well-being. This would likely be a worse-than-death situation for humanity; extinction would arguably be preferable to living in such a world.

If we have the chance to create an AGI which we know will be poorly aligned, and if inaction is not an option for whatever reason, it seems clear that it’s better to try to steer it closer to option 1 than to options 2 or 3, even at the cost of a progressive future for humanity. It should be reiterated that this strategy is only relevant if everything else fails, but that does not mean we shouldn’t be prepared for such a possibility.

EDIT: I do not actually think that we should try to build an AI that will drug us into a questionably pleasurable, mindless oblivion. Rather, the above post is meant to function as a parable of sorts, provoking readers into contemplating what a contingency plan for a “less horrible,” partially aligned AGI might look like. Please do not act on this post without significant forethought.