Accurate Models of AI Risk Are Hyperexistential Exfohazards

(Where “an exfohazard” is information which leads to bad outcomes if known by a large fraction of society.)

Let us suppose that we’ve solved the technical problem of AI Alignment — i. e., the problem of AI control. We have some method of reliably pointing our AGIs towards the tasks or goals we want, such as the universal flourishing of all sapient life. As per the Orthogonality Thesis, no such method would allow us to only point it at universal flourishing — any such method would allow us to point the AGI at anything whatsoever.

Which means that, if we succeed at the technical problem, there’ll be a moment at the very end of the world as we know it, where a person or a group of people will be making a decision regarding the future of the universe.

Above and beyond preventing an omnicide, we need to ensure that in the timelines where we do solve the technical problem, this decision is made right.

1. Establishing the Framework

The one good thing about the technical problem of alignment is that it makes hyperexistential risks — the risks of astronomical suffering — very unlikely.

The problem of AI Alignment can be viewed as the problem of encoding our preferences into an AGI, bit by bit. The strength of alignment tools, in turn, translates to how many bits we can encode. With the current methods of end-to-end training, we’re essentially sampling preferences at random. Perfect interpretability and parameter-surgery tools would allow us to encode an arbitrary number of bits. The tools we’ll actually have will be somewhere between these two extremes.

“Build us our perfect world” is a very complicated ask, and it surely takes up many, many thousands of bits. That’s why alignment is hard.

“Build us a hell” is its mirror. It’s essentially the same ask, except for a flipped sign. As such, specifying it would require pretty much the same number of bits.

Thus, in the timelines where we have alignment tools advanced enough to build a hell-making AGI, it’s overwhelmingly likely that we have the technical capability to build a utopia-building AGI. On the flipside, conditioning on our inability to build a utopia-builder, our tools are probably so bad we can’t come close to a hell-builder. In that case, we just sample some random preferences, and the AGI kills us quickly and painlessly.

Screwing up so badly we create a suffering-maximizer is vanishingly unlikely: it’s only possible in a very, very narrow range of technical capabilities.
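To make the “narrow range” claim concrete, here is a toy sketch of the bit-counting argument. Everything in it is an illustrative assumption rather than an estimate: the 10,000-bit size of the utopia specification, the 10-bit “near miss” margin, and the idea that capability can be measured as a single bit count at all. It only shows that, under those assumptions, the band of capabilities where a nearly complete specification could come out mirrored is a tiny slice of the whole range.

```python
# Toy model only: every number below is an illustrative assumption.
UTOPIA_BITS = 10_000    # assumed size of "build us our perfect world"
NEAR_MISS_MARGIN = 10   # assumed width of the band just short of full control


def classify(capability_bits: int) -> str:
    """Label what a given bit budget lets us specify in this toy model."""
    if capability_bits >= UTOPIA_BITS:
        # Enough bits for the utopia, and hence also for its mirror image.
        return "full control"
    if capability_bits >= UTOPIA_BITS - NEAR_MISS_MARGIN:
        # Almost the whole specification: the only band where an
        # accidental sign flip could plausibly yield a hell.
        return "near miss"
    # Too few bits to aim at anything: preferences are sampled at random.
    return "random preferences"


def near_miss_fraction() -> float:
    """Share of capability levels (0..UTOPIA_BITS) that fall in the near-miss band."""
    levels = range(UTOPIA_BITS + 1)
    hits = sum(1 for bits in levels if classify(bits) == "near miss")
    return hits / len(levels)


if __name__ == "__main__":
    print(classify(0), "|", classify(9_995), "|", classify(10_000))
    print(f"near-miss share of capability levels: {near_miss_fraction():.4%}")
```

With these made-up numbers, the near-miss band covers roughly 0.1% of the capability levels; a different margin changes the figure, but not the shape of the argument.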

Note the emphasis, though: technical capabilities.

The question of how these capabilities are used is an entirely different one.

2. How Bad Can It Be?

The preferences of most people, and most groups of people, are not safe to enforce upon the future.

It’s trivially true if we look at history: non-secular countries would consider it Just to see billions of people eternally tortured in their gods’ hells, xenophobic nations would genocide their enemies, power-hungry sociopaths ruling moral mazes would instinctively wish to erase all value from the universe, and so on.

Less than a century ago, most “progressive countries” would’ve erased today’s protected minorities. And even the random sample of contemporary “good guys” would probably be happy to do something extremely monstrous to, say, billionaires or rapists or something.

Then there’s the problem of “human chauvinism”. Not even mundane humanism is safe to maximize, inasmuch as it may exclude sapient animals, uploads, AIs, or aliens.

The bottom line is, most people are not drunk on overly nuanced sci-fi-ish altruistic philosophy. “What is the universal good, defined mathematically?” is a question that’s extremely off-distribution for them; most didn’t spend a minute in their life seriously contemplating it, and have none of the requisite background. They don’t have coherent preferences over the far-off future, and if forced to compile them on the spot, they’d produce something incoherent and pretty bad. Worse than annihilation.

And that’s not even taking into account people who’ve actually been selected for monstrous qualities, or incentivized to develop them. And such people are most likely to end up in power, inasmuch as power is correlated with signaling ruthlessness or radical patriotism, and being willing to climb to the top while trampling others underfoot.

So we can’t trust major systems, we can’t trust the public consensus, we can’t trust random people, and we can’t trust random powerful people.

As such, there’s a major sociopolitical dimension to AI Risk — beyond ensuring that we can point the AGI at the utopia, we need to ensure that the AGI actually ends up pointed at it.

Otherwise, getting AI Alignment solved would be much worse than staying back and letting humanity paperclip itself out of existence.

3. What Does This Mean For Our AI Policy Work?

To be clear, that doesn’t mean that I think we should stop all sociopolitical activism. We just need to be careful about it. The specific outcomes we want to avoid are:

  • The higher echelons of some government or military develop an accurate model of AI Risk.

    • They’d want to enforce their government’s superiority, or national superiority, or ideological superiority, and they’d trample over the rest of humanity.

    • There are no eudaimonia-interested governments on Earth.

  • The accurate model of AI Risk makes its way into the public consciousness.

    • The “general public”, as I’ve outlined, is not safe either. And in particular, what we don’t want is some “transparency policy” where the AGI-deploying group is catering to the public’s whims regarding the AGI’s preferences.

    • Just look at modern laws, and the preferences they imply! Humanity-in-aggregate is not eudaimonia-aligned either.

  • A large subset of wealthy or influential people not pre-selected by their interest in EA/LW ideas form an accurate model of AI Risk.

    • We’d either get some revenue-maximizer for a given corporation, or a dystopian dictatorship, or some such outcome.

    • And even if the particular influential person is conventionally nice, we get all the problems with sampling a random nice individual from the general population (the off-distribution problem).

By “accurate model” here I mean “the orthogonality thesis + the real power of intelligence + the complexity of human preferences”. The model with enough gears that it’d allow people to ask, “aligned to whose preferences?”, and then wonder if maybe it can be their personal preferences.[1]

Note what this doesn’t exclude:

  • Communicating a more opaque model of AI Risk to politicians/the public. A model that just tells people that scaling up capabilities will lead to bad outcomes, without a particularly nuanced understanding of why.

    • This should still be sufficient to slow down the timelines, and pass regulations controlling AI development. The current uproar about AI art theft is a good example. It’s not even in the neighborhood of “hey, can we use this to create an Old Testament God?”, yet it can lead to semi-effective regulations.

  • Convincing select powerful individuals, e. g. the leadership of major AI Labs, or the most prominent AI researchers.

    • They weren’t strongly pre-selected for monstrousness or an adherence to some outdated ideology, and are plausibly both (1) at-least-conventionally-nice and (2) willing to listen to philosophical arguments.

    • If the system they’re embedded in won’t go crazy trying to pressure them to develop e. g. a revenue-maximizer, they’ll probably be free to use their actual moral reasoning to decide on the AGI’s preferences.

    • (Not that I’m saying it’s completely safe. See: the recent case where an EA-aligned billionaire turned out to be… not conventionally nice.)

Also: transferring accurate models is pretty hard, actually. Most people don’t take ideas seriously, and the more abstract an idea is, the more unlikely it is to be taken seriously. I think it’s a major factor in why so many economically profitable technologies are successfully restricted. And inasmuch as “AI is dangerous” is much less abstract than “AI can be used to impose arbitrary values upon the future”, it actually shouldn’t be too difficult to increase the timelines without increasing the hyperexistential risk at the same time.

But one should still be mindful not to succeed too hard.

4. Wait, So Who Should Be In Charge Then?

A small group of philosophically-erudite, altruistically-minded people, probably sampled from this community.

No, I don’t like the optics on that either. It irks my aesthetic senses. It makes me feel like a non-genre-savvy supervillain, especially when I concretely imagine that future of building a doomsday device in secret in my basement.

But, like, it seems that any other approach is much more likely to lead to bad outcomes, after thinking it through at the object level?

One point to make, here:

Power Does Not Corrupt

I think “power corrupts” is factually incorrect, as platitudes go. It’s almost paradoxical: how exactly can a boost to your ability to enforce your preferences upon reality make these preferences worse? And, by implication, does that mean we should expect the powerless to be most un-corrupted? The people struggling to make ends meet, driven mad by hunger and deprivation, oppressed and trampled-over — we should expect them to be paragons of ascended philosophical virtue?

No, what corrupts isn’t power. What corrupts is the road to power, and what one has to do to keep power.

  1. The powerful are pre-selected to be the sort of people who primarily optimize for getting power. Thus, they’re much more likely than average to be the kinds of people who’d use the handles of knives embedded in others’ backs as handholds.

  2. Once you have power, it’s a constant struggle to maintain that power while protecting yourself from the power-hungry individuals described in (1). Even if you started out decent, it’s pretty difficult not to sink to their level. And if you started out bad, why, you’d just get worse.

This is why most people in power are corrupt. Not because “power” has a magical property of turning people into monsters.

And the people who’d acquire absolute power over the future in this hypothetical would acquire it by very different means, compared to the usual. And they would not need to hold onto it.

So, while it’s not guaranteed that they’d be nice, there at least isn’t any prior reason to think they’d be evil. Updating off “they’re in power” is incorrect, here: corruption is correlated with power, but not caused by it.

… Which is not to say our community is safe. We’re as vulnerable to being taken over by power-seekers as any other group.

This is an additional reason not to broadcast how much potential power lies down this road.

5. Critique of Specific Ideas

Long Reflection

There’s a plan that goes:

  • Figure out “strawberry alignment” — i. e., how to make an AI pursue some concrete, localized goal like “synthesize a single strawberry”, without committing omnicide and e. g. tiling the universe with strawberries or weird upstream-strawberry-correlates.

    • This is contrasted with more complex goals like “build a utopia”, which combine the difficulty of AI control with the philosophical difficulty of “what even is a utopia?”.

  • Use this weakly-aligned AI to “end the acute risk period” — somehow slow down or halt unsafe AGI research world-wide.

  • There’s some exit condition on this research ban: maybe it’s lifted after a century, maybe some person or group of people have the authority to lift it, maybe there’s some other recognition function on when it’s fine to do so.

  • It’s implied that, once the ban is lifted, humanity has matured enough to figure out its preferences and build an AGI implementing them safely.

First off, I’m skeptical that “strawberry alignment” is a thing. “Create a strawberry” is deeply value-laden in itself: it includes all the clauses like “but don’t murder people over it”, what a “real strawberry” means, etc. I think if we can get the AGI to do that, if we can encode this many bits of preferences into it, we can probably just say “build a utopia” and have that command be safely executed too. The AGI will either know what we really mean, or help us figure it out.

However, if this is possible, I think this just leads to us building a hell. An AGI that can’t build a utopia can’t distinguish a hell from a utopia, so the recognition function on “what preferences should we enforce upon the future” is implemented by the entire humanity, and...

I think about it as… Imagine a list of existential and hyperexistential risks, ordered by probability-of-occurrence. This scheme doesn’t somehow resolve AI Risk, in a complex way that updates the probabilities of all the risks below it. It just strikes this first item off the list.

And I think what we have just below it is “Eternal-Dystopia Risk, probability 90%+”.

So we end the acute risk period, and then immediately find ourselves in a totalitarian hellscape where sub-AGI drone swarms quell any rebellion, trillions of simulated humans are exploited for cheap labor, cults with procedurally-generated synthetic ideologies burn through minds like wildfire, etc., etc.

And then the condition for lifting the ban on AGI research is met, and all these marvels are enforced upon the future forever.

Again, this is worse than just sitting back and letting omnicide happen. So if it’s possible to strawberry-align but not utopia-align an AGI, and you face the choice between proliferating strawberry-alignment and doing nothing, it’s better to do nothing.

These Arms Race Models

Point 1: I think the people in charge of such decisions aren’t going to be using nuanced rational models like this. They weren’t an accurate description of governments’ thinking regarding nuclear MAD, and they won’t be an accurate description of the AGI race.

In particular, I expect no-one is going to pay attention to the “safety generalization” parameter. For our work to be used to help these heathens lock-in their barbaric values? No, better classify all of it!

Point 2: If the Powers that Be do coordinate to finish alignment research before implementing their AGI, and so succeed at aligning their AGI with their values, that would be a hyperexistential catastrophe.

If the relevant players know that what they’re racing over isn’t just a weapon, but the entirety of the future, then humanity has already lost.

Dying With Dignity

It’s fine to maximize for “death with dignity”, i. e. to attempt to increase the log odds of humanity’s survival… if you think that “not dying” is always preferable.

But, uh. What if you’re successful? What if you get enough dignity points not to die...

And then survive in an undignified manner?

As above: better to die, I think.

6. Conclusion

I don’t think we should give up and aim for an omnicide!

I think it’s totally possible to get the sociopolitical part of the problem right! Especially in the possibility-branch where we succeed at technical alignment. There aren’t many actors, right now, who are projected to eventually get the capability to deploy an AGI, and they’re not controlled by anti-altruistic people, and there aren’t (to my knowledge) any powerful anti-altruistic organizations that take AI Alignment ideas seriously! We can totally get this done.

But to do that, we should shut up about some aspects of the problem. Public proliferation of accurate models of AI Risk is not conducive to a marvelous future.

Raising the awareness of generic “dangers of AI capabilities”, and inviting funding towards generic “AI Safety research”? Sure, that’s fine. And also much easier than actually transferring a nuanced understanding! In fact, transferring an accurate model is probably so difficult you shouldn’t need to worry about accidentally doing it at all! (I even approve of the general message of e. g. this post.)

But if you do find a way to greatly increase the timelines at the cost of cluing a lot of people in regarding what’s really on offer here, regarding what the good outcome can be, regarding the fact that “AI Alignment” research is actually “AI Control” research, don’t do it. There are fates worse than death, and you’d be beckoning them.

… I’m not sure I should’ve even written this post, to be honest. I think it’d be pretty bad if “should we be supervillains trying to unilaterally steer the future of humanity?” becomes a frequent part of discourse. And of course I’m also spelling out here why all the governments/corporations/sociopaths should be looking in this direction, and while the effect should be very small, I’m not exactly redirecting their attention.

But the sentiment that we should do more public outreach has been picking up this year, and I think it’d lead to worse-in-expectation outcomes if I don’t present this counter-argument.

I am also acutely aware that this potentially stokes the flames of adversity within the EA/LW community, between those who’d disagree with me and those who’d agree.

But, yeah. I think we should all shut up about certain matters.

  1. ^

    There’s an objection here that goes, “but come on, the person who’d be actually coding-in the AI’s preferences won’t be the embodied avatar of the Government/the Will of the Public/the power-hungry sociopath personally, it’d be some poor ML engineer, and they’d realize what a mistake this is and go rogue and heroically code-in altruistic preferences instead!”. And yeah, that’s totally how it’d go in real life! You know, like how back in early March of this year, a bunch of Russian siloviki and oligarchs realized how much suffering a single person’s continued existence will bring upon the world and upon them personally, and just coordinated to unceremoniously shank him. That happened, right?

    It’s not really how this works.