Why a Human (Or Group of Humans) Might Create UnFriendly AI Halfway On Purpose

First, some quotes from Eliezer’s contributions to the Global Catastrophic Risks anthology, beginning with Cognitive Biases Potentially Affecting Judgment of Global Risks:

All else being equal, not many people would prefer to destroy the world. Even faceless corporations, meddling governments, reckless scientists, and other agents of doom, require a world in which to achieve their goals of profit, order, tenure, or other villainies. If our extinction proceeds slowly enough to allow a moment of horrified realization, the doers of the deed will likely be quite taken aback on realizing that they have actually destroyed the world. Therefore I suggest that if the Earth is destroyed, it will probably be by mistake.

And from Artificial Intelligence as a Positive and Negative Factor in Global Risk:

We can therefore visualize a possible first-mover effect in superintelligence. The first-mover effect is when the outcome for Earth-originating intelligent life depends primarily on the makeup of whichever mind first achieves some key threshold of intelligence—such as criticality of self-improvement. The two necessary assumptions are these:

• The first AI to surpass some key threshold (e.g. criticality of self-improvement), if unFriendly, can wipe out the human species.
• The first AI to surpass the same threshold, if Friendly, can prevent a hostile AI from coming into existence or from harming the human species; or find some other creative way to ensure the survival and prosperity of Earth-originating intelligent life.

More than one scenario qualifies as a first-mover effect. Each of these examples reflects a different key threshold:

• Post-criticality, self-improvement reaches superintelligence on a timescale of weeks or less. AI projects are sufficiently sparse that no other AI achieves criticality before the first mover is powerful enough to overcome all opposition. The key threshold is criticality of recursive self-improvement.
• AI-1 cracks protein folding three days before AI-2. AI-1 achieves nanotechnology six hours before AI-2. With rapid manipulators, AI-1 can (potentially) disable AI-2’s R&D before fruition. The runners are close, but whoever crosses the finish line first, wins. The key threshold is rapid infrastructure.
• The first AI to absorb the Internet can (potentially) keep it out of the hands of other AIs. Afterward, by economic domination or covert action or blackmail or supreme ability at social manipulation, the first AI halts or slows other AI projects so that no other AI catches up. The key threshold is absorption of a unique resource.

I think the first quote is exactly right. But it leaves out something important. The effects of someone’s actions do not need to destroy the world in order to be very, very harmful. These definitions of Friendly and unFriendly AI are worth quoting (I don’t know how consistently they’re actually used by people associated with the SIAI, but they’re useful for my purposes):

A “Friendly AI” is an AI that takes actions that are, on the whole, beneficial to humans and humanity; benevolent rather than malevolent; nice rather than hostile. The evil Hollywood AIs of The Matrix or Terminator are, correspondingly, “hostile” or “unFriendly”.

Again, an action does not need to destroy the world to be, on the whole, harmful to humans and humanity; malevolent rather than benevolent. An assurance that a human or humans will not do the former is no assurance that they will not do the latter. So if there ends up being a strong first-mover effect in the development of AI, we have to worry about the possibility that whoever gets control of the AI will use it selfishly, at the expense of the rest of humanity.

The title of this post says “halfway on purpose” instead of “on purpose,” because in human history even the villains tend to see themselves as heroes of their own story. I’ve previously written about how we deceive ourselves so as to better deceive others, and how I suspect this is the most harmful kind of human irrationality.

Too many people—at least, too many writers of the kind of fiction where the villain turns out to be an all-right guy in the end—seem to believe that if someone is the hero of their own story and genuinely believes they’re doing the right thing, they can’t really be evil. But you know who was the hero of his own story and genuinely believed he was doing the right thing? Hitler. He believed he was saving the world from the Jews and promoting the greatness of the German Volk.

We have every reason to think that the psychological tendencies that created these hero-villains are nearly universal. Evolution has no way to give us nice impulses for the sake of having nice impulses. Theory predicts, and observation confirms, that we tend to care more about blood relatives than mere allies, and more about allies than strangers. As Hume observed (remarkably, without any knowledge of Hamilton’s rule), “A man naturally loves his children better than his nephews, his nephews better than his cousins, his cousins better than strangers, where every thing else is equal.” And we care more about ourselves than about any single other individual on the planet (even if we might sacrifice ourselves for two brothers or eight cousins).
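
A quick aside for readers who haven’t seen Hamilton’s rule, since the “two brothers or eight cousins” quip depends on it. This is a minimal sketch using the standard textbook statement of the rule and the usual relatedness coefficients; none of the numbers below come from the quoted chapters. Hamilton’s rule says natural selection favors an altruistic act when

rB > C

where r is the coefficient of relatedness between actor and beneficiary, B is the fitness benefit to the beneficiary, and C is the fitness cost to the actor. Plugging in r = 1/2 for a full sibling and r = 1/8 for a first cousin, sacrificing your own life just breaks even when it saves two brothers (2 × 1/2 = 1) or eight cousins (8 × 1/8 = 1), which is the arithmetic behind Haldane’s famous quip.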

Most of us are not murderers, but then most of us have never been in a situation where it would be in our interest to commit murder. The really disturbing thing is that there is ample evidence that ordinary people can become monsters as soon as the situation changes. Science gives us the Stanford Prison Experiment and Milgram’s experiment on obedience to authority; history gives us even more disturbing facts about how many soldiers commit atrocities in wartime. Most of the soldiers who came from societies where atrocities are frowned on must have seemed perfectly normal before they went off to war. Probably most of them, if they’d thought about it, would have sincerely believed they were incapable of doing such things.

This makes a frightening amount of evolutionary sense. There’s reason for evolution to give us, as much as possible, conditional rules for behavior, so that we only do certain things when doing so is fitness-increasing. Normally, doing the kinds of things done during the Rape of Nanking leads to swift punishment, but the circumstances in which such things actually happen tend to be circumstances where punishment is much less likely: the other guys are trying to kill you anyway, and your superior officer is willing to, at a minimum, look the other way. But if you’re in a situation where doing such things is not in your interest, where’s the evolutionary benefit of even being aware of what you’re capable of?

Taking all this together, the risk is not that someone will deliberately use AI to harm humanity (do it on purpose). The risk is that they’ll use AI to harm humanity for selfish reasons, while persuading themselves they’re actually benefiting humanity (doing it halfway on purpose). If whoever gets control of a first-mover scenario sincerely believed, prior to gaining unlimited power, that they really wanted to be really, really careful not to do that, that’s no assurance of anything, because they’ll have been thinking that before the situation changed and there was a chance for the conditional rule, “Screw over other people for personal gain if you’re sure of getting away with it,” to trigger.

I don’t want to find out what I’d do with unlimited power. Or rather, all else being equal I would like to find out, but I don’t think putting myself in a position where I actually could find out would be worth the risk. And that’s in spite of the fact that my even worrying about these things may be a sign that I’d be less of a risk than other people. That should give you an idea of how little I would trust other people with such power.

The fact that Eliezer has stated his intention to have the Singularity Institute create FOOM-capable AI doesn’t worry me much, because I think the SIAI is highly unlikely to succeed at that. I think if we do end up in a first-mover scenario, it will probably be the result of some project backed by a rich organization like IBM, the United States Department of Defense, or Google.

Forgetting about that, though, this looks to me like an absolutely crazy strategy. Eliezer has said creating FAI will be a very meta operation, and I think I once heard him mention putting prospective FAI coders through a lot of rationality training before beginning the process, but I have no idea why he would think those are remotely sufficient safeguards for giving a group of humans unlimited power. Even if you believe there’s a significant chance that creating FOOM-capable FAI will be necessary for human survival, shouldn’t there, in that case, first be a major effort to answer the question, “Is there any possible way to give a group of humans unlimited power without it ending in disaster?”

More broadly, given even a small chance that the future of AI will end up in some first-mover scenario, it’s worth asking, “What can we do to prevent some small group of humans (the SIAI, a secret conspiracy of billionaires, a secret conspiracy of Google employees, whoever) from steering a first-mover scenario in a direction that’s beneficial to themselves and perhaps their blood relatives, but harmful to the rest of humanity?”