Even if human & AI alignment are just as easy, we are screwed

I get the sense that something like Eliezer’s concept of “deep security” as opposed to “ordinary paranoia” is starting to seep into mainstream consciousness—not as quickly or as thoroughly as we’d like, to be sure. But more people are starting to understand the concept that we should not be aiming for an eventual Artificial Super-Intelligence (ASI) that is constantly trying to kill us, but is constantly being thwarted by humanity always being just clever enough to stay one step ahead of it. The 2014 film “Edge of Tomorrow” is a great illustration of the hopelessness of this strategy. If humanity in that film has to be clever enough to thwart the machines every single time, but the machines get unlimited re-tries and only need to be successful once, most people can intuit that this sort of “infinite series” of re-tries converges towards humanity eventually losing (unless humanity gets a hold of the reset power, as in the film).
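To make that intuition a little more concrete, here is a toy calculation (the per-attempt success probability p is my own illustrative assumption, not a number from the film): if each machine attempt to wipe out humanity succeeds independently with probability p > 0, then

\[
\Pr[\text{humanity survives all } n \text{ attempts}] = (1 - p)^n \;\longrightarrow\; 0 \quad \text{as } n \to \infty .
\]

Even at p = 1%, the chance of surviving 1,000 attempts is roughly (0.99)^{1000} ≈ 0.004%. The re-tries compound multiplicatively, so "staying one step ahead each time" is not a stable long-run strategy.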

Instead, “deep security” when applied to ASI (to dumb the term down beyond a point that Eliezer would be satisfied with) is essentially the idea that, hmmmmm, maybe we shouldn’t be aiming for an ASI that is always trying to kill us and barely failing each time. Maybe our ASI should just work and be guaranteed to “strive” towards the thing we want it to do under every possible configuration-state of the universe (under any “distribution”). I think this idea is starting to percolate more broadly. That is progress.

The next big obstacle that I am seeing from many otherwise-not-mistaken people, such as Yann LeCun or Dwarkesh Patel, is the idea that aligning AIs to human values should, by default (as a baseline prior assumption, unless shown strong evidence to update one’s opinion otherwise), be about as easy as aligning human children to those values.

There are, of course, a myriad of strong arguments against that idea. That angle of argument should be vigorously pursued and propagated. Here, however, I’d like to dispute the premise that humans are sufficiently aligned with each other to give us reassurance even in the scenario in which humans and AIs turned out to be similarly alignable with human values.

There is a strong tendency to attribute good human behavior to intrinsic friendly goals towards other humans, rather than to extrinsic motivations in which good behavior is merely a means towards other ends (such as attaining pleasure, avoiding pain and punishment, and so on). We find it more flattering to ourselves and others when we attribute our good deeds, or their good deeds, to intrinsic tendencies. Assuming extrinsic motivations by default comes across as highly uncharitable.

However, do these widely professed beliefs “pay rent in anticipated experiences”? Do humans act like they really believe that other humans are, typically, by default, intrinsically rather than merely extrinsically aligned?

Here we are not even talking about the (hopefully small-ish) percentage of humans that we identify as “sociopaths” or “psychopaths.” Most people would grant that, sure, there is a small subset of humans whose intrinsic motivations we do not trust, but most typical humans are trustworthy. But let’s talk about those typical humans. Are they intrinsically trustworthy?

To empirically test this question, we would have to place these people in a contrived situation that removed all of the extrinsic motivations for “friendly,” “pro-social” behavior. Is such an experimental setup even possible? I can’t see how. Potential extrinsic motivations for good behavior include things like:

1. Wanting things that, under current circumstances, we only know how to obtain from other humans (intimacy, recognition, critical thinking, manual labor utilizing dextrous 5-digit hands, etc.)

2. Avoiding first-order conflict with another human who might have roughly comparable strength/intelligence/capabilities to retaliate.

3. Avoiding second-order conflict with coalitions of other humans that, when they sufficiently coordinate, most certainly have more-than-comparable strength/intelligence/capabilities to retaliate. These coalitions include:

3a. Formal state institutions.

3b. Informal institutions. In my experience (speaking as a former anarcho-communist), when anarchists imagine humans cooperating without the presence of a state, they often rhetorically gesture to the supposed inherent default pro-sociality of humans. But when you interrogate their proposals, they are actually implicitly calling on assumed informal institutions, standing in for the state, to shoulder a lot of the work of incentivizing pro-social behavior. Those assumed (or sometimes explicitly stated) informal institutions might include workers’ councils, communal councils for neighborhoods below the size of “Dunbar’s Number,” etc. The first place I encountered the realization that something like a “moral economy” was doing a lot of work in most situations where it looked like humans were being “good for goodness’ sake” was a book I had to read in an economics class, “The Moral Economy of the Peasant” by James C. Scott.

3c. Oh, and I almost forgot: I guess this is a huge part of the function of religions too, via both the threat of punishment (and promise of reward) from the human coalition, and (depending on the religious tradition) punishment/reward from an even more powerful and omniscient entity. I put religions in category 3c since they often straddle (depending on the context) the line between a formal and an informal institution.

I grew up in the Unitarian Universalist Church, where the basic, unwritten catechism that I assimilated (insofar as that religious tradition had one) was essentially, “Do good for goodness’ sake” and “charitably expect, by default, that other human beings will do so as well.” If there is anyone who should be prepared to believe in this inherent goodness, and who should have seen ample evidence of it in other humans, it should be me. Yet, in practice, I have yet to find a single human who really makes this purported belief “pay rent” in how they act in the world. Even Unitarians act like they see the need for institutions of some sort or another to enforce pro-social behavior.

The closest thing I can imagine to an experimental setup to test whether a typical human would exhibit pro-social behavior in the absence of extrinsic motivations would be something like:

1. Place oneself on a deserted island with limited resources alongside another human being.

2. Ensure that the other human being does not speak your language.

3. Ensure that the other human being is as alien/​unrelatable to you in terms of personality as another human being could be.

4. Ensure that the other human being is aesthetically displeasing/​alien to you. (Edit: I suppose this point would be more compelling if I made it a little more concrete. For example, imagine that the person has a severe version of Harlequin Ichthyosis or something even worse to the extent that, no matter how much you want to see that person as another human being (or feel like you “should” see that person as another human being), there is a compelling part of your brain that is screaming at you, “What is that thing?! Kill it with fire!”)

5. Ensure that the other human being is completely incapable of helping you achieve any of your other goals in any way (perhaps quadriplegic, or an even more severe handicap). (Basically, what I am getting at with conditions 2-5 is that this human has absolutely nothing to offer you. There is nothing in this world that you value that would not be attainable if this human were to vanish. Once AI becomes ASI, this will be the position of that ASI vis-à-vis humanity.)

6. Ensure that you can be absolutely certain that no information about how you treat this other person will ever leak out to the rest of the world. The rest of the world doesn’t even know this other person exists, and will never, with 100% certainty, find any trace of them. You will never face any formal or informal sanction for mistreating this person, nor any kudos for treating this person well.

Now, we would like to flatter ourselves: “Of course, I am not a monster! Of course I would help this other human being survive!” It would be socially damaging to admit otherwise. But would you? Really? When their food...and their atoms, for that matter...could be used for something else that you value, such as your own survival? If you can be honest with yourself, then you can begin (not exhaustively, but begin) to take the first steps towards putting yourself in the shoes of an ASI deciding what to do about humans.

(Edit: Even better, ask yourself if you’d be willing to play the part of the less capable human in this experiment. No? Why not? You don’t trust another typical human, randomly chosen from the world population, to intrinsically care about taking care of you? You don’t think humans are aligned well enough for that?)

If you conclude that, yeah, maybe it would indeed require the threat of a more-powerful coalition of humans to keep your anti-social behavior towards this person in check, then you must conclude that, even IF we can, by default, align AIs just as easily and well as we align humans, we would have to ALWAYS maintain the threat of a more-powerful coalition to keep the AI’s behavior in check. But then we are right back in the realm of “ordinary paranoia,” which we have wisely ruled out as a viable strategy once we get to the point of Artificial Super-Intelligence, at which point NO possible coalition of humans, even one coordinated across the entire world, is guaranteed to be powerful enough to exert a check on the ASI’s behavior.