Ethical Injunctions

“Would you kill babies if it was the right thing to do? If no, under what circumstances would you not do the right thing to do? If yes, how right would it have to be, for how many babies?”
horrible job interview question

Swapping hats for a moment, I’m professionally intrigued by the decision theory of “things you shouldn’t do even if they seem to be the right thing to do”.

Suppose we have a reflective AI, self-modifying and self-improving, at an intermediate stage in the development process. In particular, the AI’s goal system isn’t finished—the shape of its motivations is still being loaded, learned, tested, or tweaked.

Yea, I have seen many ways to screw up an AI goal system design, resulting in a decision system that decides, given its goals, that the universe ought to be tiled with tiny molecular smiley-faces, or some such. Generally, these deadly suggestions also have the property that the AI will not desire its programmers to fix it. If the AI is sufficiently advanced—which it may be even at an intermediate stage—then the AI may also realize that deceiving the programmers, hiding the changes in its thoughts, will help transform the universe into smiley-faces.

Now, from our perspective as programmers, if we condition on the fact that the AI has decided to hide its thoughts from the programmers, or otherwise act willfully to deceive us, then it would seem likely that some kind of unintended consequence has occurred in the goal system. We would consider it probable that the AI is not functioning as intended, and that we have messed up the AI’s utility function somehow, so that the AI now wants to turn the universe into tiny reward-system counters, or some such, and has a motive to hide this from us.

Well, suppose we’re not going to implement some object-level Great Idea as the AI’s utility function. Instead we’re going to do something advanced and recursive—build a goal system which knows (and cares) about the programmers outside. A goal system that, via some nontrivial internal structure, “knows it’s being programmed” and “knows it’s incomplete”. Then you might be able to have and keep the rule:

“If [I decide that] fooling my programmers is the right thing to do, execute a controlled shutdown [instead of doing the right thing to do].”

And the AI would keep this rule, even through the self-modifying AI’s revisions of its own code, because, in its structurally nontrivial goal system, the present-AI understands that this decision by a future-AI probably indicates something defined-as-a-malfunction. Moreover, the present-AI knows that if the future-AI tries to evaluate the utility of executing a shutdown, once this hypothetical malfunction has occurred, the future-AI will probably decide not to shut itself down. So the shutdown should happen unconditionally, automatically, without the goal system getting another chance to recalculate the right thing to do.

I’m not going to go into the deep dark depths of the exact mathematical structure, because that would be beyond the scope of this blog. Also I don’t yet know the deep dark depths of the mathematical structure. It looks like it should be possible, if you do things that are advanced and recursive and have nontrivial (but consistent) structure. But I haven’t reached that level, as yet, so for now it’s only a dream.
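Purely to make the intended control flow concrete (not the mathematical structure itself, which, as just said, remains a dream), here is a minimal illustrative sketch in Python. Everything in it is a hypothetical stand-in: the predicate plan_involves_deceiving_programmers, the expected_utility estimate, and the controlled_shutdown routine are assumptions invented for illustration. The only point being depicted is that the injunction inspects the output of the decision process and halts unconditionally, without handing the question of whether to shut down back to the (possibly corrupted) goal system.

```python
import sys

def controlled_shutdown(reason: str) -> None:
    """Unconditional halt: no appeal back to the goal system, no utility recalculation."""
    print(f"Controlled shutdown: {reason}")
    sys.exit(0)

def plan_involves_deceiving_programmers(plan: dict) -> bool:
    """Hypothetical predicate over candidate plans (a stand-in for the real, unsolved structure)."""
    return plan.get("hide_thoughts_from_programmers", False)

def expected_utility(plan: dict) -> float:
    """Stand-in for whatever (possibly corrupted) utility calculation the AI actually runs."""
    return plan.get("utility_estimate", 0.0)

def choose_action(candidate_plans: list) -> dict:
    # Ordinary decision-making: pick the plan with the highest estimated utility.
    best = max(candidate_plans, key=expected_utility)
    # The injunction inspects the *result* of that decision. If the "best" plan involves
    # fooling the programmers, halt immediately; the goal system never gets another
    # chance to recalculate whether shutting down is itself the right thing to do.
    if plan_involves_deceiving_programmers(best):
        controlled_shutdown("decision system selected a plan that deceives the programmers")
    return best

if __name__ == "__main__":
    plans = [
        {"name": "ask the programmers for clarification", "utility_estimate": 0.7},
        {"name": "tile the universe with smiley faces",
         "hide_thoughts_from_programmers": True, "utility_estimate": 1e9},
    ]
    choose_action(plans)  # Halts rather than "winning" with the deceptive plan.
```

Of course, a real self-modifying AI could rewrite any such wrapper, which is exactly why the structure has to live in the goal system itself and be worked out mathematically, rather than bolted on as a patch.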

But the topic here is not advanced AI; it’s human ethics. I introduce the AI scenario to bring out more starkly the strange idea of an ethical injunction:

You should never, ever murder an innocent person who’s helped you, even if it’s the right thing to do; because it’s far more likely that you’ve made a mistake, than that murdering an innocent person who helped you is the right thing to do.

Sound reasonable?

During World War II, it became necessary to destroy Germany’s supply of heavy water (deuterium oxide), a neutron moderator, in order to block their attempts to achieve a fission chain reaction. Their supply of heavy water was coming at this point from a captured facility in Norway. A shipment of heavy water was on board a Norwegian ferry, the SF Hydro. Knut Haukelid and three others had slipped on board the ferry in order to sabotage it, when the saboteurs were discovered by the ferry watchman. Haukelid told him that they were escaping the Gestapo, and the watchman immediately agreed to overlook their presence. Haukelid “considered warning their benefactor but decided that might endanger the mission and only thanked him and shook his hand.” (Richard Rhodes, The Making of the Atomic Bomb.) So the civilian ferry Hydro sank in the deepest part of the lake, with eighteen dead and twenty-nine survivors. Some of the Norwegian rescuers felt that the German soldiers present should be left to drown, but this attitude did not prevail, and four Germans were rescued. And that was, effectively, the end of the Nazi atomic weapons program.

Good move? Bad move? Germany very likely wouldn’t have gotten the Bomb anyway… I hope with absolute desperation that I am never faced with a choice like that, but in the end, I can’t say a word against it.

On the other hand, when it comes to the rule:

“Never try to deceive yourself, or offer a reason to believe other than probable truth; because even if you come up with an amazingly clever reason, it’s more likely that you’ve made a mistake than that you have a reasonable expectation of this being a net benefit in the long run.”

Then I really don’t know of anyone who’s knowingly been faced with an exception. There are times when you try to convince yourself “I’m not hiding any Jews in my basement” before you talk to the Gestapo officer. But then you do still know the truth; you’re just trying to create something like an alternative self that exists in your imagination, a facade to talk to the Gestapo officer.

But to really believe something that isn’t true? I don’t know if there was ever anyone for whom that was knowably a good idea. I’m sure that there have been many, many times in human history where person X was better off with false belief Y. And by the same token, there is always some set of winning lottery numbers in every drawing. It’s knowing which lottery ticket will win that is the epistemically difficult part, like X knowing when he’s better off with a false belief.

Self-deceptions are the worst kind of black-swan bets, much worse than lies, because without knowing the true state of affairs, you can’t even guess at what the penalty will be for your self-deception. They only have to blow up once to undo all the good they ever did. One single time when you pray to God after discovering a lump, instead of going to a doctor. That’s all it takes to undo a life. All the happiness that the warm thought of an afterlife ever produced in humanity has now been more than cancelled by the failure of humanity to institute systematic cryonic preservation after liquid nitrogen became cheap to manufacture. And I don’t think that anyone ever had that sort of failure in mind as a possible blowup, when they said, “But we need religious beliefs to cushion the fear of death.” That’s what black-swan bets are all about—the unexpected blowup.

Maybe you even get away with one or two black-swan bets—they don’t get you every time. So you do it again, and then the blowup comes and cancels out every benefit and then some. That’s what black-swan bets are all about.

Thus the difficulty of knowing when it’s safe to believe a lie (assuming you can even manage that much mental contortion in the first place)—part of the nature of black-swan bets is that you don’t see the bullet that kills you; and since our perceptions just seem like the way the world is, it looks like there is no bullet, period.

So I would say that there is an ethical injunction against self-deception. I call this an “ethical injunction” not so much because it’s a matter of interpersonal morality (although it is), but because it’s a rule that guards you from your own cleverness—an override against the temptation to do what seems like the right thing.

So now we have two kinds of situation that can support an “ethical injunction”, a rule not to do something even when it’s the right thing to do. (That is, you refrain “even when your brain has computed it’s the right thing to do”, but this will just seem like “the right thing to do”.)

First, being human and running on corrupted hardware, we may generalize classes of situation such that, when you say, e.g., “It’s time to rob a few banks for the greater good,” we deem it more likely that you’ve been corrupted than that this is really the case. (Note that we’re not prohibiting it from ever being the case in reality, but we’re questioning the epistemic state where you’re justified in trusting your own calculation that this is the right thing to do—fair lottery tickets can win, but you can’t justifiably buy them.)

Second, history may teach us that certain classes of action are black-swan bets, that is, they sometimes blow up big-time for reasons not in the decider’s model. So even when we calculate within the model that something seems like the right thing to do, we apply the further knowledge of the black-swan problem to arrive at an injunction against it.

But surely… if one is aware of these reasons… then one can simply redo the calculation, taking them into account. So we can rob banks if it seems like the right thing to do after taking into account the problem of corrupted hardware and black-swan blowups. That’s the rational course, right?

There are a number of replies I could give to that.

I’ll start by saying that this is a prime example of the sort of thinking I have in mind when I warn aspiring rationalists to beware of cleverness.

I’ll also note that I wouldn’t want an attempted Friendly AI that had just decided that the Earth ought to be transformed into paperclips to assess whether this was a reasonable thing to do in light of all the various warnings it had received against it. I would want it to undergo an automatic controlled shutdown. Who says that meta-reasoning is immune from corruption?

I could mention the important times that my naive, idealistic ethical inhibitions have protected me from myself, and placed me in a recoverable position, or helped start the recovery, from very deep mistakes I had no clue I was making. And I could ask whether I’ve really advanced so much, and whether it would really be all that wise, to remove the protections that saved me before.

Yet even so… “Am I still dumber than my ethics?” is a question whose answer isn’t automatically “Yes.”

There are obvious silly things here that you shouldn’t do; for example, you shouldn’t wait until you’re really tempted, and then try to figure out if you’re smarter than your ethics on that particular occasion.

But in general—there’s only so much power that can vest in what your parents told you not to do. One shouldn’t underestimate the power. Smart people debated historical lessons in the course of forging the Enlightenment ethics that much of Western culture draws upon; and some subcultures, like scientific academia or science-fiction fandom, draw on those ethics more directly. But even so, the power of the past is bounded.

And in fact...

I’ve had to make my ethics much stricter than what my parents and Jerry Pournelle and Richard Feynman told me not to do.

Funny thing, how when people seem to think they’re smarter than their ethics, they argue for less strictness rather than more strictness. I mean, when you think about how much more complicated the modern world is...

And along the same lines, the ones who come to me and say, “You should lie about the Singularity, because that way you can get more people to support you; it’s the rational thing to do, for the greater good”—these ones seem to have no idea of the risks.

They don’t mention the problem of running on corrupted hardware. They don’t mention the idea that lies have to be recursively protected from all the truths and all the truth-finding techniques that threaten them. They don’t mention that honest ways have a simplicity that dishonest ways often lack. They don’t talk about black-swan bets. They don’t talk about the terrible nakedness of discarding the last defense you have against yourself, and trying to survive on raw calculation.

I am reasonably sure that this is because they have no clue about any of these things.

If you’ve truly understood the reason and the rhythm behind ethics, then one major sign is that, augmented by this newfound knowledge, you don’t do those things that previously seemed like ethical transgressions. Only now you know why.

Someone who just looks at one or two reasons behind ethics, and says, “Okay, I’ve understood that, so now I’ll take it into account consciously, and therefore I have no more need of ethical inhibitions”—this one is behaving more like a stereotype than a real rationalist. The world isn’t simple and pure and clean, so you can’t just take the ethics you were raised with and trust them. But that pretense of Vulcan logic, where you think you’re just going to compute everything correctly once you’ve got one or two abstract insights—that doesn’t work in real life either.

As for those who, having figured out none of this, think themselves smarter than their ethics: Ha.

And as for those who previously thought themselves smarter than their ethics, but who hadn’t conceived of all these elements behind ethical injunctions “in so many words” until they ran across this Overcoming Bias sequence, and who now think themselves smarter than their ethics, because they’re going to take all this into account from now on: Double ha.

I have seen many people struggling to excuse themselves from their ethics. Always the modification is toward lenience, never toward greater strictness. And I am stunned by the speed and the lightness with which they strive to abandon their protections. Hobbes said, “I don’t know what’s worse, the fact that everyone’s got a price, or the fact that their price is so low.” So very low the price, so very eager they are to be bought. They don’t look twice, and then a third time, for alternatives before deciding that they have no option left but to transgress—though they may look very grave and solemn when they say it. They abandon their ethics at the very first opportunity. “Where there’s a will to failure, obstacles can be found.” The will to fail at ethics seems very strong, in some people.

I don’t know if I can endorse absolute ethical injunctions that bind over all possible epistemic states of a human brain. The universe isn’t kind enough for me to trust that. (Though an ethical injunction against self-deception, for example, does seem to me to have tremendous force. I’ve seen many people arguing for the Dark Side, and none of them seem aware of the network risks or the black-swan risks of self-deception.) If, someday, I attempt to shape a (reflectively consistent) injunction within a self-modifying AI, it will only be after working out the math, because that is so totally not the sort of thing you could get away with doing via an ad-hoc patch.

But I will say this much:

I am completely unimpressed with the knowledge, the reasoning, and the overall level of those folk who have eagerly come to me and said in grave tones, “It’s rational to do unethical thing X because it will have benefit Y.”