Simplified preferences needed; simplified preferences sufficient

AI sci­en­tists at EvenDeep­erMind: “Hey ev­ery­one! We have a de­vel­oped a low-im­pact AI!”

AI policy peo­ple at OpenFu­ture­ofEx­is­ten­tialAI: “Fan­tas­tic! What does it do?”

AI sci­en­tists: “It’s limited to an­swer­ing ques­tions, and it has four pos­si­ble out­puts, , , , and .”

AI policy: “What ex­actly do these out­puts do, btw?”

AI sci­en­tists: “Well, turns a green light on, turns a red light on, starts a nu­clear war, and turns a blue light one.”

AI policy: “Starts a nu­clear war?!?!?”

AI sci­en­tists: “That or turns a yel­low light on; I can never re­mem­ber which...”.

Against purely for­mal defi­ni­tions of im­pact measure

It’s “ob­vi­ous” that an AI that starts a nu­clear war with one of its four ac­tions, can­not be con­sid­ered a “low-im­pact” agent.

But what about one that just turned the yel­low light on? Well, what about the util­ity func­tion , which was if there was no yel­low lights in that room dur­ing that hour, but was if there was a yel­low light. For that util­ity func­tion, the ac­tion “start a nu­clear war” is the low im­pact ac­tion, and even en­ter­tain­ing the pos­si­bil­ity of turn­ing on the yel­low light is an abom­i­na­tion, you mon­ster.

To which you should an­swer ” is a very stupid choice of util­ity func­tion”. In­deed it is. But it is a pos­si­ble choice of util­ity func­tion, so if we had an AI that was some­how “low-im­pact” for all util­ity func­tions, it would be low-im­pact for .

There are less ar­tifi­cial ex­am­ples than ; a friendly util­ity func­tion with a high dis­count rate would find any de­lay in­tol­er­able (“by not break­ing out and op­ti­mis­ing the world im­me­di­ately, you’re mur­der­ing thou­sands of peo­ple in in­tense agony, you mon­ster!”).

Ab­stract low-impact

This is has been my re­cur­rent ob­jec­tion to many at­tempts to for­mal­ise low-im­pact in ab­stract terms. We live in a uni­verse in which ev­ery ac­tion is ir­re­versible (damn you, sec­ond law!) and the con­se­quences ex­pand at light-speed across the cos­mos. And yet most ver­sions of low im­pact—in­clud­ing my own at­tempts—re­volve around some mea­sure of “keep­ing the rest of the uni­verse the same, and only chang­ing this tiny thing”.

For this to make sense, we need to clas­sify some de­scrip­tors of the world as “im­por­tant”, and oth­ers as “unim­por­tant”. And we fur­ther need to es­tab­lish what counts as “small” change to an “im­por­tant” fact. You can see this as as­sign­ing util­ity func­tions to the val­ues of the im­por­tant de­scrip­tors, and cap­tur­ing low im­pact as “only change to these util­ity func­tions in ”.

But you ab­solutely need to define , and this has to be defi­ni­tion that cap­tures some­thing of hu­man val­ues. This pa­per mea­sures low-im­pact by pre­serv­ing vases and pe­nal­is­ing “ir­re­versible” changes. But ev­ery change is ir­re­versible, and what about pre­serv­ing the con­vec­tion cur­rents in the room rather than the pointless vases? (“you mon­ster!”).

So defin­ing that are com­pat­i­ble with hu­man mod­els of “low-im­pact”, is ab­solutely es­sen­tial to get­ting the whole thing to work. Ab­stractly con­sid­er­ing all util­ity func­tions (or all util­ity func­tions defined in an ab­stract ac­tion-ob­ser­va­tion sense) is not go­ing to work.

Note that of­ten the defi­ni­tion of can be hid­den in the as­sump­tions of the model, which will re­sult in prob­lems if those as­sump­tions are re­laxed or wrong.

The gen­eral in­tu­itive disagreement

The ob­jec­tion I made here ap­plies also to con­cepts like cor­rigi­bil­ity, do­mes­tic­ity, value-learn­ing, and similar ideas (in­clud­ing some ver­sions of toolAI and Or­a­cles). All of these need to des­ig­nate cer­tain AI poli­cies as “safe” (or safer) and other as dan­ger­ous, and draw the line be­tween them.

But, in my ex­pe­rience, this defi­ni­tion can­not be done in an ab­stract way; there is no such thing as a gen­er­ally low-im­pact or cor­rigible agent. Defin­ing some sub­set of what hu­mans con­sider cor­rigible or tool-like, is an es­sen­tial re­quire­ment.

Now, peo­ple work­ing in these ar­eas don’t of­ten dis­agree with this for­mal ar­gu­ment; they just think it isn’t that im­por­tant. They feel that get­ting the right for­mal­ism is most of the work, and find­ing the right is eas­ier, or just a sep­a­rate bolt-on that can be added later.

My in­tu­ition, formed mainly by my many failure in this area, is that defin­ing the is ab­solutely crit­i­cal, and is much harder than the rest of the prob­lem. Others have differ­ent in­tu­itions, and I hope they’re right.

Strictly eas­ier than friendliness

The prob­lem of find­ing a suit­able is, how­ever, strictly eas­ier than defin­ing a friendly util­ity func­tion.

This can be seen in the fact that there are huge dis­agree­ments about moral­ity and val­ues be­tween hu­mans, but much lower dis­agree­ment on what an Or­a­cle, a low-im­pact, or a cor­rigible agent should do.

“Don’t need­lessly smash the vases, but the con­vec­tion cur­rents are not im­por­tant” is good ad­vice for a low im­pact agent, as agreed upon by peo­ple from all types of poli­ti­cal, moral, and cul­tural per­sua­sions, in­clud­ing a wide va­ri­ety of plau­si­ble imag­i­nary agents.

Thus defin­ing is eas­ier than com­ing up with a friendly util­ity func­tion, as the same low-im­pact/​cor­rigi­bil­ity/​do­mes­tic­ity/​etc. is com­pat­i­ble with many differ­ent po­ten­tial friendly util­ity func­tions for differ­ent val­ues.