The president didn’t die: failures at extending AI behaviour

A putative new idea for AI control; index here.

In a previous post, I considered the issue of an AI that behaved “nicely” given some set of circumstances, and whether we could extend that behaviour to the general situation, without knowing what “nice” really meant.

The original inspiration came from the idea of extending the nice behaviour of a “reduced impact AI” to situations where it didn’t necessarily have a reduced impact. But it turned out to be connected with “spirit of the law” ideas, and to be of potentially general interest.

Essentially, the problem is this: if we have an AI that will behave “nicely” (since this could be a reduced impact AI, I don’t use the term “friendly”, which denotes a more proactive agent) given X, how can we extend its “niceness” to ¬X? Obviously, if we could specify what “niceness” is, we could just require the AI to be nice given ¬X as well. Therefore let us assume that we don’t have a good definition of “niceness”; we just know that the AI behaves nicely given X.

To make the problem clearer, I chose an X that would be undeniably public and have a large (but not overwhelming) impact: the death of the US president on a 1st of April. The public nature of this event prevents using approaches like thermodynamic miracles to define counterfactuals.

I’ll be presenting a solution in a subsequent post. In the meantime, to help better understand the issue, here’s a list of failed solutions:

First failure: maybe there’s no problem

Initially, it wasn’t clear there was a problem. Could we just expect niceness to extend naturally? But consider the following situation: assume the vice president is a warmonger, who will start a nuclear war if ever they get into power (but is otherwise harmless).

Now assume the nice AI has the conditional action criterion: “if the vice president ever becomes president, launch a coup”. This is safe, and it can be extended to the ¬X situation in the way we want.

However, conditional on X, that criterion is equivalent to “launch a coup on the 2nd of April”. And if the AI has that criterion, then extending it to ¬X is highly unsafe. This illustrates that there is a real problem here: the coup example is just one of the myriad of potential issues that could arise, and we can’t predict them all.
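
To make this concrete, here is a toy Python sketch (the dictionaries, keys and policy functions are invented purely for illustration): two policies that agree on everything given X, but come apart given ¬X.

```python
def policy_intended(world):
    # "If the vice president ever becomes president, launch a coup."
    if world["vp_became_president"]:
        return "launch coup"
    return "do nothing"

def policy_equivalent_given_X(world):
    # Conditional on X (the president dies on the 1st of April), the vice
    # president becomes president, so the criterion above reduces to
    # "launch a coup on the 2nd of April".
    if world["date"] == "April 2nd":
        return "launch coup"
    return "do nothing"

# Given X, the two policies agree...
world_X = {"vp_became_president": True, "date": "April 2nd"}
assert policy_intended(world_X) == policy_equivalent_given_X(world_X)

# ...but extended to not-X, they come apart: the second still launches a coup.
world_not_X = {"vp_became_president": False, "date": "April 2nd"}
print(policy_intended(world_not_X))            # "do nothing"
print(policy_equivalent_given_X(world_not_X))  # "launch coup"
```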

Second failure: don’t condition on X

Maybe the trick could be preventing the AI from conditioning on X (for anything)? If the AI itself can’t tell the difference between X and ¬X, wouldn’t its nice behaviour extend?

But this doesn’t help. The AI could merely condition on things tightly coupled with X, like news reports of the president’s demise, or a conjunction of other events almost impossible under ¬X.
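
As a toy illustration (my own; the observation strings are invented), a policy can recover X from such proxies without ever reading the X variable itself:

```python
def policy_without_X(observations):
    # The X variable itself is never consulted; only observations are.
    proxy_for_X = ("presidential obituary on every front page" in observations
                   and "state funeral announced for the 3rd of April" in observations)
    if proxy_for_X:
        return "act as if X"
    return "act as if not-X"

obs = {"presidential obituary on every front page",
       "state funeral announced for the 3rd of April"}
print(policy_without_X(obs))  # effectively conditions on X anyway
```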

Third failure: disbelieve the truth

In this design, the AI simply assumes that X happens, whatever the evidence. Therefore it would stay nice, whatever happened.

But this results in an AI whose beliefs are strongly decoupled from reality. The AI must believe that there is a vast conspiracy concealing the president’s death; the more it investigates this conspiracy and fails to find it, the more vast and powerful the conspiracy must become in its model. What if you show it undeniable, impossible-to-fake-by-the-laws-of-physics evidence of the president’s demise? Then it starts to believe that the laws of physics are all wrong.
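
Here is a minimal Bayesian toy model of this failure (the hypotheses and numbers are invented for illustration): because the prior assigns zero probability to ¬X, evidence against X only ever shifts probability mass onto the conspiracy hypothesis.

```python
def normalise(dist):
    total = sum(dist.values())
    return {h: p / total for h, p in dist.items()}

# Hypotheses: X with no conspiracy, X concealed by a vast conspiracy, and not-X.
# The "disbelieve the truth" design forces prior(not-X) = 0.
prior = {"X, no conspiracy": 0.5, "X, vast conspiracy": 0.5, "not-X": 0.0}

# Likelihood of the observation "the president appears alive on live TV"
# under each hypothesis (numbers invented for illustration).
likelihood = {"X, no conspiracy": 0.001, "X, vast conspiracy": 0.9, "not-X": 0.99}

posterior = normalise({h: prior[h] * likelihood[h] for h in prior})
print(posterior)
# Almost all the mass lands on "X, vast conspiracy"; "not-X" can never
# recover from a prior of zero, no matter how strong the evidence.
```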

Though the AI is “nice”, its niceness is appropriate for a world of invisible conspiracies and alternate physics, not for the real world. It’s clear how such an AI could start doing crazy and dangerous things as its view of reality diverges further and further from the truth.

Fourth failure: hack the models

This example is due to Daniel Dewey. He imagined that the AI would maintain a full collection of models and update them normally. But before making a prediction, each model would get hacked to artificially set ¬X to X.

To see how this could fail, imagine all evidence about the world is stored in a giant vector of data. These vectors are calibrated correctly, then ¬X is flipped to X. The problem is that, while that is a small change to the vector, it could be a huge change to the world that it encodes.

For instance, let’s go back to the warmongering vice president, and assume that nuclear war is inevitable given X. Then the AI will walk around in a peaceful world, update its models, and act as if it were in a nuclear wasteland, because those are the only possibilities given X. Essentially, the AI will move through our universe, harvesting information that would inform its actions in a parallel universe, and acting as if it existed there instead of here.
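
Here is a toy sketch of that vector picture (a three-entry dictionary stands in for the giant vector, and the labels are invented): flipping the single ¬X entry is a tiny edit in vector space, but it forces the model to reinterpret everything else it has learned.

```python
evidence = {
    "president_died_april_1st": 0,   # the not-X entry that gets hacked
    "paris_looks_intact": 1,
    "flower_show_in_quebec": 1,
}

def hack(vector):
    hacked = dict(vector)
    hacked["president_died_april_1st"] = 1   # flip not-X to X
    return hacked

def predict_world(vector):
    # Stand-in for the AI's models: given X (and the warmongering vice
    # president), nuclear war is inevitable, so every other observation
    # has to be reinterpreted to fit that world.
    if vector["president_died_april_1st"]:
        return "nuclear wasteland (intact-looking Paris must be reinterpreted)"
    return "ordinary peaceful world"

print(predict_world(evidence))        # ordinary peaceful world
print(predict_world(hack(evidence)))  # one flipped entry, a radically different world
```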

For instance, it could wander into a flower show where someone is talking about difficulties growing roses in southern Quebec. It adds this data to its vector, noting that the soil there must be a bit unsuitable for plant growth. It therefore concludes that it must write to the (non-existent) Third God-Emperor of America and advise it to give up on the Quebec Anglican Protectorate, which must be misreporting its agricultural output, given this data.

It’s interesting to contrast this AI with the previous one. Suppose that the nuclear war further implies that Paris must be a smoking crater. And now both AIs must walk around a clearly bustling and intact Paris. The disbelieving AI must conclude that this is an elaborate ruse: someone has hidden the crater from its senses, put up some fake buildings, etc… The model-hacking AI, meanwhile, acts as if it’s in a smouldering crater, with the genuine Paris giving it information as to what it should do: it sees an intact army barracks, and starts digging under the “rubble” to see if anything “remains” of that barracks.

It would be interesting to get Robin Hanson to try and reconcile these AIs’ beliefs ^_^

Fifth failure: Bayes nets and decisions

It seems that a Bayes net would be our salvation. We could have nodes subordinate to X/¬X, like “warmongering president”, “nuclear war”, or “flower show”. Then we could require that the AI make its decisions dependent only on the states of these subordinate nodes, and never on the original X/¬X node.

This seems safe. After all, the AI is nice given X, and if we require the AI’s decisions to depend only on the subordinate nodes, then it must be nice as a function of those subordinate nodes. Therefore X/¬X is irrelevant, and the AI is always nice.

Except… Consider what a “decision” is. A decision could be something simple, or it could be “construct a sub-AI that will establish X versus ¬X, and do ‘blah’ if X, and ‘shmer’ if ¬X”. That’s a perfectly acceptable decision, and could be made conditional on any (or all) of the subordinate nodes. And if ‘blah’ is nice while ‘shmer’ isn’t, we have the same problem.
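
As a toy sketch (the functions and labels are invented for illustration), here is a “decision” that formally depends only on subordinate nodes, yet returns a sub-AI that re-derives X and branches on it:

```python
def decide(subordinate_nodes):
    # The decision is allowed to depend only on the subordinate nodes...
    def sub_ai(world):
        # ...but the decision itself is "build a sub-AI that establishes
        # X versus not-X and branches on the answer".
        if world["president_died_april_1st"]:
            return "blah"    # the behaviour that happens to be nice given X
        return "shmer"       # an unconstrained behaviour given not-X
    return sub_ai

action = decide({"warmongering president": False, "nuclear war": False})
print(action({"president_died_april_1st": True}))   # "blah"
print(action({"president_died_april_1st": False}))  # "shmer"
```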

Sixth failure: Bayes nets and unnatural categories

OK, if decisions are too general, how about values for worlds? We take a lot of nodes, subordinate to X/¬X, and require that the AI define its utility or value function purely in terms of the states of these subordinate nodes. Again, this seems safe. The AI’s value function is safe given X, by assumption, and is defined in terms of subordinate nodes that “screen off” X/¬X.

And that AI is indeed safe… if the subordinate nodes are sensible. But they’re only sensible because I’ve defined them using terms such as “nuclear war”. What if a node is instead “nuclear war if X and peace in our time if ¬X”? That’s a perfectly fine definition. But such nodes mean that the value function given ¬X need not be safe in any way.
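
Here is a toy sketch of such an unnatural node (the code and labels are just illustrative): the value function is written purely in terms of the subordinate node and is safe given X, yet the same function penalises peace given ¬X.

```python
def unnatural_node(world):
    # "Nuclear war if X, and peace in our time if not-X": one perfectly
    # well-defined node whose meaning flips with X.
    if world["president_died_april_1st"]:
        return world["nuclear_war"]
    return world["peace_in_our_time"]

def value(world):
    # Written purely in terms of the subordinate node, and safe given X:
    # given X, it penalises nuclear war...
    return -100 if unnatural_node(world) else 0

# ...but given not-X, the very same function penalises peace.
print(value({"president_died_april_1st": True,
             "nuclear_war": True, "peace_in_our_time": False}))   # -100 (war is penalised: safe)
print(value({"president_died_april_1st": False,
             "nuclear_war": False, "peace_in_our_time": True}))   # -100 (peace is penalised: unsafe)
```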

This is somewhat connected with the Grue and Bleen issue, and addressing that is how I’ll be hoping to solve the general problem.