The AI, the best human advisor

A putative new idea for AI control; index here.

All my ideas on reduced impact AI follow the same format: some restriction to ensure a reduced impact (I generally use something like “could a future AI detect that this one existed?”) and some “trick” to allow a way of still achieving a larger impact. Furthermore, there is always a “nice” way of achieving this larger impact (do the thing we really want you to do, and only that!), so that we hope the AI can be forced into that behaviour. Examples of “tricks” are:

In this post, I’ll present another variant: one that I think is the most interesting so far. I won’t change the reduced impact condition, but will present a different “trick”.

AI pretends to be human

As in this post, imagine that we have a reduced-impact AI in a box that conceals it from us. The AI has some goal, and will output some data to further that goal, then turn itself off. And then we’ll shake the box around or something, so we can’t be sure that the AI was turned on.

There are two issues to deal with here: a) the message itself is strong evidence that the AI was turned on, and b) if we exclude the message from the reduced impact, the message becomes dangerous.

To deal with a) first of all: this approach gets much easier if we have a whole brain emulation (WBE). Then we can house the WBE in the box with the AI. If the AI doesn’t send out a message, then the WBE sends out its own message instead. There are variants we can try here if we don’t have WBE: maybe some human volunteer willing to either die or take amnesia pills at the end of the experiment.

So, receiving a coherent message is not a surprising thing to come out of the output channel. However, an AI strongly motivated by reduced impact will simply seek to imitate as precisely as possible what the WBE would send. We still need a trick to get something useful.

AI pretends to be BEST human

Imagine that we wanted the AI to give us investment advice to make us rich. Say, a binary task: get us ten million dollars over the course of the year.

Now, this is the kind of thing that the WBE could conceivably achieve. We could run the WBE fast, for three subjective weeks, and it has a chance of giving us advice that would get us that rich. Assume that it has only a 0.1% chance of doing so. The AI, on the other hand, could produce advice that makes us that rich almost certainly.

Therefore, if we get advice, follow it, and end up at least ten million dollars richer, the odds update in favour of the AI having given us the advice (rather than the WBE doing so) is about 1000:1.
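
To make the numbers concrete, here is a minimal sketch of that odds update, using the illustrative figures above (0.1% success for the WBE, near-certain success for the AI); the variable names are mine, not part of the scheme:

    # Rough Bayes-factor calculation for the unconditioned case (toy numbers from the text).
    p_rich_given_wbe = 0.001  # WBE advice makes us $10M richer with ~0.1% probability
    p_rich_given_ai = 1.0     # AI advice succeeds almost certainly (taken as 1 for simplicity)

    # Likelihood ratio ("odds update") in favour of "the AI was turned on",
    # given only the observation that we ended up rich:
    bayes_factor = p_rich_given_ai / p_rich_given_wbe
    print(f"odds update towards the AI: about {bayes_factor:.0f}:1")  # ~1000:1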

And this gives us the solution to b). The “trick”, the escape clause, is to condition on the outcome of our investments. So we’re asking: given that we have become rich (and given the other facts about the universe in 50 years’ time), what additional evidence is there that the AI was turned on?
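
To see roughly how that conditioning helps, here is a sketch with made-up numbers (the probability of the WBE writing this particular message is my illustrative assumption, not something from the scheme): once we condition on having become rich, what matters is how likely the WBE was to have sent this exact message, given that its advice worked.

    # Sketch of the conditioning step (all numbers hypothetical).
    p_rich_given_wbe = 0.001   # WBE's overall chance of producing advice that works
    p_m_given_wbe = 0.0002     # chance the WBE would have written this particular message
    p_rich_given_m = 1.0       # this particular advice, if followed, reliably works

    # Given that we became rich, how likely is it that the WBE sent exactly this message?
    p_m_given_wbe_and_rich = p_m_given_wbe * p_rich_given_m / p_rich_given_wbe  # = 0.2

    # If the AI sends this message with certainty, the residual evidence that the AI
    # was on, once we are already conditioning on the outcome, is only:
    residual_update = 1.0 / p_m_given_wbe_and_rich
    print(f"residual odds update: about {residual_update:.0f}:1")  # 5:1, not ~1000:1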

Now, a sensible WBE is almost certain not to come up with a scheme that destroys the universe with self-tiling agents, or indeed to do anything too crazy. We could instruct the WBE to this effect, and we’re only running it for three subjective weeks, which prevents the kind of drift that you see in more extreme situations. Thus a message from the WBE, no matter how likely it is to make us rich, is almost certain not to lead to tiling the universe with self-replicating nano-bots.

So here, the AI merely has to come up with a message that could have come from the WBE with at least 0.01% probability, and that would indeed make us rich. Therefore, the message cannot contain secret code that would optimise the universe.
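
One toy way to picture the constraint this places on the AI (the candidate messages and their probabilities below are purely hypothetical): the AI is effectively choosing, from among the messages the WBE might plausibly have sent, the one most likely to achieve the goal. A message containing superintelligent “secret code” has essentially zero probability of coming from the WBE, so it is ruled out.

    # Toy model of the AI's effective choice (all entries hypothetical).
    # Each candidate message: (description, P(message | WBE), P(rich | message)).
    candidates = [
        ("ordinary WBE-style advice", 0.30, 0.0005),
        ("unusually good but human-plausible advice", 0.0002, 0.95),
        ("universe-optimising secret code", 1e-30, 1.0),
    ]

    threshold = 1e-4  # the 0.01% (1 in 10,000) plausibility floor

    # A reduced-impact AI must stay among messages the WBE could plausibly have sent;
    # within those, it picks the one most likely to make us rich.
    admissible = [c for c in candidates if c[1] >= threshold]
    best = max(admissible, key=lambda c: c[2])
    print(best[0])  # -> "unusually good but human-plausible advice"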

Note that the goal we are aiming for must not be too extreme in impact (the higher the impact of the goal, the more latitude the AI has to have an impact itself, “hiding” inside the goal’s impact), and we should restrict ourselves to events of probability, say, greater than 1/10,000 (as when the probability gets too small, odd events start happening).

It seems that this scheme could result in AIs that always give the very best possible advice a human (or WBE) could ever give, which would be a very useful result.

Whose probability?

We have a probability distribution over events. The AI itself will have a probability distribution over events. The future hypothetical super-super-AI it is using to compute reduced impact has a probability distribution over events, and the AI has a probability distribution over that probability distribution. If all of them agree on the probability of us getting richer (given WBE advice and given not), then this scheme should work.

If they disagree, there might be problems. A more complex approach could directly take into account the divergent probability estimates; I’ll think about that and return to the issue later.