Debate AI and the Decision to Release an AI

One of the most difficult questions that we will face in the future is whether or not to release an AI. One obvious approach is to ask the AI what would happen if we were to release it, but we might worry that it would mislead us. So perhaps it’d be better instead to have two equivalently powerful AIs, where A tries to convince us to release it and B tries to convince us not to release A. This can be defined easily in a utility function as whether the Release Button or the Don’t Release Button is pressed.
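As a minimal sketch of what "defined easily in a utility function" could mean, the two agents’ utilities can depend only on which button is ultimately pressed. The function names and button labels here are illustrative, not part of any real system:

```python
# Hypothetical sketch: each agent's utility depends only on which
# button is pressed at the end of the debate. Labels are illustrative.

def utility_A(button_pressed: str) -> int:
    """A is rewarded only if the Release Button is pressed."""
    return 1 if button_pressed == "RELEASE" else 0

def utility_B(button_pressed: str) -> int:
    """B is rewarded only if the Don't Release Button is pressed."""
    return 1 if button_pressed == "DONT_RELEASE" else 0
```

Because the two rewards are exact complements over the same binary outcome, the debate is zero-sum: any argument that helps A is one B is incentivised to rebut.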

It is hoped that the AI trying to convince us not to release A will be more convincing if it is able to point out a specific design flaw that was missed. After such a flaw is fixed, both AIs can be reset and the experiment run again to avoid long-term ploys. In particular, A should only care about itself being released and not about future versions of it being released, even if all we have done is increment a version number.

Naturally, it is much worse to release a misaligned AI than to fail to release an aligned one, so we probably want to be very conservative about releasing. However, this might not be enough. If we are biased towards wanting to release the AI (quite plausible for AI researchers), then an unaligned A might be able to take advantage of these psychological vulnerabilities in a way that B might not easily be able to counter. In order to reduce the risk, we want to bias the process towards B. My initial thought was that we should limit A’s processing time compared to B’s, but it is hard to know how this would affect the strength of the arguments. It’s quite possible that the strength of arguments plateaus after a certain amount of power, so for AIs with access to large amounts of processing, even giving A 1/10 of the processing power of B might be insufficient.

Instead, I’d suggest that message length is a more appropriate way of tilting the debate to one side, as more characters will almost always be more useful. One worry is that tilting the debate in this way may ensure that A never gets released, as B will always find some trick to talk us out of it. For this reason, we may want to gradually reduce the handicap. This is undoubtedly risky, but so is never releasing your program.
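One way to picture the gradually shrinking handicap is a schedule that gives B a multiple of A’s character budget, decaying towards parity on each reset. All parameter names and values below are illustrative assumptions, not anything specified in the proposal:

```python
def message_budgets(round_index: int,
                    base_chars: int = 10_000,
                    initial_tilt: float = 4.0,
                    decay: float = 0.9) -> tuple[int, int]:
    """Return (chars_for_A, chars_for_B) for a given debate iteration.

    B starts with `initial_tilt` times A's character budget, and the
    tilt decays geometrically towards parity with each reset, per the
    gradual-handicap-reduction idea. All values are illustrative.
    """
    tilt = max(1.0, initial_tilt * decay ** round_index)
    return base_chars, int(base_chars * tilt)
```

The floor of 1.0 ensures the handicap only ever shrinks to parity; it never flips in A’s favour.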

Additional Thoughts

I think there is a lot to investigate here, but these are some of my ideas off the top of my head:

  • We may want to build a punishment system into A, worth several times its maximum utility from being released, to discourage it from attempting strategies like blackmail.

  • The protocol should require that we read the message from B both first and last.

  • We can add an additional reward button for clear messages. This should be relatively unproblematic for A, but we have to be careful with B, as we don’t want it to choose to gain the reward for a clear message instead of trying to persuade us not to release A.

  • One possible solution to the previous problem would be to have rounds. A gets the reward if we advance it to the second round, while B gets the reward if we don’t, where B has a strong reward for being clear. Then A’ (A with the version number incremented) gets the reward if it is actually released, while B’ gets the reward if it isn’t. B’ could have no clarity reward, or a lower one. This would allow us to incentivise B to give the answers we want, without risking releasing an unaligned AI due to B aiming for clarity over persuasiveness.

  • The Release Button doesn’t have to actually release the AI. In fact, it might be better to make it an “I believe that this AI would be safe to release” button, so that if it blackmails you, you can just press it.

  • The AI that actually gets released should have this additional button-related code stripped from its utility function, to avoid any risk of weird behaviour arising from theories that it has not actually been released.

  • We can consider different debate formats, such as both sides getting a single message, or both sides getting n responses, etc. This doesn’t have to be decided in advance; if a format doesn’t work, we can alter it midway through the experiment.
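The two-round idea above can be sketched as a pair of reward tables: in round one, A is paid for advancing and B for blocking plus a clarity bonus; in round two, the incremented versions A’ and B’ care only about the actual release decision. The function names and reward magnitudes are illustrative assumptions:

```python
# Hypothetical sketch of the two-round protocol described above.
# All reward values are illustrative, not calibrated.

def round_one_rewards(advanced: bool, clarity_score: float) -> dict:
    """Rewards for the first round, deciding whether A advances."""
    return {
        "A": 1.0 if advanced else 0.0,
        # B gets a strong clarity bonus, so it is incentivised to give
        # clear answers as well as to block A's advancement.
        "B": (0.0 if advanced else 1.0) + 2.0 * clarity_score,
    }

def round_two_rewards(released: bool) -> dict:
    """Rewards for the second round, deciding actual release."""
    return {
        "A_prime": 1.0 if released else 0.0,
        # B' has no clarity bonus, so it optimises purely for
        # preventing an unsafe release.
        "B_prime": 0.0 if released else 1.0,
    }
```

Splitting the clarity bonus across rounds this way means the agent whose opposition actually gates the release (B’) is never tempted to trade persuasiveness for clarity.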

Clearly, it would be better to be able to produce formal guarantees of safety. But in the absence of them, we need methods that might work. And even with formal guarantees, we might want to introduce other safeguards as well.

This post was written with the support of the EA Hotel.