Thoughts on “AI safety via debate”

Geoffrey Irving, Paul Christiano, and Dario Amodei of OpenAI have recently published “AI safety via debate” (blog post, paper). As I read the paper I found myself wanting to give commentary on it, and LW seems like as good a place as any to do that. What follows are my thoughts taken section-by-section.

1 Introduction

This seems like a good time to confess that I’m interested in safety via debate because I thought about it prior to the publication of this paper and don’t think it will work. I like the general direction and think it is the kind of thing that is likely to work because it puts pressure on AI decision processes to be both value aligned and believably aligned, but I think debate as a form has limitations that inherently make it unlikely to produce aligned AI. I instead prefer the idea of safety via dialectic and have been working on a not-yet-ready-for-publication AI alignment proposal I call “dialectical alignment”.

I point this out to give some context to my thoughts and lay my biases on the table: I’m already thinking about something pretty similar to what we might call “safety via debate” but am skeptical of debate itself.

2 The debate game

Debate is here described as a specific game with mathematical formalism rather than the fuzzy human process we often call debate. I appreciate this because it lets us compare the game precisely to the dialectic process I prefer.

In the debate game we have two agents. Each is shown the question and asked for an answer independently, with no knowledge of the other agent’s answer. Unlike in most human debate, the two agents are allowed to give the same answer if they like. The two agents then go back and forth making statements in support of their answers, and at this stage each agent knows the other agent’s answer and the previous statements made. The game is decided by a (human) judge who knows the question, the answers, and the statements. Critically, the debate game is zero-sum and hinges on the claim that, within the context of the game, it is harder to lie than to refute a lie.
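To make the game’s shape concrete, here is a minimal sketch in Python. The `Agent` class and the judge below are toy stand-ins I made up for illustration, not anything from the paper’s training setup; the point is only the structure: independent answers, alternating statements with full visibility of the transcript, and a single zero-sum judgment.

```python
class Agent:
    """Toy debater: a fixed answer plus canned supporting statements."""

    def __init__(self, answer, statements):
        self._answer = answer
        self._statements = list(statements)

    def answer_question(self, question):
        # Answers are produced independently, with no knowledge of the
        # other agent's answer.
        return self._answer

    def next_statement(self, question, own_answer, other_answer, transcript):
        # At this stage each agent sees the other's answer and every
        # previous statement; this toy version just reads off a script.
        return self._statements[len(transcript) // 2]


def play_debate(question, agent_a, agent_b, judge, rounds=2):
    """Run one zero-sum debate game; the judge picks a single winner."""
    ans_a = agent_a.answer_question(question)
    ans_b = agent_b.answer_question(question)
    transcript = []
    for _ in range(rounds):
        transcript.append(("a", agent_a.next_statement(question, ans_a, ans_b, transcript)))
        transcript.append(("b", agent_b.next_statement(question, ans_b, ans_a, transcript)))
    # The judge sees the question, both answers, and all statements,
    # and declares exactly one winner (zero-sum: no shared credit).
    winner = judge(question, ans_a, ans_b, transcript)
    return ans_a if winner == "a" else ans_b
```

Note that the answers are fixed at the start: nothing in the game lets an agent switch to a better answer it discovers mid-round.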

In contrast, I think of the dialectic process as one with a single agent who starts with a question and proposes a thesis. The agent presents statements supporting the thesis up to some bound (e.g. number of statements, time spent producing supporting statements, etc.), then puts forward the antithesis by negating the thesis and repeats the process of presenting supporting statements for the antithesis. This results in evidence both for and against the thesis. The agent then proposes a synthesis of thesis and antithesis that is better supported by the combined supporting statements (viz. more likely to be true given that evidence) than either the thesis or the antithesis is. The process is then repeated with the synthesis as the new thesis, up to some bound on how long we are willing to search for an answer. Although the agent’s statements and how much they support the thesis, antithesis, and synthesis may initially be assessed by an outside judge (probably human) during training, the intention is that the agent will eventually be able to make its own judgements.
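The loop I have in mind can be sketched similarly. All of the callables here (`propose`, `negate`, `support`, `synthesize`, `score`) are hypothetical placeholders for whatever the agent, or an outside judge during training, would actually do; the sketch only shows the thesis-antithesis-synthesis control flow.

```python
def dialectic(question, propose, negate, support, synthesize, score,
              n_statements=3, n_cycles=2):
    """Iterate thesis -> antithesis -> synthesis, keeping a synthesis
    only if it is better supported by the pooled evidence than either
    the thesis or the antithesis."""
    thesis = propose(question)
    for _ in range(n_cycles):
        antithesis = negate(thesis)
        # Gather supporting statements for both sides, up to the bound.
        evidence = [support(thesis) for _ in range(n_statements)]
        evidence += [support(antithesis) for _ in range(n_statements)]
        synthesis = synthesize(thesis, antithesis, evidence)
        # Accept the synthesis only when it is more likely given ALL the
        # evidence than either the thesis or the antithesis is.
        if score(synthesis, evidence) > max(score(thesis, evidence),
                                            score(antithesis, evidence)):
            thesis = synthesis  # synthesis becomes the new thesis
        else:
            break  # no better answer found within this cycle's bound
    return thesis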

What I find lacking in the debate game is that it requires the production of answers prior to knowledge about arguments for and, critically, against those answers, and it lacks a way to update answers based on information learned within the round. For example, in the example debate between Alice and Bob, who propose Alaska and Bali, respectively, as vacation destinations, suppose Alice realizes that Bob’s proposal of Bali makes her prefer to have answered Hawaii instead of Alaska. This counts as a win for Bob and Bali, but in an important sense it is a loss for both: there was a better answer, one of them figured it out, and the debate game’s structure prohibits switching to it. The authors will hopefully address this point later in the paper, but for the time being I see this as a critical deficiency: the game only gives us a way to choose winners and disincentivize lying, and importantly lacks a way to move towards better answers (this is due to the zero-sum aspect of the game).

Thus if we want to use debate to get aligned AI it’s going to have to be used in conjunction with a larger process that allows agents to change or be replaced, such as using the debate game as part of a fitness function to evolve a believable AI or as part of an iterative amplification-distillation process.
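As one illustration of the fitness-function use, a toy outer selection loop might look like the following, where `play` runs one debate between two agents and returns the winner’s index (0 or 1). This is my own sketch of the idea, not anything proposed in the paper; `mutate` stands in for whatever variation operator the larger process uses.

```python
import random


def evolve_by_debate(population, play, mutate, generations=10, rng=None):
    """Repeatedly pit random pairs against each other in the debate
    game; replace the loser with a mutated copy of the winner, so that
    debate wins drive selection."""
    rng = rng or random.Random(0)
    pop = list(population)
    for _ in range(generations):
        i, j = rng.sample(range(len(pop)), 2)
        winner = (i, j)[play(pop[i], pop[j])]  # index of the debate winner
        loser = j if winner == i else i
        pop[loser] = mutate(pop[winner])
    return pop
```

The debate game itself only picks winners; it is this outer loop that moves the population toward better answers.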

3 Experiments: proposed and preliminary

I skipped this section since it seems mostly irrelevant to my interests.

4 Reasons for optimism

Section 4.1 does a great job of addressing the computational and game-theoretic issues around ignorance and uncertainty in a way that even matches how those issues are handled in human debates.

Section 4.2 fails to convince me that we have much reason to believe agents playing the debate game would not be incentivized to lie the way humans are often incentivized to lie in human debate. The hope (the authors also use “hope” to express their thoughts on this) is that the judge can disincentivize lying through punishing it when it pops up, but I don’t see a compelling argument to believe it can reliably do this.

Section 4.3 suggests debate will be a powerful optimization process the way other adversarial training processes have recently proven successful. On this point I agree.

5 Reasons to worry

Overall I’m impressed at the consideration given to the problems with debate, many of them not ones I initially considered. I agree with the authors that debate may fail if humans are insufficiently capable judges of debate, that honesty may not be optimal, that honesty may especially not be optimal in computable agents, and that there may be dangers associated with training AI to be good at debate if honesty is not sufficiently guaranteed. As it seems we’ll see later, debate is not likely to be sufficient on its own: it is only one tool that might be useful when combined with other techniques, so long as doing so does not make AI developed through debate lose performance.

Sections 5.6 and 5.7 are of particular interest to me because they address worries that also exist for dialectical alignment. Specifically, both debate and dialectic may fail to converge: debate because new statements cause the judge to continually flip answer choice, and dialectic because no synthesis is found that becomes more likely as it incorporates more evidence. Alas, much as I don’t have a general solution for the convergence problem in dialectic, neither do the authors offer one for debate.

I come away from sections 4 and 5 even less certain that debate is likely to work.

6 Refinements and variations on debate

I’m excited by the proposals in this section, especially 6.2, since it allows the kind of information sharing I’m hoping AI can take advantage of via dialectic, and 6.4, since it reduces some of the impact of debate’s incentives to lie. My suspicion is that there is a sense in which I can build on the idea of debate as presented to better describe my own ideas about dialectical alignment.

7 Approximate equivalence with amplification

Not much to say here: debate and amplification are similar, with important implementation differences, yet nevertheless operate on many of the same principles. Dialectical alignment would be similar too, but it removes the adversarial component from debate and replaces amplification/distillation with the thesis-antithesis-synthesis cycle.

8 Conclusions and future work

The authors encourage readers to search for alignment proposals similar to amplification and debate; I happen to think dialectic fits this bill and offers benefits, but I’ll have to make that case more fully elsewhere.

Having read the whole paper now, I remain concerned that debate is not likely to be useful for alignment. Aside from the adversarial training issues, which seem to me likely to produce agents optimized for things other than human values in the service of winning debate even when constrained by human judging, debate also lacks in itself a way to encourage agents to update on information learned during the round; it instead incentivizes them to develop whatever arguments allow them to win. To be fair, the authors seem aware of this and acknowledge that debate would need to be combined with other methods in order to provide a complete alignment solution. In this light it does seem reasonable that, if we engineer our way to alignment rather than prove our way to it, debate may help address some subproblems in alignment that are not as well addressed by other methods.

Strangely, I find reading about debate makes me feel a bit more confident that amplification, Paul et al.’s approach to alignment at OpenAI, is likely to work, keeping in mind I’ve recently been flip-flopping a bit on my assessment of it (cf. my recent assessment of existing alignment programs and my comments on Stuart’s thoughts on amplification). I’m not sure if this says more about my uncertainty around what Paul et al. are attempting or reflects high-variance updates based on developments in their program, but it is worth noting if you’re looking to my assessments as evidence about their program. Hopefully I will have more to say soon about dialectical AI alignment so it can be more fully evaluated in comparison to ideas like debate and amplification.