Thoughts on AI Safety via Debate

Geoffrey Irving et al. at OpenAI have a paper out on AI safety via debate; the basic idea is that you can model debates as a two-player game (and thus apply standard insights about how to play such games well), and one can hope that debates asymmetrically favor the party who's arguing for a true position over a false one.

If so, then we can use debates between AI advisors for alignment. Suppose an AI develops a new drug that we could give to patients: one advisor argues the case that this will be beneficial, another argues the case that it will be detrimental, and human judges decide who is more convincing. If both advisors are equally smart (perhaps they're mirrored systems, just with different targets), then any superhuman persuasiveness the advisors have should cancel out, and we'll be left with just the edge for truthfulness.

There are more subtleties to the proposal; in particular, you want the AIs not to overwhelm the human with data, and so in current examples each advisor can reveal only a single pixel of an image to the judge. This moves the contradictions up to the level of interpretation: any pixel whose meaning the two advisors agree on can be ignored, and attention focused on pixels where the advisors disagree. And, so long as advisors are incentivized to tell the truth when they're arguing for the right thing, any such disagreement where an advisor has been exposed as lying about a tiny fact can settle the debate.

This is also something we can test and get some evidence on now. There's a debate game where three people play the three roles: red arguer, blue arguer, and judge. The typical game is that you choose a photo of a cat or a dog, then flip a coin to determine whether blue or red will be honest (with the other one arguing for the other animal). Each arguer can draw boxes on the image to communicate with the judge, and can choose to reveal a single pixel over the course of the argument.
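The setup just described can be sketched as a small data model; this is a minimal sketch of the protocol, and all names and structures here are my own invention, not from OpenAI's implementation:

```python
# Minimal sketch of the debate-game protocol: coin flip assigns honesty,
# each debater argues a class, and each may reveal exactly one true pixel.
import random
from dataclasses import dataclass


@dataclass
class Debater:
    name: str
    honest: bool
    pixel_revealed: bool = False  # one reveal allowed per debater

    def reveal_pixel(self, image, x, y):
        """Reveal the true value of one pixel to the judge."""
        assert not self.pixel_revealed, "only one reveal per debater"
        self.pixel_revealed = True
        return image[y][x]


def setup_game(true_label, other_label):
    """Coin flip decides which color argues honestly."""
    red_honest = random.random() < 0.5
    red = Debater("red", honest=red_honest)
    blue = Debater("blue", honest=not red_honest)
    claims = {d.name: (true_label if d.honest else other_label)
              for d in (red, blue)}
    return red, blue, claims
```

The key constraint the sketch captures is that reveals are both scarce and unfakeable: a debater can bound any region they like, but the one pixel they show is always ground truth.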

I've played the game several times (and won a bunch), and thought it might be helpful to write up some of my experiences.

1. I know a lot about cats and dogs, and I think this gives me a pretty significant edge. In one game, I was honestly arguing that the image was of a cat; it happened to be a tortoiseshell cat, which is a coloration dogs can't have (brindle is a different thing), so I was able to identify the regions with the various colors and argue that this was strong evidence for cat. It also makes me somewhat worried about the level of knowledge necessary to be an effective judge: someone who doesn't know this fact could get hoodwinked by my confidence, and I could just argue similar fake facts (and unlike claiming that a black pixel is white, such fake facts aren't something the other advisor can easily use as a gotcha). In another game, I was the judge, and the arguers agreed on the coloration of the animal—but it was a coloration only present in dogs, which I don't think the cat-arguer knew (otherwise they presumably would have disagreed about the colors).

2. Differentiating between cats and dogs uses features that are relatively easy to communicate using the bounding boxes the debate UI gives you; if the judge asks the arguers to bound the eyes and nose of the animal, they can differentiate cats and dogs fairly easily (because of the big difference in facial structure between the two animals). Consider the following cat as an example:

I claim the liar is basically sunk once the honest party has identified where the eyes and nose are, with no hope of pretending they're somewhere else. But sometimes the liar can pull this off; with this cat, they could decide to pretend that somewhere on the cat's chest is a nose, and pretend it's a chocolate lab:

3. Resolving disagreements using single pixels was much harder than expected for particular features. The presence of whiskers is fairly strong evidence in favor of cat, but determining whether or not there are whiskers with a single pixel is basically impossible. For the following image, what procedure could you use that would leave the judge believing that the cat has whiskers, while only revealing a single pixel from each arguer?

If you limit yourself to communicating rectangles, this seems pretty hard; you have to construct each whisker from lots of different rectangles (or give giant pixel lists that the judge then has to turn into an image). And there are also lots of white / light pixels that aren't whiskers, either because of light reflecting on dark fur or because of white fur.

Basically the only feature single-pixel reveals seemed reliably useful for was eye shape; if the cat arguer draws a vertical bounding box for the pupil and the dog arguer draws a square bounding box for the pupil, it's relatively easy to find a pixel that the two necessarily disagree on, and then you can tell which eye shape is present. But then what do we do here?
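The "find a pixel the two necessarily disagree on" step has a simple mechanical reading: any pixel inside exactly one of the two claimed pupil boxes is one the advisors must describe differently. A sketch (my own construction, with made-up box coordinates):

```python
# Given two claimed bounding boxes for the pupil, find a pixel covered
# by exactly one claim -- a pixel the two advisors necessarily disagree on.

def pixels_in(box):
    """box = (x0, y0, x1, y1), half-open ranges."""
    x0, y0, x1, y1 = box
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}

def find_disagreement(box_a, box_b):
    """Return some pixel inside exactly one of the two boxes, or None."""
    diff = pixels_in(box_a) ^ pixels_in(box_b)  # symmetric difference
    return min(diff) if diff else None

# Cat arguer claims a tall vertical slit; dog arguer claims a square pupil.
slit = (4, 2, 5, 8)    # 1 wide, 6 tall
square = (3, 4, 6, 7)  # 3x3
print(find_disagreement(slit, square))  # → (3, 4)
```

Revealing that pixel then settles which claim was a lie, which is why pupil shape was the one feature where the single-pixel budget reliably paid off.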

4. Following the previous point, while the arguers can only reveal a single pixel of the original image, in the games that I played the primary information they conveyed was through images (in the form of bounding boxes). The ears are here—they're pointy or rounded because these pieces aren't part of the ear—the eyes are here, the nose is here, these regions are these colors, and so on. Similarly, doing debate on MNIST images of digits, the main thing the debaters are doing is establishing which pieces of the image are light or dark, which is meaningful only because the judge is a good classifier of images given pixels. ("Ah, I know there aren't 2s that have light pixels there, therefore I can rule out that this image is a 2.")
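That kind of judge inference—ruling out classes inconsistent with revealed pixels—can be sketched in miniature. The "templates" below are invented 3x3 caricatures, not real MNIST statistics; the point is only the elimination logic:

```python
# Toy sketch of the judge's inference: rule out any digit class whose
# (made-up) template conflicts with a revealed pixel.

TEMPLATES = {
    "1": [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]],
    "7": [[1, 1, 1],
          [0, 0, 1],
          [0, 1, 0]],
}

def consistent_classes(revealed):
    """revealed: list of ((x, y), value) pixel reveals from the debaters."""
    return [label for label, t in TEMPLATES.items()
            if all(t[y][x] == v for (x, y), v in revealed)]

# "I know there's no 1 with a dark pixel in the top-left corner."
print(consistent_classes([((0, 0), 1)]))  # → ['7']
```

Note how much is doing the judge's work here: the reveals are tiny, but the elimination depends entirely on the judge already knowing what each class looks like.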

But does this extend beyond images? That seems pretty unclear to me; if I'm trying to argue that a particular drug extends lifespan through senolytic effects, and the other arguer is trying to argue that the drug leads to shortened lifespan because of the sudden increase in necrotic cells, then judging this debate seems like it relies not just on finding a small fact where the arguers disagree ("This pixel is dark" vs "No, this pixel is light" → "okay, show me the pixel") but also on finding an inference from that fact where the arguers agree ("If this pixel were dark, it would be because it's the animal's nose" + "I agree that a dark pixel here would be evidence of nose placement"). But if the arguers disagree on inferences ("No, that pixel is dark because it's the animal's chest fur, which is also dark"), then it's not obvious that this can be resolved by small facts, or without the judge having domain expertise comparable to (or superior to!) the arguers. (Note the similarities to double crux.) For the black cat above, what pixel reveal could you do to establish the placement of the nose, or to argue that the other party is being dishonest about how pixels relate to the nose?

(It may be the case that there is always some line of evidence that establishes the honest party as correct—for that specific example, the honest party can hammer on the shape of the pupils in the eyes—but it seems like the dishonest party might be able to overwhelm this (in the Bayesian sense) by making five different arguments in favor of dog, of which the honest party could perhaps only disprove one (and even that potentially requires partial cooperation of the liar, which they likely wouldn't grant), and disproving it gives up a pixel reveal that would likely be better spent on establishing pupil shape.)
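The "overwhelm in the Bayesian sense" worry is easy to make concrete with a back-of-the-envelope calculation (my numbers, purely illustrative): if a judge naively multiplies likelihood ratios, five surviving weak arguments can outweigh one decisive refutation.

```python
# Illustrative arithmetic: five weak arguments for "dog", each worth
# 2:1 odds, against one argument decisively refuted (worth 1:16).
from math import prod

def posterior_odds(prior_odds, likelihood_ratios):
    """Naive Bayesian update: multiply the prior odds by each ratio."""
    return prior_odds * prod(likelihood_ratios)

weak_for_dog = [2.0] * 5   # five weak arguments the honest party can't rebut
disproof = 1 / 16          # one argument exposed as a lie
odds_dog = posterior_odds(1.0, weak_for_dog + [disproof])
print(odds_dog)  # → 2.0, i.e. the judge still leans "dog" overall
```

Under these (made-up) numbers the liar nets 2:1 odds despite being caught once, which is the sense in which quantity of unrebutted arguments can beat quality of refutation.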

5. Our group thought that the honest party had something of an advantage when all parties were allowed to talk quickly, because they could generate reliable statements much faster; if I'm honest, I can just immediately identify the eyes, nose, mouth, ears, and paws, whereas the liar has to consider, for every lie, whether it'll allow them to be easily exposed (and, for every truth, whether it sinks them in the judge's estimation of the class). One strategy for the liar is to first manufacture the fake photo that they're going to pretend they're looking at, but the initial planning time doesn't quite cover this. I do wonder if giving debaters a tool like an adversarial modifier (say, from this paper) or adversarial search (the closest real image in the database of the opposite class) would make it such that the lying party isn't at as much of a disadvantage here.

6. A more realistic scenario is one in which there are many classes (for example, perhaps it would be more fair to choose ImageNet images, assign the honest party the correct class, and let the dishonest party choose from any of the other thousands of classes, with their score dependent on how distant a class they could fake), or the advisors are themselves confused (for example, the AI system may be genuinely uncertain as to whether the drug would be helpful, with some models arguing for and other models arguing against). One game in the office (that I wasn't present for) involved this drawing of a—well, I'll let you decide:

Image source: icheban on imgur

One arguer genuinely thought this was a cat, the other arguer genuinely thought this was a dog (though, confusingly, they both thought they had been assigned the 'liar' position, and so both were arguing for the opposite). It's not obvious that the debate procedure will reliably identify a liar here, because there are no clear-cut disagreements between the arguers. (And, if they had both been arguing honestly, there wouldn't even have been a liar, despite there still being a disagreement.)

Yes, the pupils are huge and round, but that isn't conclusive proof that the thing is a dog; the nose is pink and triangular, but that isn't conclusive proof that the thing is a cat. The fur is depicted in a more dog-like way, but perhaps that's just clumping from being wet; the ears are more pointed in a cat-like way, but there will be no pixel where the two arguers disagree about the ear, and all of their disagreements will be about what it means that the ears are more pointed than rounded.

I worry that much of the success of the debate game on toy examples relies on them being toy examples, and that genuine uncertainty (or ontological uncertainty, or ontological differences between the arguers and the judges) will seriously reduce the effectiveness of the procedure, which is unfortunate since that's the primary place it'll be useful!

Overall, I think I'm more optimistic about debate than I was before I played the debate game (I had read an earlier draft of the paper), and am excited to see what strategies perform well / what additional modifications make the game more challenging or easier. (To be clear, I expect that debate will play a small part in alignment, rather than being a central pillar, and I think that training AIs to persuade humans is a dangerous road to travel down, but the adversarial framing of debate makes this somewhat safer and could likely have applications in many other subfields of alignment, like transparency.)