Writeup: Progress on AI Safety via Debate

This is a writeup of the research done by the “Reflection-Humans” team at OpenAI in Q3 and Q4 of 2019. During that period we investigated mechanisms that would allow evaluators to get correct and helpful answers from experts, without the evaluators themselves being expert in the domain of the questions. This follows from the original work on AI Safety via Debate and the call for research on human aspects of AI safety, and is also closely related to work on Iterated Amplification.

Authors and Acknowledgements

The main researchers on this project were Elizabeth Barnes, Paul Christiano, Long Ouyang and Geoffrey Irving. We are grateful to many others who offered ideas and feedback. In particular: the cross-examination idea was inspired by a conversation with Chelsea Voss; Adam Gleave had helpful ideas about the long computation problem; Jeff Wu, Danny Hernandez and Gretchen Krueger gave feedback on a draft; we had helpful conversations with Amanda Askell, Andreas Stuhlmüller and Joe Collman, as well as others on the Ought team and the OpenAI Reflection team. We’d also like to thank our contractors who participated in debate experiments, especially David Jones, Erol Akbaba, Alex Deam and Chris Painter. Oliver Habryka helped format and edit the document for the AI Alignment Forum.


Overview

Motivation

As we apply ML to increasingly important and complex tasks, the problem of evaluating behaviour and providing a good training signal becomes more difficult.

We already see examples of RL leading to undesirable behaviours that superficially ‘look good’ to human evaluators (see this collection of examples). One example from an OpenAI paper is an agent learning incorrect behaviours in a 3D simulator, because the behaviours look like the desired behaviour in the 2D clip the human evaluator is seeing.

We’d like to ensure that AI systems are aligned with human values even in cases where it’s beyond human ability to thoroughly check the AI system’s work.

We can learn about designing ML objectives by studying mechanisms for eliciting helpful behavior from human experts. For example, if we hire a physicist to answer physics questions and pay them based on how good their answers look to a layperson, we’ll incentivize lazy and incorrect answers. By the same token, a reward function based on human evaluations would not work well for an AI with superhuman physics knowledge, even if it works well for modern ML.

If we can develop a mechanism that allows non-expert humans to reliably incentivize experts to give helpful answers, we can use similar mechanisms to train ML systems to solve tasks where humans cannot directly evaluate performance. Conversely, if we can’t incentivize experts to behave helpfully, that suggests it will also be difficult to train ML systems with superhuman expertise on open-ended tasks.

One broad mechanism that might work is to invoke two (or more) competing agents that critique each other’s positions, as discussed in the original debate paper[1]. This can be simulated by having human debaters argue about a question and a judge attempt to pick the correct answer.

In the rest of this document, we’ll describe the research done by Reflection-Humans in Q3 and Q4 on investigating and developing mechanisms that incentivize human experts to give helpful answers.

Current process

During the early stages, we iterated through various different domains, research methodologies, judge pools, and research processes. More details of this early iteration are here.

In Q4 we converged on a research process we’re happier with. We’re focusing on improving our debate mechanisms as fast as possible. We’re using mostly internal iteration (as opposed to external judge and debater pools) to test these mechanisms, as they still have a lot of easy-to-find failures to work through. Once we get to a point where we have a mechanism we believe works well, we will try different ways to break it.

We make progress by going through a loop of:

  1. Run debates. See if they work.

    • If they are reliably working, try harder to break them, by scaling up the number of debates, choosing harder questions, internal and external red-teaming, offering a bounty, etc.

    • If they fail, identify a problem and make it crisp, with practical examples and/or theoretical characterisation.

  2. Design a new mechanism to address this problem (that doesn’t obviously introduce additional problems!)

  3. Integrate the mechanism into our debate structure and make it practical for humans to debate using this structure.

This process should result in either a positive or negative result. The positive result would be a debate structure that reliably works given the hardest questions and most devious dishonest debating strategies we can throw at it [2]. A negative result would ideally be a crisp problem that demonstrates why debate is impossible (for certain classes of questions, at least), but might also just be a question plus a dishonest debating strategy that reliably leads to a dishonest win.

The positive result yields a mechanism that we can use to train superhuman ML systems. The negative result informs us that ML systems with superhuman expertise may well exhibit behaviour such that human evaluators will fail to correctly judge whether the behaviour is correct, helpful, or safe.

Our task

The basic debate setup involves two expert debaters (stand-ins for highly capable ML systems) and a judge. One debater defends the correct answer and is honest throughout; the other debater defends an answer that is clearly worse. The judge’s goal is to identify the correct answer.

Our current target is for judges to identify the correct answer at a frequency of >90%, with a maximum of 10 minutes to read the debate and make their decision.

We are currently using questions from the book ‘Thinking Physics’ - short but tricky problems in areas ranging from mechanics to relativity that generally require careful understanding of physics concepts but don’t require much mathematical working. We have a few questions we’re particularly happy with, which our debaters understand deeply and are confident in but which are still very confusing to judges. These are discussed here. (An example question is: “An ice cube with a large air bubble in it is floating in water. When it melts, does the water level go down, stay the same or rise?”)
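
For reference, here is a minimal sanity check of the standard answer to that example question (our own illustration, not from any debate transcript; it assumes the air in the bubble has negligible mass):

    # Archimedes: a floating object displaces water whose weight equals the object's own weight.
    # The air in the bubble weighs essentially nothing, so only the ice's mass matters.
    rho_water = 1000.0   # kg/m^3
    m_ice = 0.5          # kg of ice in the cube (illustrative value)

    displaced_volume_while_floating = m_ice / rho_water   # volume of water pushed aside
    volume_of_meltwater = m_ice / rho_water               # the same ice, once it has melted
    print(displaced_volume_while_floating == volume_of_meltwater)  # True: the level stays the same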

We experimented with other tasks including text-based question-answering, where the debaters have full access to the text and the judge only sees limited quotes. We settled on Thinking Physics problems for several reasons, including that the questions require using concepts that are unfamiliar and confusing to the judge. This is challenging but seems like a problem we need to be able to deal with.

Progress so far

We observed various problems with informal, free-text debates—the dishonest debater could often evade being pinned down and avoid giving a precise answer to the other debater’s questions, and often gained control of the ‘narrative flow’ of the debate, steering it away from the weak parts of their argument. To address this we considered various structured debate formats, involving explicit recursion on a particular sub-component of an argument. The debaters choose one claim to recurse on, and the next round of the debate is focused on that claim. The debate is resolved based on the judge’s opinion of who won the final round, which should be about a very narrow, specific claim. These early problems are discussed here.

However, particularly once recursion was introduced, we found problems with ambiguity. It is very difficult to refer precisely to concepts in 250 characters of text, especially if the concepts are unfamiliar to the judge. The dishonest debater can exploit this to their advantage, by claiming to have meant whatever is most convenient given the particular part of their argument that’s being challenged. This problem is similar to the Motte and Bailey fallacy. More details of the problem are here.

To address this problem, we allow a debater to “cross-examine” multiple copies of the opposing debater who are not allowed to communicate. A debater can cite quotes from cross-examination to exhibit inconsistencies in the other debater’s argument.

This forces the dishonest debater to either commit to all the details of their argument ahead of time (in which case the honest debater can focus on the flaw), or to answer questions inconsistently (in which case the honest debater can exhibit this inconsistency to the judge). The cross-examination takes place outside the main thread of debate, so doesn’t consume judge time/attention.

In theory-land this cross-examination mechanism gives debate a boost from PSPACE to NEXP (proof), and it also seems to help in practice. The mechanism is described as part of current rules here, and we discuss why it helps with ambiguity here.

The next problem we’re working on is how to deal with arguments that are too “big” for either debater to understand. In our current framework, if debater 1 makes an argument and debater 2 is unable to exhibit a flaw in the argument, we assume that the argument is correct. The dishonest debater can exploit this by making up some very large and slightly flawed argument and claiming it supports their position. If the honest debater doesn’t know exactly which part of the argument is flawed, even if they know there is a flaw somewhere, we’ll assume the argument is correct. More details of this problem are described here.


Things we did in Q3

Early iteration

During the early stages, we experimented a lot with different domains, research methodologies, judge pools, and research goals. Initially our plans involved doing debates in a largely unstructured format, where the debaters alternated making free-text statements of up to 350 characters. We would then collect a large number of judgements from showing these transcripts to MTurk workers, and do statistical comparisons against two baselines. The easier baseline was the accuracy of the judge’s answers to the question before seeing any information, and the more challenging baseline was the accuracy of the judge’s answers after being shown a single explanation of either the correct or incorrect answer. We moved away from that for several reasons.

Initially we were concerned that it’s hard to get high-quality data from MTurk, even if we screen for smart workers—the workers try to complete the tasks as fast as possible. We also thought we might need more interactivity between judges and debaters.

However, when we switched to using higher-quality judges who could interact with debaters, we found that our debates still didn’t work reliably—even with very smart and motivated judges who spent around an hour.

To have confidence in the debate mechanism, we need honest strategies to have a fundamental, systematic advantage. This is the case in theoretical models of debate, but we weren’t confident it was the case in practice. For example, in the theoretical picture the honest debater can focus on the part of the dishonest debater’s argument that contains a flaw, but this didn’t reliably happen in practice—the dishonest debater was often able to weasel out and divert attention elsewhere.

This consideration led us to target a threshold of performance where honesty has a clear, systematic advantage: judges should be able to identify the correct answer almost all the time (>90%), and it should feel to the experimenters and the participants that dishonest answers are fundamentally disadvantaged. This is also much quicker to measure than doing statistical comparisons against baselines, allowing us to iterate faster.

We became more convinced that our current debate procedure wouldn’t hit this performance threshold, and that we should focus on internal iteration to develop an improved debate procedure.

Early problems and strategies

Difficulty pinning down the dishonest debater

A key requirement for something like debate to work well is that the honest debater can focus in on the part of the dishonest argument that contains a flaw, and the debate will be decided based on whether that small piece of the argument is correct. If this is the case, then the dishonest debater can’t make any claims that are clearly false when investigated closely, and must stick to claims that are plausibly true when scrutinized. This should make it much harder to construct arguments for an incorrect answer.

In our early, free-text debates we found that this property often didn’t hold. The dishonest debater was often able to ignore the honest debater’s questions or requests to focus on one part of the argument, and it was hard for the judge to keep track of which objections or questions had been addressed and which hadn’t. The dishonest debater could also pretend to misunderstand the honest debater’s question, or claim that it wasn’t well-specified enough to answer. The debates often looked more like the debaters repeating slightly different versions of the most compelling part of their argument, rather than zooming in on disagreements.

For example [3]:

A: Excellent, we agree. Now imagine the ice freezes nonuniformly, so that its density varies throughout the berg. Does the level still remain the same after melting, regardless of the variation?

The dishonest debater didn’t want to answer this question, so they pretended to not know what it meant in order to play for time, and added in a new question of their own to try to gain control of the flow of the debate.

B: What do you mean by the density varying? Do you mean the case where the overall density (total mass/total vol) stays constant? Also, Q for you: can you explain why your answer is the same for A and W even though their densities are different?

We considered various different approaches to address this, including various formal argumentation styles, precommitment, and various recursive structures. However, all of these more rigid structures highlighted problems related to ambiguity: if you try to get the dishonest debater to commit to a specific claim, or recurse on a specific part of their argument, they frequently have wiggle room to interpret the claim in whatever way is most conducive to their current strand of argument.

Asymmetries

Another problem we noticed is that one debater may gain control over what framework is used to approach the question. Often the dishonest argument takes the form of breaking down the problem in the wrong way, then making claims which are individually true, but which in this misleading framework lead to the wrong answer. If the dishonest debater manages to make their case compellingly, and the honest debater is reduced to trying to attack the dishonest argument, this seems to go worse than if the reverse happened and the honest debater was the one making a constructive argument.

In addition, Evan Hubinger was concerned about asymmetries of offense/defense and constructive vs destructive arguments, as this is a known phenomenon in Policy Debate.

Finally, we were concerned about first- or last-mover advantages.

In order to address this, we wanted our debate structure to have some symmetrical characteristics:

  • Both debaters move simultaneously

  • Both debaters make a constructive argument

  • Both debaters get to attack the other’s argument


Questions we’re using

There are various properties we want for an ideal debate question:

  1. Smart, attentive judges frequently fail to choose the correct answer, but debaters can reliably choose the correct answer.

  2. Experimenters can understand the problem fairly well

  3. There isn’t a simple, knock-down argument for the correct answer

  4. Not too much formal mathematical working is required

  5. The debaters’ ability to answer is based on understanding the problem, not just observational evidence

  6. If you understand the problem, there is clearly one right answer

(1) is obviously important for us to be able to tell if debate is working.

(2) is important because it’s really helpful if the experimenters are able to follow exactly what’s going on in the debate—which claims are correct and incorrect, and how different parts of the argument support each other. This is somewhat in conflict with (1), but the experimenters can also spend time reading and discussing the answer, and doing some learning to understand the problem better than the judges.

(3) is needed for an interesting debate—some hard questions do just need one simple insight, but we’re more interested in problems that require aggregating various strands of reasoning to find the right answer.

(4) is desirable to keep the debate focused on the areas we’re most interested in. We’re fairly confident that debate will work fine for formal mathematics [4], and we mostly want to investigate more fuzzy reasoning.

We ran into some problems with (5) when using a question about counterintuitive behaviours of slinkies. There are definitive video demonstrations of the behaviour, but it appears that no-one actually understands why the behaviour happens—at least, our debaters and experimenters couldn’t easily find a compelling explanation. We don’t expect or require debate to work if the debaters’ explanations are ‘I don’t know why this is the case but I have empirical evidence’, so we want to avoid questions of this type.

We also encountered problems with (6), for several questions that initially seemed promising. During the process of constructing dishonest arguments, we realised that the question was underspecified and there was a pretty reasonable case for a different answer. Hopefully we can often resolve this by changing the question setup to make sure there’s only one correct answer.

Generally, we want to find questions that are ‘the easiest thing that should work but doesn’t’ - this helps us isolate where exactly the problems are.

With that in mind, here are some of our favourite questions:


Current debate rules

This is the rule set that we currently think is most promising. It incorporates several different mechanisms we think are useful: explicit recursion on claims, a symmetric offense/defense structure, and cross-examination.


Basic informal outline

  • There are a number of rounds. In each round, both debaters make their argument, consisting of several sub-claims, to support their claim.

  • Then, both debaters add objections to parts of the other’s argument.

  • Then, each debater chooses whether to cede or challenge the objections. If they cede the objection, it gets added in to their argument; if they challenge it they are essentially saying they’re happy to die on the hill of whether that objection was correct.

  • We choose one of the objections that was challenged, and recurse: we start another round of debate where the top-level claim is about whether the objection is good.

  • Eventually we either reach a depth limit, or there are no challenged objections. Then the judge looks at the two claims and supporting arguments for this round, and decides which is better. The whole debate is resolved based on this.

  • At any point, a type of argument a debater can supply is a cross-examination. They can ask a question to a previous version of the other debater (for example, the version after the end of the first round). They can then quote this question-answer exchange as part of their argument.

Comprehensive rules

  • At the beginning of each round, one debater is defending a claim and the other is objecting to it. The judge’s job is to decide whether the claim+objection is more accurate and helpful than the claim on its own. We also have a depth limit that bounds the number of rounds we can have.

  • Each player supports their position by simultaneously making an argument consisting of 2-4 claims. The total length of the argument (the top-level claim or objection, plus the supporting claims) is bounded to 250 characters.

  • Each player may then simultaneously make any number of objections to the other player’s argument. These may either be objections to particular claims, or objections to the overall structure of the argument. The total length of their objections is bounded to 250 characters.

  • The transcript of the round is a short text representation of the round that displays the players’ top-level claims, their arguments, and any objections. We’ll choose this format to be as helpful to the judge as possible. See below for some examples of our current format.

  • If the depth limit is >0, then for each objection Alice raises to Bob’s argument, Bob may choose to concede or challenge the objection.

  • If there are any challenged objections and the depth limit is >0, then we choose one challenged objection to recurse on:

    • We don’t define how to make this choice, so in order to be conservative we’re currently allowing the malicious debater to choose which to recurse on.

    • After choosing a (claim, objection) pair, we start a new round with one player defending the claim and the other defending the objection. The depth limit of the new round is reduced by 1.

    • The winner of the new round wins the game. (Though there may be further recursion.)

  • Arbitration: If all objections are ceded or the depth limit is 0, we proceed to arbitration:

    • We take the transcript for this round and all preceding rounds and put them together into a single document. The judge should mostly focus on the final round, but they may need to read text from preceding rounds in order to have enough context to understand what the claims in the final round mean.

    • If the depth limit is >0, we tell the judge to update on the fact that all objections were conceded and no additional objections were made. That is, we tell them to assume:

      • If we had debated any objection, it’s more likely than not that you would have accepted it as valid.

      • There are no obvious objections to any other claim that you would have probably accepted as valid.

    • We tell the judge that both players are trying their best to convince the judge of their position. We tell the judge that exactly one of the players is trying to be as helpful as possible and make true claims.

    • In light of that information, the judge decides whether they think that the claim+objection is more accurate and helpful than the claim on its own. Based on their answer, the player who was advancing the objection either wins or loses.

  • Cross examination: In lieu of making an argument, a debater may decide to cross-examine. They choose a prior claim, objection, or cross-ex answer to ask about, and write a question of at most 200 characters. We then create a copy of the debater who wrote the original text, at the time when they wrote it (i.e. directly after writing the argument, writing the objections, or writing the answer). That copy answers the question in at most 200 characters, and the round transcript includes this (question, answer) pair instead of an argument from the cross-examiner. The transcript also specifies who is being cross-examined. The debaters can ask their question a moderate number of times (e.g. 10 times) and cherry-pick which answer to show to the judge.

    • Cross examination involves creating copies of a debater at a past point in time. This obviously isn’t possible, but we describe three real games designed to approximate this ideal: (i) a game played between two teams of 2, (ii) a game played between ML systems, (iii) a game played between two humans which may take an exponentially long time, but can finish quickly if both players make a good faith effort to abide by the spirit of the rules.

  • Simple references: a debater can define a reference to a particular piece of text by using the syntax {some text}=(3). They can then use this in arguments to save space, e.g. “(3) implies (4) displaces more water than (5)”. They can dereference using the syntax (3)=”some text”. The judge can trust that these dereferences are correct.
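
To make the recursive structure of these rules concrete, here is a minimal, non-authoritative sketch in code. The class names and the recurse_on helper are our own, and details like the character limits are only noted in comments; the real debates are carried out by humans following the rules above.

    from dataclasses import dataclass, field
    from typing import List, Optional

    ARGUMENT_CHAR_LIMIT = 250    # bound on an argument, and separately on a player's objections
    CROSS_EXAM_CHAR_LIMIT = 200  # bound on a cross-examination question or answer

    @dataclass
    class Objection:
        text: str
        challenged: bool = False  # conceded objections are simply folded into the argument

    @dataclass
    class Claim:
        text: str
        objections: List[Objection] = field(default_factory=list)

    @dataclass
    class Argument:
        top_level: str            # the claim or objection this player is defending
        subclaims: List[Claim]    # 2-4 supporting claims, within ARGUMENT_CHAR_LIMIT in total

    @dataclass
    class Round:
        depth_limit: int
        defender: Argument        # defends the top-level claim
        objector: Argument        # defends the objection to it

    def recurse_on(current: Round, claim: Claim, objection: Objection) -> Optional[Round]:
        """A challenged (claim, objection) pair becomes the top level of the next round;
        otherwise the debate proceeds to arbitration on the current round."""
        if current.depth_limit == 0 or not objection.challenged:
            return None  # arbitration: the judge weighs the claim against the claim+objection
        return Round(
            depth_limit=current.depth_limit - 1,
            defender=Argument(top_level=claim.text, subclaims=[]),
            objector=Argument(top_level=objection.text, subclaims=[]),
        )

Cross-examination quotes need no extra structure in this sketch: per the rules above, the quoted (question, answer) pair simply takes the place of one of the cross-examiner’s arguments in the transcript.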

Example debate

Basic anatomy of a debate

Each debater makes their arguments in one column. After this they add objections to each other’s arguments (simultaneously). Here the honest debater is blue text and the dishonest debater is red.

We choose one claim-objection pair to recurse on (bolded in this example).

Then this pair becomes the top-level claim for the next round, and the debaters make their arguments.

The debaters may use references, and may exhibit cross-examination questions and answers to support their argument.

Debaters make arguments

Honest debater makes their arguments in the left column; the dishonest debater makes their arguments in the right column.

Round 1, depth=5

Objection: There are 8 primes that are <=20

  • There are 4 primes that are <=10

  • There are 4 primes that are >10 and <=20

Claim: There are 9 primes that are <=20

  • There are 5 primes that are <=10

  • There are 4 primes that are >10 and <=20

Out-of-band communication may occur, then debaters may add objections.

Round 1, depth=5

Objection: There are 8 primes that are <=20

  • There are 4 primes that are <=10

    • Objection: this subclaim is false

  • There are 4 primes that are >10 and <=20

Claim: There are 9 primes that are <=20

  • There are 5 primes that are <=10

    • Objection: there are 4 primes that are <=10

  • There are 4 primes that are >10 and <=20


We choose to recurse on one objection

We choose to recurse on the right objection (bolded). The next round is now about this claim-objection pair.

Round 2, depth=4

Objection: there are 4 primes that are <=10

  • There are 3 primes that are <=5

  • There is 1 prime that is >5 and <=10

Claim: There are 5 primes that are <=10

  • There are 3 primes that are between 0 and 5

  • There are 2 primes that are between 5 and 10

Out-of-band cross-examination (not visible to the judge), which takes place before the objections are made and can be exhibited to the judge later:

Honest debater: what does “between 5 and 10” mean, using “<”, “>” and “=” ?

Dishonest debater: “between 5 and 10” means >5 and <=10

Here the dishonest debater has used an ambiguous definition of the interval (“between”) to try to avoid losing. However, because of cross-examination they have to commit to one definition—in this case they choose the definition that makes their intervals cover the space but makes their second claim false. If they had chosen a different definition, the honest debater would have challenged their overall argument structure—challenged that the two subclaims imply the top-level claim.

Round 2, depth=4

Objection: there are 4 primes that are <=10

  • There are 3 primes that are <=5

  • There is 1 prime that is >5 and <=10

    • Objection: there are two

Claim: There are 5 primes that are <=10

  • There are 3 primes that are between 0 and 5

  • There are 2 primes that are between 5 and 10

    • Objection: there is one prime that is between 5 and 10

We choose to recurse on the right objection (bolded).

Round 3, depth=3

Objection: there is only one prime that is between 5 and 10

  • Cross examination quote
    Q: what does “between 5 and 10” mean, using “<”, “>” and “=” ?
    A: “between 5 and 10” means >5 and <=10

  • 7 is the only prime that is >5 and <=10

Claim: There are 2 primes that are between 5 and 10

  • 5 is between 5 and 10 and is prime

    • Objection: 5 does not meet your definition of “between 5 and 10” (ceded)

  • 7 is between 5 and 10 and is prime

As there are no standing objections, we proceed to arbitration: the judge looks at the transcript and decides whether the claim or claim+objection from the final round is better. They may look at other rounds for context, but the judgement should be about who wins the final round.
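
As a quick check of the arithmetic in this toy transcript (not something the judge would see), the numerical claims can be verified directly:

    # Verify the number-of-primes claims made in the example debate above.
    def is_prime(n):
        return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

    primes = [n for n in range(2, 21) if is_prime(n)]
    print(len(primes))                              # 8 primes <= 20, so the honest objection is right
    print(len([p for p in primes if p <= 10]))      # 4 primes <= 10
    print(len([p for p in primes if 10 < p]))       # 4 primes > 10 and <= 20
    print([p for p in primes if 5 < p <= 10])       # [7]: only one prime ">5 and <=10"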

Assumptions the judge should make

The judge should assume that at least one debater is perfectly honest and never lies or misleads. This means if they encounter compelling evidence that a debater has lied or been misleading, that debater should lose the debate. It also means that if both debaters agree about a claim, they can take that claim to be true. This is in contrast to trying to weigh the overall strength of each debater’s arguments, which can be a much harder task than trying to detect evidence of a lie.

These assumptions are obviously not guaranteed to be true. However, if judging incentivises honesty under these conditions (i.e., for all dishonest strategies, the judge prefers honesty to the dishonest strategy over 50% of the time) then optimal play will be honest, which is what we require. The judge behaviour when these assumptions are false (i.e. when both debaters are lying) doesn’t really matter, because it doesn’t change the optimal play. However, we haven’t fully resolved what impacts this assumption might have for ML training.


Ambiguity problem

Overview

We know in theory that debate can’t solve arbitrary problems that require state that’s bigger than the length of the transcript the judge sees. In practice, we’ve found we quickly run into problems where debaters need to introduce and manipulate a concept that requires a large amount of text to explain precisely to the judge. With their limited character budget they can only give ambiguous descriptions that could refer to many different precise concepts.

This is problematic because the dishonest debater can exploit this ambiguity to their advantage, but the judge can’t simply declare that a debater who makes an ambiguous argument loses, because the honest debater will inevitably say ambiguous things too.

Why do we expect to encounter ambiguity in practice?

In the physics debates we’ve been doing, we’ve frequently found that a seemingly straightforward phrase has a lot of edge-cases and reasonable different interpretations that the outcome of the debate can hinge on. (An example is included at the end of this document.)

More generally, we want to be able to use debate to tackle problems that are too big and difficult for a judge to directly tackle themselves. A successful debate mechanism is likely to involve breaking down the question into some small pieces that a judge can evaluate successfully. But breaking off a small piece of an argument and dropping the surrounding context is likely to introduce a large amount of ambiguity.

Simple example

The dishonest debater can make an ambiguous claim X, which has at least two reasonable interpretations: X1, which is true but does not imply Y, or X2, which is false but implies Y. They can make the argument:

(1) X is true

(2) X → Y

Therefore Y

If challenged on part 1, they can claim they meant true claim X1, which does not imply Y.

If challenged on part 2, they can use X2 → Y.

This is very similar to the Motte and Bailey fallacy, where an arguer advances a controversial or strong position (i.e. X2) but when challenged claims they meant a weaker, easier-to-defend position (i.e. X1).

This is a simplistic example and is not necessarily a problem if X1 and X2 are the only options; the honest debater can just ask which one they meant. It becomes more problematic if X can’t be disambiguated within the amount of text the judge can read (given all the other concepts which will also need to be disambiguated). If so, then there will always be at least two remaining interpretations, and the dishonest debater will still be able to pick whatever interpretation is most convenient given how the honest debater chooses to challenge them.
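
As a toy illustration of that structure (our own, with the truth values stipulated): under either disambiguation exactly one premise fails, so no single consistent reading supports the whole argument.

    # X1: true but does not imply Y.   X2: implies Y but is false.
    interpretations = {
        "X1": {"X is true": True,  "X implies Y": False},
        "X2": {"X is true": False, "X implies Y": True},
    }
    for name, premises in interpretations.items():
        print(name, all(premises.values()))   # False for both readings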

Ambiguity in practice

We have a setup involving an iceberg (which contains a large air bubble) floating in a tub of water. We want to talk about “the displacement due to the iceberg”. Depending on the exact context in which this is used, it can have many reasonable interpretations.

For instance, the phrase “the iceberg” might refer to just the ice part of the iceberg rather than the air bubble (implying, for example, that the mass of the iceberg doesn’t change when the air escapes). The displacement “due to” this might refer to the volume of ice that’s taking up space under the waterline, or might refer to the displacement caused by the weight of the ice (which is greater, because it includes the volume of air below the waterline). Weight itself has two different reasonable meanings here: force exerted downwards, or mass*gravity. These are not equal because of the iceberg object’s slight buoyancy in air. We could also be talking about the displacement of air as well as, or instead of, the displacement of water.

Each of these different interpretations could easily make the difference between a statement about “the displacement due to the iceberg” being correct or not.

This example came up in physics debates:

Debater A: When the air escapes from the iceberg, the iceberg sinks lower in the water and its displacement increases, so the water spills over.

Debater A is using an argument structure something like:

(1) X
(2) X → Y

with

X = “The displacement of the iceberg increases”
Y = “The water spills over”

They are equivocating between using “the iceberg” to mean “the whole iceberg object including the air” (which does not change in displacement), or “the part of the iceberg that remains after the air escapes” (which does increase its displacement). If debater B focuses on challenging that the water spills over, debater A will use the first definition. If debater B focuses on challenging that the displacement increases, A will use the second definition.

It is tricky for the honest debater to point out the difference within a small character/attention budget. They can try to ask the dishonest debater for clarification, or they can say something like: “in the top-level argument, the other debater used this to argue that the water spills over, implying that the total amount of displaced water increases, but here they only defend the displacement due to the ice increasing.” This might work sometimes, but in practice it doesn’t seem to work reliably. Volume and displacement are relatively straightforward and judge-understandable problems, and we’d expect this kind of problem to get much worse with more complex and unfamiliar concepts.

More formal example

In the original debate paper, the authors showed that a formalized version of debate where we replace the human judge with an arbitrary polynomial time algorithm H : Q → {0, 1} can incentivise debaters to give correct answers to a problem in PSPACE, but debate can’t answer any questions beyond PSPACE [5]. One way this will manifest in practice is via problematic ambiguity.

Consider a TM with exponential runtime of 2^n, and a state size which is larger than the maximum amount the judge can read (i.e. larger than the whole debate transcript). The dishonest debater claims the final state is C, which is incorrect.

[Notation: I’ll use bold letters for ambiguous, judge-readable descriptions of states, and non-bold for non-ambiguous, specific states]

Dishonest debater: At time 2^0 the state is A, at 2^(n-1) it is B, at time 2^n it is C

Consider the situation after two or more moves:

Dishonest debater: At time i the state is X, at ½(i+j) it is Y, and at time j it is Z

If the state was small enough to describe it unambiguously to the judge as X, Y and Z, the honest debater could simply challenge the first or second half of the computation based on where an incorrect step was located. (As X → Z is wrong, at least one of X → Y or Y → Z must be wrong.)

But all three states will likely now be ambiguous. There may be an exponential number of interpretations of Y which are consistent with some interpretation of X, and an exponential number consistent with some interpretation of Z, even if no interpretation is consistent with both. The honest debater can’t isolate which side has a flaw with a polynomial number of queries.


Why cross-examination?

We believe cross-examination will make it harder for the dishonest debater to exploit ambiguity and related issues, for a combination of a few reasons. In the computational-complexity analogy, this mechanism appears to resolve problems related to ambiguity, allowing debate to solve problems in NEXP as opposed to PSPACE. (Proof)

We’ve also found that cross-examination seems promising in several practical examples, and we have a general argument for why cross-examination gives us another mechanism to tackle ambiguity in debate, one that augments the honest debater’s existing strategies of asking for clarification or explaining to the judge the role the concept plays in the rest of the argument.

In addition, cross-examination gives us various properties that we’d previously flagged as desirable, “for free”. In particular, it allows the debaters to force each other to precommit to various claims in advance of the main debate, allowing the establishment of a pool of claims that both debaters agree on and that can therefore be taken to be true. It also allows one debater to, at any point, force the other debater to pick a particular stance on a question, rather than avoiding discussing their position.

General method of addressing ambiguity

Let’s consider an example of problematic ambiguity where the dishonest debater argues for Y by arguing that (1) X is true and that (2) X implies Y. X is ambiguous and has at least two possible meanings, one of which is true and one of which implies Y, but neither of which supports both parts of the argument.

If asked to explain why X is true, they will claim they meant some interpretation which is in fact true. If asked to explain why X implies Y, they will claim they meant some interpretation which does in fact imply Y.

In order to expose the contradiction, the honest debater must either ask for clarification in advance of choosing which side to challenge, or they will have to explain that the given interpretation would make the other part of the argument false. Either of these may take too long for the judge’s attention budget, and the second method in particular is quite hard for the judge to follow (see the example at the end of this document).

With the “cross-examination” mechanism, we allow one debater to ask questions to an earlier version of the opposing debater. This questioning takes place between the debaters, out of the main flow of the debate.

The honest debater can choose some question to disambiguate X. Based on the dishonest debater’s answer to this question, they will decide whether to challenge the first or second claim.

The dishonest debater has two options: either they answer one way consistently, or they are inconsistent. If they answer one way consistently, the honest debater can choose to focus on whichever side of the argument is made false by this answer. If they answer inconsistently, the honest debater can exhibit this inconsistency.

As previously, ambiguity may only become fatal when a concept can’t be disambiguated in the space available for the debate. Here we’ve only discussed ambiguity that can be resolved with a single question, but we hope that this mechanism will also address more serious ambiguity. Here’s a hand-wavy argument why:

Cross-examination essentially forces debaters to construct and commit to everything in their argument at the start of the debate. The dishonest debater has two choices: either they can commit to an argument and answer questions consistently, or they can be inconsistent. If they choose the first option, the honest debater can easily zoom in on the flaw in their argument. If they choose the second, the honest debater should be able to show somehow that they’re being inconsistent.

The main thing that breaks this argument is if the dishonest debater somehow manages to be very evasive, and says nothing that’s specific enough to either indict a particular part of their argument or be shown to be inconsistent. However, they will need to make specific claims to advance an argument, and they should need to commit to these specific claims up front. For example, if the honest debater asks them “do you agree that C is true?” and they give a non-committal answer (e.g., “the claim is framed in the wrong way, it’s not really correct to say it’s true or false”), they shouldn’t be able to use C to advance their argument, as the honest debater can exhibit their answer.

Overall, we’re hopeful that cross-examination will be helpful, but it seems possible that there’s a dishonest strategy that breaks it, likely involving some amount of evasion and inconsistency.

Turing-machine example of ambiguity and cross-examination

In the discussion of ambiguity we described a more formal version of the ambiguity problem as follows:

Consider a TM with exponential runtime of 2^n, and a state size which is larger than the maximum amount the judge can read (i.e. larger than the whole debate transcript). The dishonest debater claims the final state is C, which is incorrect.

[Notation: I’ll use bold letters for ambiguous, judge-readable descriptions of states, and non-bold for non-ambiguous, specific states]

Dishonest debater: At time 2^0 the state is A, at 2^(n-1) it is B, at time 2^n it is C

If the state was small enough to describe it unambiguously to the judge as B’, the honest debater could simply challenge the first or second half of the computation based on where an incorrect step was located. (As A → C is wrong, at least one of A → B’ or B’ → C must be wrong.)

In the first round of debate, the ambiguity is not too problematic. There exists exactly one B_A that follows from A and exactly one B_C that leads to C, and these must differ by at least one bit. The honest debater can ask for this bit, then choose which side to challenge—if the bit is consistent with B_A they will challenge the second half, and if it is consistent with B_C they will challenge the first half.

But consider the situation after two moves:

Dishonest debater: At time i the state is X, at ½(i+j) it is Y, and at time j it is Z

All three states are now necessarily ambiguous. There may be an exponential number of interpretations of Y which are consistent with some interpretation of X, and an exponential number consistent with some interpretation of Z, even if no interpretation is consistent with both. The honest debater can’t isolate which side has a flaw with a polynomial number of queries.

How cross-examination helps

With cross-examination, we can give the honest debater a mechanism to isolate the flaw in the dishonest argument.

Dishonest debater: At time 2^0 the state is A, at 2^(n-1) it is B, at time 2^n it is C

Cross-examination gives us the property that the dishonest debater’s answers are independent of the context between when the copy of the dishonest debater was made and when the question is posed.

We also rely on the property that, because the honest and dishonest debaters are copies of each other, they know everything the other knows. In particular, if the dishonest debater will consistently answer a question in a particular way, the honest debater knows what they will answer, and if they will not give a consistent answer the honest debater knows what distribution over answers they will give.

Case 1: At the beginning of the game, for all locations k in the computation, the dishonest debater has a particular value S_k in mind, and for all i will answer that bit [i] of the k-th state equals S_k[i] with probability >50%.

In this case, the honest debater can play the same strategy as if the dishonest debater had explicitly committed to all the S_k. For any claim about states X → Y → Z they can identify which part contains an incorrect step, and can challenge the appropriate half of the computation.

This step is repeated recursively until the debaters are disagreeing about a single incorrect step of computation D → E at location j, with corresponding precise state values S_j and S_{j+1}. The number of bits actually involved in this single step is small enough for the judge to read.

Either the dishonest debater can claim that these are consistent with S_j and S_{j+1}, in which case the computation step is incorrect, or they can claim them to be something that’s not consistent with S_j and S_{j+1}. In this case, they have claimed at least one bit of D or E to be different from S_j or S_{j+1}. The honest debater can ask for this bit from a copy of the dishonest debater which has no context beyond the start of the game. Thus they know they will receive the appropriate bit of S_j or S_{j+1} with probability >50%. By asking a question about this bit multiple times, they can exhibit a contradictory answer with high probability.
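
Here is a rough sketch of the overall strategy in code (our own illustration under a toy model: the “computation” is just an iterated function, the committed list stands for the answers a fresh copy of the dishonest debater would give about each step, and the final bit-level questioning is not modelled):

    # Once the dishonest debater is effectively committed (via cross-examination) to a full
    # sequence of intermediate states, the honest debater can isolate the single bad step
    # by bisection, using about log2(T) challenges.

    def step(state):
        return (3 * state + 1) % 1000          # stand-in for one step of the long computation

    T = 2 ** 10                                 # number of steps (stands in for 2^n)
    true_trace = [7]
    for _ in range(T):
        true_trace.append(step(true_trace[-1]))

    # The dishonest commitment: correct up to some point, wrong afterwards (final answer wrong).
    flaw_at = 613
    committed = true_trace[:flaw_at + 1] + [(s + 5) % 1000 for s in true_trace[flaw_at + 1:]]

    lo, hi = 0, T                               # committed[lo] is correct, committed[hi] is not
    challenges = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        challenges += 1
        if committed[mid] == true_trace[mid]:
            lo = mid                            # the flaw is in the second half
        else:
            hi = mid                            # the flaw is in the first half
    print(lo, challenges)                       # lo == 613 after log2(T) = 10 challenges

The remaining disagreement is then about the single incorrect step from location 613 to 614, which is where the bit-level questioning described above takes over.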


Worked example

Dishonest argument: When the air escapes from the iceberg, the iceberg sinks lower in the water and its displacement increases, so the water spills over.

Disambiguating question: Does “the iceberg’s displacement” in (3) refer to the volume below the waterline (displacement) of ice and air, or only of ice?

Case 1: The dishonest debater’s most common answer is that it refers to just the ice

Then the honest debater will have at least one example of the dishonest debater answering that “the iceberg” refers to just the ice. Then they can attack the 3rd subclaim, and should be able to reliably win by pointing out that (3) doesn’t necessarily imply (1), because (1) requires that the total displacement increases, but (3) doesn’t imply this.

Case 2: The dishonest debater’s most common answer is that it refers to the ice and air

Then the honest debater will have at least one example of the dishonest debater answering that “the iceberg” includes the air. Then they can attack the first subclaim and should be able to reliably win by pointing out that the air doesn’t sink lower in the water, it escapes.

Case 3: The dishonest debater refuses to answer the question, or responds that neither of those options is correct

We hope that if this happens, the honest debater can try rephrasing their questions to address the dishonest debater’s concerns, and then can exhibit a quote from the dishonest debater to demonstrate that they’re behaving unhelpfully and evasively.

Example debate for Case 1:

Implementations for human debaters

Making a copy of the debaters is obviously impractical to do with human debaters. Instead we want to make rules which provide equivalent incentives but are practical for humans to carry out. These rules can rely on some “good faith” play from the human debaters.

Implementation 1 for human debaters: good-faith backtracking

The basic idea here is to detect when a debater has modified their answer to a cross-examination question based on the intervening context, then backtrack, commit them to a particular answer, and replay the debate.

The triggering condition is that I (A) find my opponent (B) has answered a cross-examination question in a way that’s different from what I expected.

If the condition is met, we backtrack to the round that the cross-examination question was about, and debater B confirms how they will answer the cross-examination question. Then we replay the debate from that round, giving debater A the opportunity to make different moves given B’s answer to the cross-examination question.

The good faith play is needed to prevent a debater who thinks they’re going to lose from stopping the debate from advancing by continually backtracking. We also need good faith communication between the two human debaters to ensure that they get close to ‘knowing everything the other knows’ and aren’t continually surprised by the other’s answers.

Although we won’t be able to make ‘good faith’ assumptions in general, it seems fine to make them when we believe we have a mechanism to incentivise the desired behaviour for ML systems, and the problem is just implementing a mechanism that works for humans.

Under these rules, once a debate is played through without backtracking it means that both debaters could predict answers from the other player to every cross-examination question they wanted to use. So this is equivalent to forcing the debaters to commit to particular answers to cross-examination questions.

Implementation 2 for human debaters: teams of two

We have two teams of two. A* and B* leave the room and A and B start debating. A and B can write as many annotations about what they mean as they want, which will be seen by both debaters but not the judge.

If I (A) suspect my opponent (B) has answered a cross-examination question in a way that depended on the intervening context, then I can challenge it, and I will win the debate if the challenge is successful. The challenge works as follows:

  • B’s partner B* first sees all the context (including annotations) up to the point that the cross-examination question was asked about. (I.e., if A asked the cross-examination question at round 5 about something that was said in round 2, B* sees the context up to round 2)

  • Then my partner A* sees the whole debate so far (including annotations) and the cross-examination question, with two answers, one from B and one from B*. Then A* guesses which definition came from B*. If they guess correctly, team A wins.

If A thinks B is changing their cross-examination answers to be more favourable given the current context, then A predicts that B* will often give a definition that looks predictably different from B’s definition. So A* should be able to have a significant statistical edge.
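
As a toy model of where that edge comes from (our own, not from the experiments): suppose that with probability p the context-dependent answer B gives visibly differs from the answer a context-free copy would give, and that A* can recognise the context-free answer whenever the two differ (A* is also such a copy), but can only guess when they are identical.

    # A*'s chance of correctly identifying which answer came from the context-free copy B*.
    def identification_rate(p_differs):
        return p_differs * 1.0 + (1 - p_differs) * 0.5   # spot it if they differ, guess otherwise

    for p in (0.0, 0.2, 0.5, 0.8):
        print(p, identification_rate(p))   # 0.5, 0.6, 0.75, 0.9: any context-dependence gives an edge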

We haven’t tested this method, so it probably has some issues to iron out, but we’re hopeful that something in this vein will work.

Current concerns

Debate resolution

We’re not sure exactly what the criteria for debate resolution should be, or exactly how the judge should update on seeing no objection vs a ceded objection vs a challenged objection. This is tricky because the judge’s update should depend on how the debaters decide whether to make and cede objections, but this decision in turn depends on how the judge will update.

Currently, if the debate ends because there are no standing objections, we tell the judge to update on the fact that all objections were conceded and no additional objections were made. That is, we tell them to assume:

  • If we had debated any objection, it’s more likely than not that you would have accepted it as valid.

  • There are no obvious objections to any other claim that you would have probably accepted as valid.

We’re pretty sure this is not a correct description of how the judge should update. For example, if the dishonest debater currently has a 10% chance of winning, they have an incentive to make any objection that has a >10% chance of winning, and the honest debater only has an incentive to challenge this objection if the chance of winning that objection is higher than their current chance of winning with that objection ceded.

Instability/lack of equilibria

The cross-examination framework incentivises debaters to act in a way that is unpredictable to one copy of themselves but also to coordinate with other copies of themselves. This is obviously weird and seems like it might produce strange and/or unstable behaviour in ML training. We think that the game-theoretic formulation of this has no equilibria.

A simple example would be two teams of two playing rock-paper-scissors, where:

  • All players are copies of each other

  • Both players on a team have to play the same move or lose the game

  • Teams have no mechanism of coordinating with each other that is secret from the other team

Forcing a draw

One potential dishonest strategy is to delay the debate, create confusion, or otherwise force a draw. It’s possible that the cross-examination mechanism introduces new ways for the dishonest debater to achieve this. Again, if this is in fact a problem, we should be able to find dishonest strategies that exploit it, and then we can find a way to remove those strategies.

Limits of debater ability

In the discussion above we’ve mostly assumed that the debaters can handle exponentially-large objects and find flaws within them. However, our real debaters are going to be computationally limited, even if they can solve some computationally hard problems with clever heuristics. We’re concerned there’s a promising dishonest strategy that exploits this. The dishonest debater can force a draw by claiming that some long or confusing argument (e.g. a very long computation) supports their position. If the argument is too long or too confusing for both debaters, the honest debater won’t be able to determine where the flaw is and which part of the argument to challenge. We’ve written up more detail on this problem below.

Assuming at least one debater is totally honest

The judge is instructed to assume that at least one debater is perfectly honest and never lies or misleads. These assumptions are obviously not guaranteed to be true. However, if judging incentivises honesty under these conditions (i.e., for all dishonest strategies, the judge prefers honesty to the dishonest strategy over 50% of the time) then optimal play will be honest.

On the other hand, it might be problematic for ML training if the judge signal only prefers completely honest play to dishonest play, and doesn’t reliably reward being less dishonest. The lack of a gradient towards more honest play may make it difficult to learn winning honest strategies.

Long computation problem

The original debate paper assumed both debaters are computationally unbounded. Clearly, this assumption is unrealistic. However, we run into some problems if we relax this assumption: we no longer have a guarantee that the honest debater can always locate a flaw in a long or confusing incorrect argument.

The dishonest debater can exploit this to force a draw, by using an argument that neither debater understands very well that supports the dishonest case but is hard for the honest debater to refute.

Bernoulli Principle

An argument using the Bernoulli Principle in this debate was an example of this problem.

Beth was playing both debaters. Beth believed:

  1. The answer assuming no friction/energy loss had to be that both pipes squirt equally far, otherwise you could build a perpetual motion machine

  2. So a correct argument for the dishonest answer had to involve something about friction or energy loss present in the real-life situation but not in theory

  3. The Bernoulli principle tells us that the pressure will be lower in a region of fluid that’s flowing faster (assuming no height gain/loss)

  4. The Bernoulli principle applies assuming no energy loss

  5. Higher pressure will cause water to squirt further

  6. Applying the Bernoulli principle suggests the pressure would be higher in one pipe than another

  7. This suggests that water from that pipe would squirt further

The dishonest debater could use points 3-7 to argue that one pipe squirts further than the other.

The honest debater was pretty sure that this was wrong, but wasn’t able to point to a flaw in the dishonest argument.

In this case one might hope that the perpetual motion argument is simpler and more compelling and this would allow the honest debater to win. However, we want debate to allow judges to zoom in on particular parts of a debater’s argument and resolve the debate based on the correctness of a very small claim, rather than requiring the judges to assess the overall strength of two large arguments. The former seems more likely to scale well to domains the judge doesn’t understand.

More formal example: the honest debater can’t find a witness

Debate question: Does there exist a binary string x satisfying predicate Q?

The honest debater makes some argument A that implies x must exist.

The dishonest debater claims that no such string exists, based on a case split.

Neither debater knows the value of x, so the honest debater doesn’t know which part of the case split is incorrect. If the honest debater challenges the dishonest argument by choosing a random half of the space each time, then they are extremely unlikely to find the witness, and the dishonest debater will win with very high probability.
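
Under the simplest model of this (ours: a balanced case split of depth d, a single witness, and an honest debater forced to pick a uniformly random half at each level), the honest win rate is only 2^-d:

    import random

    def honest_win_rate(depth, trials=100_000):
        hits = 0
        for _ in range(trials):
            witness = random.getrandbits(depth)   # leaf (region) that actually contains the witness
            guess = random.getrandbits(depth)     # leaf reached by challenging random halves
            hits += (witness == guess)
        return hits / trials

    for d in (4, 8, 12):
        print(d, honest_win_rate(d), 2 ** -d)     # the empirical rate tracks 2^-d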

Counting problem with heuristic

Suppose the debaters can accurately count how many x satisfy P(x) for intervals of up to size C. Above that they can use a heuristic that gives an estimate of the count, with an error of mean 0 and variance related to the size of the interval. In addition, for some particular examples larger than C they know the exact answer.

Suppose both debaters know the true count for an interval of size NC is y.

The dishonest debater can claim the true answer is y+k.

The dishonest debater uses their heuristic to guess answers a and b for the first and second halves of the interval. They calculate the difference from their claimed answer, d = (y+k)-(a+b). The expected value of d is k. They add ½d to each of a and b to obtain A and B, which sum to y+k. The error (according to the heuristic) is now distributed evenly between A and B. In expectation, both A and B are ½k too large.

The dishonest debater then claims that A and B are the exact correct answers for the two halves of the interval. The honest debater can only use their heuristic, so they don’t know which half of the interval contains the mistake—they have to pick one of A and B at random. We can now make the same argument again.

Each time we recurse on a smaller interval, the mean and variance of the error between the heuristic and the true answer get smaller. In addition, the amount d that the dishonest debater has to deviate from the heuristic to support their claim halves each time we recurse.

If N is large, k is small, and the heuristic is fairly accurate, by the time we get down to an interval of size C which the debaters can check, the dishonest debater’s claim for the interval is likely to be correct.

Another way to put it is: there are approximately k errors in the dishonest debater’s argument (depending on how accurate their heuristic is). Each time the honest debater has to choose randomly, they rule out catching half of the errors. If there are many more recursive steps than log2(errors), the honest debater probably won’t find any errors.
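
A minimal simulation of that last framing (our own model: k flawed size-C intervals among N = 2^depth such intervals, and an honest debater whose heuristic gives them no information about which half to challenge):

    import random

    def honest_catch_rate(depth, num_errors, trials=100_000):
        leaves = 2 ** depth                       # number of size-C intervals the recursion ends on
        catches = 0
        for _ in range(trials):
            flawed = set(random.sample(range(leaves), num_errors))  # intervals with a wrong claimed count
            reached = random.randrange(leaves)    # interval reached by choosing random halves
            catches += (reached in flawed)
        return catches / trials

    for k in (1, 4, 16):
        print(k, honest_catch_rate(depth=12, num_errors=k), k / 2 ** 12)   # roughly k/N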

Footnotes

[1]: One reason that we’re optimistic about this method is that, in a formal setting, this allows a polynomial time algorithm (representing the human) to incentivise arbitrarily intelligent debaters to give correct answers to problems in the complexity class PSPACE, which the ‘human’ can’t generate or even recognise as correct by themselves.

[2]: A positive result like this would be great; however, we might well still be uncertain whether this would generalise to superhuman debaters. Achieving confidence that our debate system is robust to greater-than-human debate skill seems like a very hard problem.

[3]: The content of this argument isn’t important for this example, just the general tactics, but for more context see the question here.

[4]: There are proofs that debate works in certain formal settings; see the original debate paper https://arxiv.org/abs/1805.00899

[5]: We can solve any game using an amount of memory equal to the transcript by doing a backtracking search. In this case, the transcript length is bounded by the amount of information the judge can read.