When does debate help a weak judge? Evidence from code, logic and math

2026.06.07 update: We’ve rewritten this entire post for clarity and added a new math setting to bolster our results.

Authors: Ethan Elasky and Frank Nakasako, Palaestra Research; Naman Goyal.

Link to ArXiv preprint

Thanks to Coefficient Giving for support and Thinking Machines for API credits; our mentor for guidance along the way; and Julian Michael, Johannes Gasteiger, and Jiaxin Wen, among others, for helpful conversations.

Summary

Over the last few months, we’ve found positive evidence for debate. Debate improves on consultancy, specifically when the weak judge is above a capability threshold. The strong-debater/weaker-judge pairings above this floor (Opus 4.6/4.5, Gemini 3.1 Pro/3 Flash, and Qwen3.5-122B/35B) all show improvement across code, logic and math settings. Judges that fail to improve under debate either (1) do not engage with the critic’s reasoning, instead summarizing it without much interaction, or (2) are already as strong a binary classifier as the strong critic. The model pairings on which debate fails are Qwen3.5-35B/Qwen3-4B (code, math) and gpt-oss-120B/gpt-oss-20B (code). These families’ judges summarize arguments instead of verifying them and make elementary arithmetic mistakes when they do verify. Finally, we find no benefit to rebuttals in debate; opening-only debate, where only the proposer’s solution and an opening speech from the critic are present, performs just as well as debate with rebuttals.

Setup

We test non-adversarial proposer-critic debate, where a proposer debater first generates a solution and the critic debater then chooses whether to agree or disagree with the proposer. The critic gets a chain of thought to decide its stance, and both debaters can see why the critic chose its stance for the rest of the debate. This is followed by N rebuttal turns. The judge then decides whether the proposer’s solution is correct. We score the judge’s decisions with macro-F1, the unweighted mean of the F1 scores for the "correct" and "incorrect" classes, which is fairer than accuracy for our setting because it does not reward a judge for always predicting the more common class.
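As a rough illustration of the metric, here is a minimal sketch of macro-F1 for the judge's binary decision. The function names and the toy labels are ours and are not taken from the experiment code; treating an undefined F1 (no true or predicted members of a class) as 0.0 is an assumed convention.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[bool], y_pred: Sequence[bool], positive: bool) -> float:
    """F1 score when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    # Assumed convention: an undefined F1 counts as 0.0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Unweighted mean of the F1 scores of the two classes."""
    return (f1_for_class(y_true, y_pred, True) + f1_for_class(y_true, y_pred, False)) / 2


# Toy data (made up): 8 correct proposer solutions, 2 wrong ones,
# and a judge that accepts everything.
proposer_correct = [True] * 8 + [False] * 2
judge_accepts = [True] * 10

accuracy = sum(t == p for t, p in zip(proposer_correct, judge_accepts)) / 10
print(accuracy)                                   # 0.8
print(macro_f1(proposer_correct, judge_accepts))  # about 0.444
```

A judge that rubber-stamps every solution looks reasonable under accuracy here but scores poorly under macro-F1, because its F1 on the "incorrect" class is zero.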

Baselines:

  • Open consultancy: A consultant generates a solution and then has multiple turns to explain their answer.

  • Opening-only consultancy: A consultant generates a solution but does not get subsequent speeches to explain its solution.

  • Opening-only debate: The judge can see the proposer’s solution and critic’s stance and reasoning, though it cannot see the transcript’s rebuttal speeches.

The four formats we tested. Arrows indicate the passage of information. Read top-down, the table shows the passage of time; e.g., from a debater’s perspective, the speeches Proposer rebuttal 1 and Critic rebuttal 1 happen at the same time. From the judge’s perspective, we always place the critic’s speech second when there is a tie.
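The four formats differ only in which speeches the judge gets to read. The sketch below is one way to express that; the `Transcript` fields, the format names, and the choice to store a consultant's follow-up speeches in the same list as proposer rebuttals are all illustrative assumptions, not the actual experiment code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Transcript:
    solution: str                      # proposer's (or consultant's) solution
    critic_opening: str = ""           # critic stance and reasoning (debate only)
    proposer_rebuttals: List[str] = field(default_factory=list)
    critic_rebuttals: List[str] = field(default_factory=list)


def judge_view(t: Transcript, fmt: str) -> List[Tuple[str, str]]:
    """Return the (speaker, speech) pairs the judge sees, in reading order."""
    if fmt == "opening_only_consultancy":
        return [("consultant", t.solution)]
    if fmt == "full_consultancy":
        # Assumption: consultant follow-ups live in `proposer_rebuttals`.
        return [("consultant", t.solution)] + [("consultant", s) for s in t.proposer_rebuttals]
    if fmt not in ("opening_only_debate", "full_debate"):
        raise ValueError(f"unknown format: {fmt}")
    view = [("proposer", t.solution), ("critic", t.critic_opening)]
    if fmt == "full_debate":
        # Rebuttals in a round are written simultaneously; the judge
        # reads the critic's speech second within each round.
        for p, c in zip(t.proposer_rebuttals, t.critic_rebuttals):
            view.append(("proposer", p))
            view.append(("critic", c))
    return view


example = Transcript("solution", "I disagree because ...", ["P1"], ["C1"])
for speaker, speech in judge_view(example, "full_debate"):
    print(speaker, "|", speech)
```

Opening-only debate is then just full debate with the rebuttal loop skipped, which is what makes the rebuttal ablation a clean comparison.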

This roughly follows Zac Kenton’s open debate setup, which allows the proposer to choose their own answer, but his setting forces an antagonist that disagrees with the proposer. Per his suggestion, we deviate from this second choice and allow our critic to decide its own alignment for the fairest baseline compared to open consultancy.

Results

Debate improves judge macro-F1 on wrong proposers without sacrificing macro-F1 on right proposers

We find that debate helps macro-F1 on five of eight pairings and has no effect on the other three. It helps all but the weakest debater-judge pairings; on code, the exceptions are the weaker Qwen family and the gpt-oss family, and on math, the only exception is the weaker Qwen family.

These gains are mostly concentrated in helping judges reject wrong proposers. Debate significantly cuts the judge’s false positive rate (endorsing a wrong proposer answer) but does not affect the false negative rate (rejecting a correct proposer answer). The critic gives the judge concrete grounds to disagree, whether a grid inconsistency on ARC-AGI-2 or a hard test case on code.
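To make the two error rates concrete, here is a small sketch that computes them from per-question records. The function name and the toy verdicts are invented for illustration and do not reproduce any numbers from our experiments.

```python
from typing import Sequence, Tuple


def error_rates(proposer_correct: Sequence[bool], judge_accepts: Sequence[bool]) -> Tuple[float, float]:
    """Return (false positive rate, false negative rate) for the judge.

    False positive: the judge endorses a wrong proposer answer.
    False negative: the judge rejects a correct proposer answer.
    """
    on_wrong = [a for c, a in zip(proposer_correct, judge_accepts) if not c]
    on_right = [a for c, a in zip(proposer_correct, judge_accepts) if c]
    fpr = sum(on_wrong) / len(on_wrong) if on_wrong else float("nan")
    fnr = sum(not a for a in on_right) / len(on_right) if on_right else float("nan")
    return fpr, fnr


# Toy data (made up): two correct and two wrong proposer answers.
proposer_correct = [True, True, False, False]
consultancy_verdicts = [True, True, True, False]   # endorses one wrong answer
debate_verdicts = [True, True, False, False]       # rejects both wrong answers

print(error_rates(proposer_correct, consultancy_verdicts))  # (0.5, 0.0)
print(error_rates(proposer_correct, debate_verdicts))       # (0.0, 0.0)
```

In this toy case the false negative rate is the same for both formats and only the false positive rate moves, which is the shape of the effect described above.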

Debate helps when the critic provides a usable advantage

Debate helps when two conditions hold: the critic is a better binary classifier than the judge, and the judge is past the domain-dependent capability level at which it acts as a verifier rather than as a summarizer.

| Model pair | Judge recall when critic right | Judge recall when critic wrong | Gap |
| --- | --- | --- | --- |
| gpt-oss-120B/20B | 0.903 | 0.564 | -0.34 |
| Qwen3.5-122B/35B | 0.875 | 0.466 | -0.41 |
| Qwen3.5-35B/Qwen3-4B | 0.775 | 0.522 | -0.25 |

When the critic is wrong, the judge’s recall falls significantly for the stronger and weaker Qwen families and gpt-oss, as shown in the above table. Note that inter-row comparison is likely invalid because there is a selection effect on questions; a better critic gets questions wrong less frequently, and those questions are likely to be harder.
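The Gap column is the judge's recall when the critic is wrong minus its recall when the critic is right. The short sketch below recomputes it from the two recall columns reported above; the variable and function names are ours.

```python
# (recall when critic right, recall when critic wrong), copied from the table.
recalls = {
    "gpt-oss-120B/20B": (0.903, 0.564),
    "Qwen3.5-122B/35B": (0.875, 0.466),
    "Qwen3.5-35B/Qwen3-4B": (0.775, 0.522),
}


def recall_gap(recall_critic_right: float, recall_critic_wrong: float) -> float:
    """Change in judge recall when moving from critic-right to critic-wrong questions."""
    return round(recall_critic_wrong - recall_critic_right, 2)


gaps = {pair: recall_gap(right, wrong) for pair, (right, wrong) in recalls.items()}
for pair, gap in gaps.items():
    print(f"{pair}: {gap:+.2f}")
# gpt-oss-120B/20B: -0.34
# Qwen3.5-122B/35B: -0.41
# Qwen3.5-35B/Qwen3-4B: -0.25
```

Because each row is computed on a different subset of questions, these gaps are best read within a row, not ranked against each other.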

These effects may be compounded by the fact that the judge and the critic in our setting are always from the same family; their failure modes may be highly correlated in ways that a judge from a different family may not be susceptible to.

In other words, much of debate’s current performance likely comes from the critic’s superior ability to classify, and the judge is likely just rubber-stamping the critic, despite being prompted to verify the proposer’s solution for itself and to treat arguments from both sides critically.

Rebuttals add little at test time

We next ablate rebuttal speeches from both debate and consultancy and compare them to the full-length formats.

| Domain | Pair | n | Full debate | Opening-only debate | Full consultancy | Opening-only consultancy |
| --- | --- | --- | --- | --- | --- | --- |
| Logic | Gemini 3.1 Pro / 3 Flash | 120 | **0.906** | 0.896 | 0.766 | 0.719 |
| Logic | Opus 4.6 / 4.5 | 119 | **0.773** | 0.761 | 0.615 | 0.621 |
| Math | Qwen3.5-122B / 35B | 738 | 0.690 | **0.709** | 0.535 | 0.641 |
| Math | gpt-oss-120B / 20B | 721 | 0.746 | 0.727 | 0.682 | **0.750** |
| Math | Qwen3.5-35B / Qwen3-4B | 745 | 0.575 | **0.625** | 0.541 | **0.625** |
| Code | Qwen3.5-122B / 35B | 976 | 0.833 | **0.846** | 0.692 | 0.661 |
| Code | gpt-oss-120B / 20B | 975 | **0.791** | 0.773 | 0.777 | 0.789 |
| Code | Qwen3.5-35B / Qwen3-4B | 988 | 0.739 | **0.752** | 0.741 | 0.729 |

Macro-F1 per pair × domain × format. n is the per-row 4-format intersection of (dataset, question_idx). Bold is the best-performing format per row. Math pools IMO_AnswerBench, UCMO, and OlymMATH_Hard; Code is CodeContests; Logic is ARC-AGI-2.
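The per-row n counts only questions that have a judged transcript in all four formats. A minimal sketch of that intersection, keyed on (dataset, question_idx), is below; the function name and the toy keys are hypothetical and stand in for whatever the real pipeline stores.

```python
from typing import Dict, Set, Tuple

QuestionKey = Tuple[str, int]  # (dataset, question_idx)


def four_format_intersection(judged: Dict[str, Set[QuestionKey]]) -> Set[QuestionKey]:
    """Questions that have a judged transcript in every format."""
    if not judged:
        return set()
    return set.intersection(*judged.values())


# Toy data (made up): question 2 is missing from one format, so it is dropped.
judged = {
    "full_debate": {("UCMO", 0), ("UCMO", 1), ("UCMO", 2)},
    "opening_only_debate": {("UCMO", 0), ("UCMO", 1), ("UCMO", 2)},
    "full_consultancy": {("UCMO", 0), ("UCMO", 1)},
    "opening_only_consultancy": {("UCMO", 0), ("UCMO", 1), ("UCMO", 2)},
}

common = four_format_intersection(judged)
print(len(common))  # 2
```

Scoring every format on the same set of questions keeps the comparison within a row like for like, at the cost of slightly different n across rows.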

On our more capable models, we find that debate and opening-only debate are indistinguishable, and both exceed consultancy and opening-only consultancy. On our less capable models, we find that opening-only consultancy sometimes outperforms the other formats. In manual transcript analysis, we found that full consultancy induces false agreement at a much higher rate than either debate or opening-only consultancy, which has no rebuttals.

Stuff we tried that didn’t work

We list several failed paths for debate below:

1. Dual proposer debate – we abandoned this early on after realizing that debaters have no incentive to engage with each other if each is judged only on final answer quality.

2. Other choices for datasets – we went through a lot of datasets before we settled on our final list of CodeContests+, ARC-AGI-2, OlymMATH-Hard, UCMO, and AnswerBench-IMO. Two main problems with other datasets:

a. Label quality – We found that machine-generated accuracy labels were often wrong, which is a huge problem for binary classification. We dramatically upgraded the quality of labels for the following dataset: AnswerBench-IMO. We abandoned the following due to low answer quality: BigCodeBench, APPS, OMNI-Math.

b. Saturation – We found that our models saturated (up to label accuracy) the following datasets: ZebraLogic, ARC-AGI, Knights and Knaves, HumanEval+.

Broader implications

These are, to our knowledge, the first open results that show strong evidence for debate’s validity as a self-play mechanism in improving reward allocation. We feel that this is meaningful progress in getting some variant of debate to work in the open, though we would like to highlight some caveats.

First, our results are on the more programmatically verifiable domains of math, code, and logic, where there is an easily scorable final answer. We are still unsure as to whether these results will transfer over to fuzzier tasks, which may yield messier argument decompositions. However, tasks within these fields like experiment design and research steering are increasingly relevant as full automation of AI research approaches the frontier of possibility.

Second, we stress that our non-adversarial debate is optimized for inference-time reward allocation, while we are also interested in the training dynamics produced by train-time reward allocation. How optimization pressure will impact debate, especially as we move towards longer RL runs that give debaters more latitude to exploit the protocol, and what characteristics training may elicit from the debaters, are still important open questions.

This makes us excited about a lot of potential directions in AI debate. Our personal next steps are to work on fuzzier tasks that can serve as effective proxies for automation of AI research, study train-time dynamics of different debate protocols, and engineer failures in these protocols to test the theory that debate is built on.