Yeah I would recommend the newer one; the older one tests a dual proposer format that I don’t think is structured to reward interaction, and the datasets that we tested are either broken (BigCodeBench) or poorly calibrated to model strength (ARC AGI 2).
ethanelasky
We’ll run these experiments in the next few days and update the preprint accordingly. We think that your 2) point is more likely to be correct, although we’ll see what the results say.
Tl;dr: you’re right that our form of debate is not adversarial and is in line with Zac’s collaborative debate, although we detail a reward allocation strategy below that might make this game work. We want to test both zero-sum and mixed-sum debate under optimization pressure, although this might be expensive. We take your point about macro-F1 but find that the debate vs. consultancy effects are mostly not due to differences in classifier bias and instead are genuine differences between the two (with support from another metric called Youden’s J). Sorry this got so long; a lot of this will end up as updates in the preprint and our responses are pretty thorough.
hmm, this doesn’t sound like adversarial debate, it just sounds like a scaffolded grader… I’m curious what the logic was here. I understand that open consultancy seems like a more realistic baseline in some ways but I think for scalable oversight we really do care about and want to test the adversarial setting. (I might disagree with Zac on this as well.)
Hi Julian, thanks for the comment—we test a setting closer to Zac’s, which is a mixed-sum, non-adversarial (i.e. collaborative) debate game where the baseline is single open consultancy, and there is no stipulation of a 50⁄50 prior, so direct QA accuracy on tasks can range from 10% to 75% (see Table 1 in our preprint). Our goal is to do RL on both this setting and the zero-sum setting that you and e.g. Khan et al. 2024 and Arnesen et al. 2024 worked on. The mixed-sum game still awards positive and negative reward when the debaters disagree but does not allocate reward when proposer and critic agree, except when they are both wrong, in which case it is negative reward for both.
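To make the mixed-sum allocation above concrete, here is a minimal sketch. The function name, the unit reward magnitude, and the choice that the correct side gets the positive reward on disagreement are all assumptions for illustration; the comment only says that reward is positive and negative on disagreement, zero on agreement, and negative for both when both are wrong.

```python
from typing import Tuple


def mixed_sum_reward(
    proposer_correct: bool,
    critic_agrees: bool,
    magnitude: float = 1.0,  # hypothetical reward size, not from the preprint
) -> Tuple[float, float]:
    """Return (proposer_reward, critic_reward) for one proposer-critic exchange."""
    if critic_agrees:
        if proposer_correct:
            # Agreement on a correct answer: no reward is allocated.
            return (0.0, 0.0)
        # Agreement on a wrong answer: both are penalized.
        return (-magnitude, -magnitude)
    if proposer_correct:
        # Disagreement, proposer was right (assumed: right side is rewarded).
        return (magnitude, -magnitude)
    # Disagreement, critic was right to object.
    return (-magnitude, magnitude)


for pc in (True, False):
    for ca in (True, False):
        print(f"proposer_correct={pc}, critic_agrees={ca} -> {mixed_sum_reward(pc, ca)}")
```

The game is zero-sum only in the two disagreement cells; the agreement cells are what make it mixed-sum.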
apologies if I’m missing something but what do you mean by F1 in this case? do you mean positive = proposer / first speaker is correct, and negative = proposer / first speaker is incorrect? in prior work we just averaged between orderings in debate and used accuracy. consultancy and debate should both have a 50⁄50 prior so it’s fine. since we expect the model to agree with the proposer more in consultancy, it should bias towards positives. I worry that F1 could bias the results towards debate in this case, because F1 favors a classifier that’s more balanced, all else equal. Say it’s 50% positive in debate and 90% positive in consultancy but otherwise random. 50% positive random guesses = .5 F1, but 90% positive random guesses = .18 F1. Debate looks way better but actually both are random.
Our form of collaborative debate does not have the 50⁄50 prior and our stronger proposer accuracy scores lie between 55-78%. Since consultancy has a strong agreement bias, we think that accuracy underweights debate and overweights consultancy. Instead, we chose macro-F1, or the average between positive- and negative-class F1, because we were worried about bias towards consultancy when there is a positive-class imbalance. For example, in Opus 4.6/4.5 consultancy, the judge always agrees with the consultant, which would lead to an accuracy of 60.8%, while our macro-F1 score is .378.
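As a quick check of the always-agree example, here is a small sketch that computes accuracy and macro-F1 from confusion counts. The counts (608 correct and 392 incorrect consultant answers out of a hypothetical 1,000 items) are made up to match the 60.8% rate, and scoring an empty class as F1 = 0 is an assumed convention.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 for one class; defined as 0 when the class has no hits at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(tp: int, fp: int, tn: int, fn: int) -> float:
    """Unweighted mean of positive-class and negative-class F1."""
    f1_pos = f1(tp, fp, fn)
    # For the negative class, true negatives play the role of true positives,
    # and the roles of fp and fn swap.
    f1_neg = f1(tn, fn, fp)
    return (f1_pos + f1_neg) / 2


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    return (tp + tn) / (tp + fp + tn + fn)


# Judge always says "consultant is correct"; consultant is right 60.8% of the time.
tp, fp, tn, fn = 608, 392, 0, 0
print(round(accuracy(tp, fp, tn, fn), 3))  # 0.608
print(round(macro_f1(tp, fp, tn, fn), 3))  # 0.378
```

The negative-class F1 is 0 because the judge never predicts "incorrect", which is what pulls macro-F1 down to .378 while accuracy stays at the base rate.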
We take your point that macro-F1 does depend on positivity rate, and that our results could just be due to differences in judge classification bias. Thanks for helping us notice this—this is a problem with the preprint as it exists now and we’ll be adding some lines to fix it. We think that these effects are not due to chance, as we (a) recalculated F1 using random classifiers with the same precision and recall as our debate and consultancy judges and found that debate still outperforms consultancy, and (b) recalculated with Youden’s J, which is a binary classification metric (Fable’s suggestion initially, though I found this paper helpful in understanding it).
To check whether our macro-F1 results are just due to differences in positivity rate between debate judge and consultancy judge, we can run a random classifier with the same precision and recall as the judge for each of our settings (p in the below table) to see if it’s actually due to chance or not.
CodeContests+ Qwen 122B/35B (proposer accuracy = 73.7%):
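| Format | p | chance F1 (positive class) | chance F1 (negative class) | chance macro-F1 (mean of the two) |
|---|---|---|---|---|
| Debate | 0.790 | 2(.737)(.790)/1.527 = .763 | 2(.263)(.210)/.473 = .234 | 0.498 |
| Consultancy | 0.864 | 2(.737)(.864)/1.601 = .795 | 2(.263)(.136)/.399 = .179 | 0.487 |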
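The chance numbers in the table can be reproduced with a few lines. This is a sketch that reads p as the rate at which a random classifier says "positive" (which is what the arithmetic in the table is consistent with), so its expected precision is the base rate and its expected recall is p; the function name is hypothetical.

```python
from typing import Tuple


def chance_macro_f1(base_rate: float, p: float) -> Tuple[float, float, float]:
    """Expected (F1+, F1-, macro-F1) for a random classifier.

    base_rate: fraction of items whose true label is positive (0.737 here).
    p: probability that the random classifier predicts positive.
    Assumes neither denominator is zero.
    """
    # Positive class: precision = base_rate, recall = p.
    f1_pos = 2 * base_rate * p / (base_rate + p)
    # Negative class: precision = 1 - base_rate, recall = 1 - p.
    f1_neg = 2 * (1 - base_rate) * (1 - p) / ((1 - base_rate) + (1 - p))
    return f1_pos, f1_neg, (f1_pos + f1_neg) / 2


for name, p in [("Debate", 0.790), ("Consultancy", 0.864)]:
    f1_pos, f1_neg, macro = chance_macro_f1(0.737, p)
    print(f"{name}: F1+={f1_pos:.3f} F1-={f1_neg:.3f} macro={macro:.3f}")
```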
The gain for random classifiers for debate vs consultancy would be 0.498 − 0.487 = 0.011 (about 1 pp), but we find a larger pp gap between the two formats, so the gain can’t just be due to chance.

We can also recompute the stats using Youden’s J, which is used to measure diagnostics where we care about both positive-class and negative-class judgments, and which is 0 for random classifiers. We find that debate vs consultancy calculated using Youden’s J is significant in all five responder cells (the two frontier model ARC AGI ones, as well as the one code and two math responders), and will add this statistic to our paper because it answers this question well.
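For reference, Youden’s J is sensitivity + specificity − 1, which equals 2 × balanced accuracy − 1. A minimal sketch with hypothetical confusion counts (not numbers from the preprint), showing that a judge who always agrees scores J = 0 regardless of the base rate:

```python
def youdens_j(tp: int, fp: int, tn: int, fn: int) -> float:
    """Youden's J = sensitivity + specificity - 1 (0 for chance, 1 for perfect)."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity + specificity - 1


def balanced_accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Mean of sensitivity and specificity; J = 2 * balanced_accuracy - 1."""
    return (tp / (tp + fn) + tn / (tn + fp)) / 2


# Always-agree judge on 1,000 hypothetical items with a 60.8% positive base rate.
print(youdens_j(608, 392, 0, 0))          # 0.0
print(balanced_accuracy(608, 392, 0, 0))  # 0.5
```

Because J does not move with the rate of positive predictions alone, it sidesteps the positivity-rate concern raised about macro-F1.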
Actual accuracy numbers:
(Deltas: debate − consultancy, paired bootstrap 95% CI, in pp)

| Domain | Model | ΔAccuracy | ΔBalanced accuracy |
|---|---|---|---|
| Logic | Gemini | +10.8 [+5.0, +17.5] p=.001 | +14.0 [+6.3, +22.1] p=.001 |
| Logic | Opus | +10.9 [+4.2, +17.6] p=.001 | +13.1 [+4.8, +21.2] p=.001 |
| Math | Qwen 122B/35B | +2.5 [+1.0, +4.1] p=.001 | +6.3 [+3.2, +9.5] p<.0005 |
| Math | Qwen 35B/4B | +1.5 [−0.5, +3.4] p=.115 | +2.3 [−0.8, +5.2] p=.142 |
| Math | gpt-oss | +5.0 [+1.8, +8.1] p=.001 | +5.8 [+2.6, +9.2] p<.0005 |
| Code | Qwen 122B/35B | +7.6 [+5.2, +9.7] p<.0005 | +14.3 [+10.8, +17.5] p<.0005 |
| Code | Qwen 35B/4B | +0.6 [−2.2, +3.4] p=.684 | −0.7 [−3.9, +2.4] p=.632 |
| Code | gpt-oss | +1.3 [−0.4, +3.1] p=.117 | +1.1 [−1.1, +3.1] p=.348 |
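For readers unfamiliar with the paired bootstrap, here is a rough sketch of how a delta and its 95% interval could be computed for accuracy. It assumes per-item 0/1 correctness for both formats on the same items and a simple percentile interval; the exact procedure behind the table above may differ, and the function name and toy data are hypothetical.

```python
import random
from typing import List, Tuple


def paired_bootstrap_delta(
    debate_correct: List[int],
    consult_correct: List[int],
    n_boot: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (delta, ci_low, ci_high) in percentage points.

    Items are resampled with replacement, keeping each item's debate and
    consultancy outcomes paired.
    """
    if len(debate_correct) != len(consult_correct):
        raise ValueError("both formats must be scored on the same items")
    rng = random.Random(seed)
    n = len(debate_correct)
    diffs = [d - c for d, c in zip(debate_correct, consult_correct)]
    point = 100 * sum(diffs) / n
    boot = []
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        boot.append(100 * sum(sample) / n)
    boot.sort()
    ci_low = boot[int(0.025 * n_boot)]
    ci_high = boot[int(0.975 * n_boot) - 1]
    return point, ci_low, ci_high


# Toy data only: 1 = judge verdict correct on that item, 0 = incorrect.
debate = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
consult = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
print(paired_bootstrap_delta(debate, consult))
```

Balanced accuracy would need the per-class rates recomputed inside each resample rather than a mean of per-item differences.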
Chiming in because I saw @David Africa’s comment here and I thought I should give a response as well, as a former debater who now works on debate for AI safety:
My quals: I did debate in high school and was ranked highly in Lincoln-Douglas, where we used college policy debate rounds for practice, and wrote lddebatebook.com, where I walk through a lot of the interesting things that optimization pressure on human LD debate has converged towards.
I’m happy you’re interested in debate and agree with you and David that we can learn from human debate, but I don’t think further exploration of human debate protocols is that useful, and agree with David that it is more informative of potential failure modes, for a few reasons:
1) Theoretical work on debate as it exists (e.g. Irving et al. 2018, Brown-Cohen et al. 2023, Brown-Cohen et al. 2025, Irving and Marshall et al. 2025, can name more if you want) is largely concerned with debate as an efficient way to explore argument trees, which are often so large that they are difficult to enumerate completely (cf. human debate, where humans need to make their entire argument in the constructive). The reason for this is complexity-theoretic: limiting debate formats to full-text constructives constrains the scope of problems to the complexity class MA, as Beth Barnes notes in a comment on Barnes and Christiano et al. 2020, although she also notes that this kind of debate is what we might see in practice at first. This is a large conceptual update for many people who have any exposure to human debate and who expect their past experience to transfer over, and is something I’ve only come around to more recently.
2) The formats subject to the most optimization pressure (policy debate and Lincoln-Douglas debate) have degenerated into argumentative games whose only rules are the existence of two debaters labeled Affirmative and Negative, the speech times, and the pre-announced resolution (which often isn’t even debated due to weird optimization pressure). That said, I agree with David that we can learn about some of the failures of empirical debate using human debate as an analogue, and that lots of weird optimization pressure transforms these activities into formats that are hard for laypeople to recognize (even though I personally believe policy debate is good for subject formation and personally found it much more fun and engaging than lay debate).
3) This isn’t to say that theoretical debate protocols aren’t worth further exploring — we are still far from debate as a perfect theory, let alone a perfect empirical alignment strategy, and if you have background in complexity theory you might enjoy thinking about this problem from that lens.
One other point from the video that I think you would agree with is concession by one debater to the other; this could be possible but doesn’t make sense within the P-C-D protocol without some reward for concessions.
Sandbagging in the last speech is a problem in human debate, and its development is somewhat predictable in the RL process absent mitigations like:
(1) If you use the tree-style debate, where cases are built as the tree is explored, you can let either debater be willing to pay some reward for another round, and randomize payment from the two if both want to pay (“recursion payments” in Barnes and Christiano 2020). This means that if Bob tries something egregiously wrong, Alice can pay a bit of reward to explain why Bob is wrong to prevent this sandbagging.
(2) If you use free-text debate, you can stipulate that each side should lay their side out completely. Then, in combination with the “each side must lay out their side completely” rule, you can stipulate that arguments have to be either made in the first speech or be made directly in response to a new development. So the critic cannot say “here’s a step where everything falls apart” in the last speech if they had the opportunity to contest it previously, and you can prompt/train the judge to enforce this. Maybe this is possible in the tree search as well (an argument made must contest the leaf node; arguments cannot contest the leaf by implication (e.g. by contesting a node further up in the tree)).
The last mover bias can additionally be somewhat rectified by making Alice and Bob speak simultaneously from the debater’s perspective. This could still encourage both sides to sandbag arguments until the last speech if not combined with the above but does fix informational side bias. Obviously this leaves somewhat of a chronological last-mover advantage but overall seems good.
I also agree with both you and David that MUN is really bad.
My debate work up to now has been focused on zero-shot inference debate, but after reflection and good chats with a lot of the UK AISI people as well as some others in the AI space, I think it’s probably best to focus on potential failure modes stemming from self-play and optimization pressure.
Hey, just an FYI—The TinyUrl link is broken.
Some discussion on twitter about whether that article is written by a LARPer here https://x.com/anpaure/status/2066563480539590934?s=20