David Africa comments on Suggestions for improving debate protocols in AI safety

David Africa 29 May 2026 11:23 UTC
6 points
3
I used to do a bunch of British, Asian, and Australasian parliamentary debate. I both debated and judged, winning the biggest debate tournament in the world and chairing the open finals as a judge the following year. My impression, both from these formats and from the policy debaters (and debaters in other bespoke, torturous US formats) I’ve talked to, is that such formats, WRT scalable oversight via debate (as opposed to the inference-time multi-agent debate setup), are more insightful about debate’s (empirical) failure modes/missing gaps than about its potential.
One good example of this is judge biases (h/t Kozzy Voudouris and Simon Marshall); if you anticipate using the debate setup as a training env, then there is a good analogy between this setup and meta-trends in debating, where people end up using buzzwords like “structural reasons” incorrectly or speaking very quickly because judges reward it. From a few experiments I have run on this myself, a similar thing seems to happen in AI debate.
Another example of this is speaker roles (Joan mentioned this to me). AI debaters have a hard time taking on the “incentives” of being a debater and arguing in an impassioned, aggressive way. They tend to be too sycophantic and eager to agree. But training in the persona of an aggressive debater might be (1) undesirable, as it could leak bad propensities, (2) not straightforward, as there are many ways of doing this, such as SDF or character training, and (3) liable to lead to judge hacking, if misspecified. And of course in human debating, anecdotally, many people have become quite unpleasant after extended periods of debating.
- tr5tn 29 May 2026 11:47 UTC
  1 point
  0
  Parent
  @David Africa thanks! Many of those points are certainly worth focusing on. For what it’s worth, I was also an awarded speaker in the Model UN, but I found that format to be far more arbitrary, susceptible to being gamed by speaking skill and rhetoric, and IMO less likely to arrive at something desirable (I led an uprising of militarised third-party countries to vote down all disarmament proposals).
  Ultimately, the actual plans, counterplans, kritiks and topicality discussions in policy debate are ridiculous. Every debater I’ve met would acknowledge that. And ultimately, it is a game, so IMO that is to be expected. So I am certainly not agitating for AI Safety outcomes that resemble policy debate verdicts, but I think the game itself is a good reference, since one of the current problems in AI Safety debate protocols is that they are being gamed. Distinctions between Constructives, Rebuttals and Cross-Examination are really fundamental to policy debate, and we get similar constructs in legal proceedings.
  I’m conscious this is all one American reference and I’m not offering empirical findings, but I do think the field should consider these games as reference protocols (as well as cross-cultural legal rules) since they are a result of refinement over decades. They are practices that already exist.
  --
  Edited to add: not wishing to be dismissive of your empirical findings. I’d love to read more about the difficulties with training or inference-time persona adoption, but also, I don’t know that current negative findings should preclude focus on those problems.
  - David Africa 29 May 2026 11:58 UTC
    6 points
    1
    Parent
    I also think MUN is bad.
    I think you should mentally model the use case of AI safety debate protocols as being high-stakes settings, and expect that, after applying tons of optimisation pressure, AI debates will look very different from human debates in the limit. SO debate protocols in particular try to take advantage of self-play, which you can’t do well with humans, so introducing asymmetry through additional roles and rules may make them hard to reason about theoretically (and also possibly weaken the benefit of self-play). So any such additions would need to be pretty well motivated.
    - tr5tn 29 May 2026 12:16 UTC
      1 point
      0
      Parent
      I take the point that in the self-play context this could drift off-course! I suppose (linking this back to the MATS research) I’m suggesting it would be good to measure that beside a more naïve protocol.