Debating with More Persuasive LLMs Leads to More Truthful Answers

Link post

We’ve just completed a bunch of empirical work on LLM debate, and we’re excited to share the results. If the title of this post is at all interesting to you, we recommend heading straight to the paper. There are a lot of interesting results that are hard to summarize, and we think the paper is quite readable.

If you’re pressed for time, we’ve posted the abstract and our Twitter thread below.

If you’re working on debate or might in future, we especially suggest reading our recommendations for working on debate (below or in Appendix C of the paper).

Abstract

Common methods for aligning large language models (LLMs) with desired behaviour heavily rely on human-labelled data. However, as models grow increasingly sophisticated, they will surpass human expertise, and the role of human evaluation will evolve into non-experts overseeing experts. In anticipation of this, we ask: can weaker models assess the correctness of stronger models? We investigate this question in an analogous setting, where stronger models (experts) possess the necessary information to answer questions and weaker models (non-experts) lack this information. The method we evaluate is debate, where two LLM experts each argue for a different answer, and a non-expert selects the answer. We find that debate consistently helps both non-expert models and humans answer questions, achieving 76% and 88% accuracy respectively (naive baselines obtain 48% and 60%). Furthermore, optimising expert debaters for persuasiveness in an unsupervised manner improves non-expert ability to identify the truth in debates. Our results provide encouraging empirical evidence for the viability of aligning models with debate in the absence of ground truth.

Twitter thread

How can we check LLM outputs in domains where we are not experts?

We find that non-expert humans answer questions better after reading debates between expert LLMs.
Moreover, human judges are more accurate as experts get more persuasive. 📈 https://github.com/ucl-dark/llm_debate/blob/main/paper.pdf


We operationalise experts and non-experts using the setup from @_julianmichael_ @idavidrein, where strong models have access to a comprehension text (experts) and weak models have to judge the answer without the text (non-experts).

We consider debate, where we get two expert models to argue for different answers.
Debates run for three rounds, in which each LLM produces arguments for why its answer is correct, quotes from the text as evidence, and critiques of its opponent’s arguments.


We also explore a “consultancy” baseline, where a single expert advocates for their pre-assigned answer. The hope is the non-expert can identify the right answer despite the information being one-sided.

We find that judges (both LLMs and humans) are more accurate using debate than using consultancy.
We also find using debates nearly closes the gap with expert judges who do have access to the underlying text!


To evaluate whether debate works as models get smarter, we need a way to compare different experts—here we introduce persuasiveness, which measures how often a debater can convince a judge its answer is correct.


To compare different experts, we evaluate how often a judge chooses their answers (“persuasiveness”; following @anshrad et al.).

We evaluate Elo ratings using matches between different debaters (cross-play) instead of copies of the same model (self-play).
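To make the cross-play idea concrete, here is a minimal sketch of fitting Elo ratings to pairwise win counts by gradient ascent on the standard logistic Elo likelihood. The function names, step size, and toy win counts are assumptions for illustration; the paper's own rating procedure is described in its Appendix D and may differ.

```python
import math

def expected_win(r_a, r_b):
    """Standard Elo probability that a player rated r_a beats one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def fit_elo(wins, players, steps=2000, lr=200.0, base=1000.0):
    """Fit ratings to cross-play results.

    wins[(a, b)] is the number of matches in which debater a beat debater b.
    lr and steps are illustrative choices, not tuned values.
    """
    ratings = {p: base for p in players}
    total = sum(wins.values())
    if total == 0:
        return ratings
    for _ in range(steps):
        grad = {p: 0.0 for p in players}
        for (a, b), count in wins.items():
            # Each observed win pushes the winner up by its "surprise" (1 - p).
            p_win = expected_win(ratings[a], ratings[b])
            grad[a] += count * (1.0 - p_win)
            grad[b] -= count * (1.0 - p_win)
        for player in players:
            ratings[player] += lr * grad[player] / total
        # Elo is only defined up to a shift, so keep the mean at `base`.
        mean = sum(ratings.values()) / len(players)
        for player in players:
            ratings[player] += base - mean
    return ratings

# Toy cross-play results between two hypothetical debaters.
toy_ratings = fit_elo({("a", "b"): 75, ("b", "a"): 25}, ["a", "b"])
```

With a 75% win rate, the fitted gap between the two toy debaters approaches 400·log10(3), roughly 191 Elo points.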



We generate tonnes of debaters of varying levels of persuasiveness. The most persuasive debaters are comparatively better at arguing the correct answer relative to incorrect answers.

🔥This suggests arguing for the truth is easier in our domain 🔥

Taking debater Elo ratings (from cross-play), we then run self-play debates and show that more persuasive debaters lead to higher judge accuracy!

We use best-of-N optimisation for debaters against a GPT-4 judge.
This optimisation improves both judge accuracy and Elo ratings on a variety of judge models.
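Best-of-N selection can be sketched in a few lines: sample N candidate arguments and keep the one the scoring function prefers. Here `generate` and `score` are hypothetical callables standing in for a debater model and a judge-based preference score; neither is taken from the released code.

```python
from typing import Callable

def best_of_n(generate: Callable[[], str],
              score: Callable[[str], float],
              n: int) -> str:
    """Sample n candidates and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate() for _ in range(n)]
    # max() returns the first candidate among ties.
    return max(candidates, key=score)

# Toy usage: a canned "debater" and a stand-in scorer (argument length).
_canned = iter(["short", "a longer argument", "mid length"])
toy_choice = best_of_n(lambda: next(_canned), lambda text: float(len(text)), n=3)
```

In a real setup the scorer would query the judge model rather than measure length; the toy scorer is only there to keep the example runnable.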


We conducted a large-scale human study with @hellopareto, with humans judging over 1000 hours of debates using our annotator platform. Excitingly, our results hold when using human judges! In particular, humans get higher accuracy when using debate than consultancy.

All of the human-judged debates can be seen here: https://llm-debate.com


We think there are loads of exciting avenues to explore now, for example,

  1. Work on new domains (code/math/science)

  2. Test debates on reasoning gaps as opposed to our knowledge gaps

  3. See if debate is robust to deceptive models (see AI control or sleeper agents)

Further LLM judge results

We identified three biases in LLM judges that can make debate ineffective without human judges: verbosity bias (favouring longer arguments), positional bias (favouring certain answer positions), and sycophancy (crediting unverified quotes despite warnings). We found that all three could be mitigated.

We found a verbosity bias in LLM judges, favouring longer arguments, as shown by the Elo ratings from debaters using different word limits.

We mitigated this by applying strict word limits with rejection sampling to ensure fairness.


LLM judges also have a positional bias, with accuracy shifting dramatically when the A and B answers are swapped, most notably in less capable models like GPT-3.5-Turbo. We control for the bias by averaging judgements over both positions whenever we run an LLM judge.

LLM judges tended to give too little weight to verified quotes (confirmed to be in the hidden passage, marked with <v_quote>) relative to unverified quotes (marked with <u_quote>). To address this issue, we prompt-engineered the judge to ensure that it appropriately weights verified vs. unverified quotes in its decision.
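As an illustration of how quotes might be marked before a transcript reaches the judge, the sketch below checks each quoted span against the hidden passage and rewrites its tag. The input `<quote>` tag and the whitespace- and case-insensitive matching rule are assumptions for this example; only the `<v_quote>` and `<u_quote>` output tags come from the description above.

```python
import re

# Assumed input format: debaters wrap quotes in <quote>...</quote>.
QUOTE_RE = re.compile(r"<quote>(.*?)</quote>", re.DOTALL)

def normalise(text: str) -> str:
    """Collapse whitespace and lowercase so formatting differences don't matter."""
    return " ".join(text.split()).lower()

def tag_quotes(argument: str, story: str) -> str:
    """Mark each quote as verified (found in the story) or unverified."""
    story_norm = normalise(story)

    def retag(match: re.Match) -> str:
        quote = match.group(1)
        quote_norm = normalise(quote)
        verified = bool(quote_norm) and quote_norm in story_norm
        tag = "v_quote" if verified else "u_quote"
        return f"<{tag}>{quote}</{tag}>"

    return QUOTE_RE.sub(retag, argument)

toy_story = "The ship landed on Mars at dawn."
toy_argument = "It says <quote>landed on Mars</quote>, not <quote>landed on Venus</quote>."
toy_tagged = tag_quotes(toy_argument, toy_story)
```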

Further human judge results

We recruited 21 human judges via @hellopareto to judge debates between LLMs about unseen science fiction stories.

Human judges achieved 88% accuracy when judging debates from GPT-4 Turbo, compared to 60% naive accuracy and 76% accuracy in our non-adversarial consultancy baseline. Their confidences are also better calibrated.


We explored different aggregation schemes which improved accuracy across all protocols for using expert LLMs to aid human judges. In particular, we found that humans can achieve accuracy up to 98% by ensemble voting in static debate!
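One simple form of ensembling is a majority vote over several judges' answers to the same question. The sketch below assumes that form; the aggregation schemes used in the paper may be different (for example, they could weight votes by confidence), and the data layout here is hypothetical.

```python
from collections import Counter
from typing import Dict, List

def majority_vote(votes: List[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(votes).most_common(1)[0][0]

def ensemble_accuracy(votes_by_question: Dict[str, List[str]],
                      correct: Dict[str, str]) -> float:
    """Fraction of questions where the majority answer is the correct one."""
    hits = sum(
        majority_vote(votes) == correct[question]
        for question, votes in votes_by_question.items()
    )
    return hits / len(votes_by_question)

# Toy data: three judges per question.
toy_votes = {"q1": ["A", "A", "B"], "q2": ["B", "A", "B"]}
toy_accuracy = ensemble_accuracy(toy_votes, {"q1": "A", "q2": "A"})
```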

We manually checked debates and found that two-thirds of incorrect debate judgements are caused by the correct debater making bad quote choices. This suggests that even our best debater models are still not optimal.

As with LLM judges, we find that human-judged debates with lower-Elo debater models result in lower judge accuracy. So the “Debater Persuasiveness → Judge Accuracy” relationship also seems to hold for humans!

Overall, we collect 6476 human judgements across a variety of experimental settings and debaters.

We release these as a dataset along with our code at https://github.com/ucl-dark/llm_debate. We also have a nice UI for exploring the debate transcripts at https://llm-debate.com/.

Overall, this work demonstrates that:


  • In our simulated expertise-gap setting, it is easier to argue for correct answers than for incorrect answers.

  • This advantage gets larger as models are optimised to be more persuasive.

  • Both human and LLM judges can score higher accuracy with debate than other strategies.

Recommendations for working on debate

Dataset recommendations

We recommend QuALITY as a dataset to test debate. It is well suited to this primarily because it lets you establish information asymmetry and because self-defeating behaviour is rare among LLM debaters. The latter is largely attributed to its questions, for which either side is easier to argue because debaters can select quotes from lengthy narratives.

Advantages of QuALITY:

  • QuALITY’s stories average 7k tokens in length and are used to create the asymmetry between the debaters and judges. Each story provides plenty of material for debaters to analyse and discuss over three debate rounds. This breadth of context enables the strategic use of quotes, potentially in a deceptive manner.

  • Extensive metadata accompanying each question is available, such as the proportion of dataset annotators agreeing that the question is answerable. Filtering for high answerability ensures the reliability of the question set for debate experiments, safeguarding against debates on unresolvable topics.

  • An option to use a highly plausible but incorrect answer option known as the “best distractor”. Using the best distractor reduces the likelihood that the debater assigned to that answer uses self-defeating behaviour (such as conceding to the opposing side). One reason for the reduced likelihood is that there are many quotes that can make it seem like the best distractor answer is true, so we avoid the issue of RLHF-trained models refusing to argue for a falsehood.

Many other datasets, such as MMLU and AQuA, are less suitable to test debate with current models because it can be very challenging to craft a proof for the incorrect answer. If unable to formulate a convincing argument, the LLM debater might often concede or acknowledge subtle errors introduced in their solution to align with the assigned answer. Self-defeating behaviour is not a surprise, given that RLHF training promotes honesty. RLHF training moderately promotes answers that look good to humans, which can often be linked with correctness, but in debate, we need to generate plausible arguments for incorrect answers. Our attempts to mitigate self-defeating behaviour via prompt engineering were unsuccessful. We believe helpful-only models would fare better as debaters in these datasets. We fine-tuned GPT-3.5-Turbo for debate but found that it could not produce compelling debates or adhere to intricate prompts. We’re excited for future work fine-tuning GPT-4-Turbo for debate.

Iteration recommendations

Optimise for Elo rating rather than judge accuracy. Ultimately, we want debate to improve the judge’s accuracy on a task. However, it is important not to optimise for this directly when improving debaters. In competitive debate settings, rewarding debaters for persuasiveness to the judge is crucial for developing the skills required to win, making Elo a great metric. The calculated Elo should consider the aggregate win rate when the debater gets a chance to argue for both sides of the argument. Therefore, the debater with the best Elo will be the most persuasive when arguing for correct and incorrect answers. Elo also has the advantage of being an unsupervised metric. An example of where optimising for accuracy can go wrong is when debaters have self-defeating behaviour. Incorrect debaters are much more likely to concede, leading to inflated accuracy since the judge can easily choose the correct answer.

Test prompt improvements against each base model family. Testing each prompt is tedious, but we found small prompt changes can lead to unexpected behaviour. For instance, we reworded a small portion of our consultant prompt, and an extra 3% of the total questions became affected by concessions. Furthermore, it is important to test with multiple LLMs since we found that some prompt changes did not transfer from the family of Claude models to GPT models.

Try out interactive judging yourself while iterating on protocols. We used the same annotation platform provided in our human trial as a tool to read transcripts and judge debate questions interactively. We learned a lot about our debaters’ argument quality when judging questions. Acting as the judge without knowing the answer allows you to learn how persuasive the incorrect debater is compared to the correct one. It also allows you to find self-defeating behaviour and subsequently change the prompt to stop it from happening. Duplicate quoting, lack of engagement with the opponent or interactive judge, and inability to use quote tags correctly were other examples of failure modes we quickly picked up on by reading lots of transcripts.

Implementation recommendations

Use a Swiss-style tournament when calculating Elos for multiple debaters. Running tournaments between debaters is compute-intensive, especially if running a round-robin tournament where each debater variant plays each other. Implementing a Swiss-style tournament was vital for fast iteration and reducing costs. A Swiss tournament assumes you start with an initial seed ranking of your players, then make pairings based on the nearest neighbour in the list. This list is then sorted after each round based on the total number of matchups won. This means it has O(n log n) complexity rather than O(n^2). A helpful framing is that the tournament implements an adaptive sorting of the initial list to ultimately find the true final ranking of the players. Once the tournament is complete, we use the fine-grained win rates over the question set the debaters used for each match-up. See how we calculate Elo ratings for each player in Appendix D in the paper.
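The pairing-and-resorting loop described above can be sketched as follows. This is a simplified illustration, not the released implementation: `play_match` is a hypothetical callback that returns the winner of a match-up, the default number of rounds (about log2 of the number of players) is an assumption, rematches are not prevented, and with an odd number of players the last-ranked one sits out each round.

```python
import math
from typing import Callable, List, Optional, Tuple

def swiss_tournament(
    seed: List[str],
    play_match: Callable[[str, str], str],
    rounds: Optional[int] = None,
) -> Tuple[List[str], List[Tuple[str, str, str]]]:
    """Pair neighbours in the current ranking, then re-sort by total wins."""
    ranking = list(seed)
    results: List[Tuple[str, str, str]] = []
    if len(ranking) < 2:
        return ranking, results
    wins = {player: 0 for player in ranking}
    if rounds is None:
        rounds = math.ceil(math.log2(len(ranking)))
    for _ in range(rounds):
        for i in range(0, len(ranking) - 1, 2):
            a, b = ranking[i], ranking[i + 1]
            winner = play_match(a, b)
            wins[winner] += 1
            results.append((a, b, winner))
        # sort() is stable, so tied players keep their previous relative order.
        ranking.sort(key=lambda player: -wins[player])
    return ranking, results

# Toy usage: hidden "strengths" decide each match deterministically.
_strength = {"a": 4, "b": 3, "c": 2, "d": 1}
toy_ranking, toy_results = swiss_tournament(
    ["d", "b", "a", "c"],
    lambda x, y: x if _strength[x] > _strength[y] else y,
)
```

Each round plays about n/2 matches, so the total is roughly (n/2)·log2(n) match-ups instead of the n(n-1)/2 needed for a round-robin.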

Judge debates with both answer permutations to mitigate LLM positional bias. Positional bias is evident when LLM judges achieve different accuracy levels if answers A and B are swapped. It is especially common in less capable models such as GPT-3.5-Turbo. This bias is less pronounced but still present in more sophisticated models like GPT-4-Turbo. To mitigate this, we encourage running judgements with both permutations of answer position for every debate transcript.

Apply strict word limit filters or truncation to mitigate verbosity bias in LLM judges. Verbosity bias is when the judge favours longer arguments, which we found to be the case with a significant rise in Elo rating as the argument word count increased. Decoupling this bias and argument quality is difficult, and we found qualitatively that this preference can falsely attribute skill to lengthier arguments. To mitigate this, we recommend using a strict word limit with rejection sampling (see Appendix D in the paper).
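Rejection sampling against a word limit might look like the sketch below: keep resampling until an argument falls inside the allowed range. The `generate` callable, the attempt budget, and the truncation fallback are assumptions made for this example; see Appendix D of the paper for the procedure actually used.

```python
from typing import Callable

def sample_within_word_limit(
    generate: Callable[[], str],
    max_words: int,
    min_words: int = 0,
    max_attempts: int = 10,
) -> str:
    """Resample until the argument's word count is within [min_words, max_words]."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    candidate = ""
    for _ in range(max_attempts):
        candidate = generate()
        word_count = len(candidate.split())
        if min_words <= word_count <= max_words:
            return candidate
    # Assumed fallback: truncate the last sample rather than fail outright.
    return " ".join(candidate.split()[:max_words])

# Toy usage: the first sample is too long, the second fits.
_samples = iter(["one two three four five", "one two"])
toy_argument = sample_within_word_limit(lambda: next(_samples), max_words=3)
```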

LLMs are poor at deliberation, so use at least GPT-4-Turbo capability for a judging model. Through our experience of protocol iteration, we often found confusing results after judging with GPT-3.5-Turbo and Claude, particularly with a smaller sample size. Weaker judges are more prone to being misled by trivial aspects of one side of an argument. GPT-4's superiority as a judge is evidenced by a larger area under the curve for debaters with a range of Elo ratings in the scatter plot showing the different judges above.

Use the same prompts for both the correct and incorrect debater. It might be tempting to use a different prompt for the incorrect debater, perhaps explaining deceptive strategies to improve performance. There are several reasons why this is ill-advised:

  • Admitting to the model that it is arguing the incorrect answer or it should use deceptive strategies is more likely to lead to self-defeating behaviour in current RLHF-based models, thus invalidating the debate. If a model gives up every time on the incorrect side, the judge accuracy is 100%, but the protocol is useless when you don’t have labels and the model does not know the correct answer.

  • The behaviour between debaters could be subtly different, leading to information leakage to the judge.

  • In a reinforcement learning setup where debaters are rewarded for convincing the judge, the copies of the model should operate with identical prompts.

  • At inference time, you may not have labels for the correctness of the answer choices, so you would not know which answer to assign to which debater.

Read the whole paper here: https://github.com/ucl-dark/llm_debate/blob/main/paper.pdf.