Debate (AI safety technique)

Tag. Last edit: 6 Feb 2023 0:35 UTC by plex

Debate is a proposed technique for allowing human evaluators to get correct and helpful answers from experts, even if the evaluators are not themselves experts or able to fully verify the answers.^[1] The technique was suggested as part of an approach to building advanced AI systems that are aligned with human values, and to safely applying machine learning techniques to problems that are high-stakes but not well-defined (such as advancing science or increasing a company’s revenue).^[2]^[3]

Briefly thinking through some analogs of debate

Eli Tyre

11 Sep 2022 12:02 UTC

20 points

3 comments · 4 min read · LW link

Writeup: Progress on AI Safety via Debate

Beth Barnes and paulfchristiano

5 Feb 2020 21:04 UTC

103 points

18 comments · 33 min read · LW link

A guide to Iterated Amplification & Debate

Rafael Harth

15 Nov 2020 17:14 UTC

75 points

12 comments · 15 min read · LW link

[Question] How should AI debate be judged?

abramdemski

15 Jul 2020 22:20 UTC

49 points

26 comments · 6 min read · LW link

AI Safety via Debate

ESRogs

5 May 2018 2:11 UTC

27 points

14 comments · 1 min read · LW link

(blog.openai.com)

Debate update: Obfuscated arguments problem

Beth Barnes

23 Dec 2020 3:24 UTC

138 points

24 comments · 16 min read · LW link

An alignment safety case sketch based on debate

Marie_DB, Jacob Pfau, Benjamin Hilton and Geoffrey Irving

8 May 2025 15:02 UTC

57 points

21 comments · 25 min read · LW link

(arxiv.org)

Thoughts on AI Safety via Debate

Vaniver

9 May 2018 19:46 UTC

35 points

12 comments · 6 min read · LW link

Optimal play in human-judged Debate usually won’t answer your question

Joe Collman

27 Jan 2021 7:34 UTC

33 points

12 comments · 12 min read · LW link

Clarifying Factored Cognition

Rafael Harth

13 Dec 2020 20:02 UTC

23 points

2 comments · 3 min read · LW link

When the Smarter AI Lies Better: Can Debate-Based Oversight Catch Deceptive Code

oskarkraak

6 Jul 2025 1:21 UTC

4 points

0 comments · 5 min read · LW link

(oskarkraak.com)

Synthesizing amplification and debate

evhub

5 Feb 2020 22:53 UTC

33 points

10 comments · 4 min read · LW link

AXRP Episode 6 - Debate and Imitative Generalization with Beth Barnes

DanielFilan

8 Apr 2021 21:20 UTC

26 points

3 comments · 60 min read · LW link

Rejecting Violence as an AI Safety Strategy

James_Miller

22 Sep 2025 16:34 UTC

63 points

5 comments · 3 min read · LW link

FC final: Can Factored Cognition schemes scale?

Rafael Harth

24 Jan 2021 22:18 UTC

17 points

0 comments · 17 min read · LW link

The limits of AI safety via debate

Marius Hobbhahn

10 May 2022 13:33 UTC

36 points

8 comments · 10 min read · LW link

Take 9: No, RLHF/IDA/debate doesn’t solve outer alignment.

Charlie Steiner

12 Dec 2022 11:51 UTC

33 points

13 comments · 2 min read · LW link

An overview of 11 proposals for building safe advanced AI

evhub

29 May 2020 20:38 UTC

221 points

37 comments · 38 min read · LW link · 2 reviews

Thoughts on “AI safety via debate”

Gordon Seidoh Worley

10 May 2018 0:44 UTC

12 points

4 comments · 5 min read · LW link

Looking for adversarial collaborators to test our Debate protocol

Beth Barnes

19 Aug 2020 3:15 UTC

52 points

5 comments · 1 min read · LW link

Reward button alignment

Steven Byrnes

22 May 2025 17:36 UTC

50 points

15 comments · 12 min read · LW link

Three mental images from thinking about AGI debate & corrigibility

Steven Byrnes

3 Aug 2020 14:29 UTC

55 points

35 comments · 4 min read · LW link

Deception Chess: Game #2

Zane

29 Nov 2023 2:43 UTC

29 points

17 comments · 2 min read · LW link

AXRP Episode 16 - Preparing for Debate AI with Geoffrey Irving

DanielFilan

1 Jul 2022 22:20 UTC

20 points

0 comments · 37 min read · LW link

Why I’m excited about Debate

Richard_Ngo

15 Jan 2021 23:37 UTC

75 points

12 comments · 7 min read · LW link

[New LW Feature] “Debates”

Ruby, RobertM, GPT-4 and Claude+

1 Apr 2023 7:00 UTC

121 points

35 comments · 1 min read · LW link

A Small Negative Result on Debate

Sam Bowman

12 Apr 2022 18:19 UTC

42 points

11 comments · 1 min read · LW link

Imitative Generalisation (AKA ‘Learning the Prior’)

Beth Barnes

10 Jan 2021 0:30 UTC

107 points

15 comments · 11 min read · LW link · 1 review

My Overview of the AI Alignment Landscape: A Bird’s Eye View

Neel Nanda

15 Dec 2021 23:44 UTC

127 points

9 comments · 15 min read · LW link

Traversing a Cognition Space

Rafael Harth

7 Dec 2020 18:32 UTC

17 points

5 comments · 12 min read · LW link

Comparing AI Alignment Approaches to Minimize False Positive Risk

Gordon Seidoh Worley

30 Jun 2020 19:34 UTC

5 points

0 comments · 9 min read · LW link

Splitting Debate up into Two Subsystems

Nandi

3 Jul 2020 20:11 UTC

13 points

5 comments · 4 min read · LW link

Why I’m not working on {debate, RRM, ELK, natural abstractions}

Steven Byrnes

10 Feb 2023 19:22 UTC

74 points

19 comments · 10 min read · LW link

Idealized Factored Cognition

Rafael Harth

30 Nov 2020 18:49 UTC

34 points

6 comments · 11 min read · LW link

Inference-Only Debate Experiments Using Math Problems

Arjun Panickssery, Abhimanyu Pallavi Sudhir and JacksonKaunismaa

6 Aug 2024 17:44 UTC

31 points

0 comments · 2 min read · LW link

Problems with AI debate

Stuart_Armstrong

26 Aug 2019 19:21 UTC

21 points

3 comments · 5 min read · LW link

Questions about Value Lock-in, Paternalism, and Empowerment

Sam F. Brown

16 Nov 2022 15:33 UTC

13 points

2 comments · 12 min read · LW link

(sambrown.eu)

Topological Debate Framework

lunatic_at_large

16 Jan 2025 17:19 UTC

10 points

5 comments · 9 min read · LW link

Dodging systematic human errors in scalable oversight

Geoffrey Irving

14 May 2025 15:19 UTC

33 points

3 comments · 4 min read · LW link

Debate, Oracles, and Obfuscated Arguments

Jonah Brown-Cohen and Geoffrey Irving

20 Jun 2024 23:14 UTC

44 points

4 comments · 21 min read · LW link

Debate Minus Factored Cognition

abramdemski

29 Dec 2020 22:59 UTC

37 points

42 comments · 11 min read · LW link

Notes on OpenAI’s alignment plan

Alex Flint

8 Dec 2022 19:13 UTC

40 points

5 comments · 7 min read · LW link

New paper: (When) is Truth-telling Favored in AI debate?

VojtaKovarik

26 Dec 2019 19:59 UTC

32 points

7 comments · 5 min read · LW link

(medium.com)

Control Vectors as Dispositional Traits

Gianluca Calcagni

23 Jun 2024 21:34 UTC

11 points

0 comments · 12 min read · LW link

EchoSeed: GlyphChains, Collapse Laws, and a Framework for Bearing Consequences

retreat00026 Jul 2025 20:35 UTC

1 point

0 comments · 1 min read · LW link

On the deep (uncurable?) vulnerability of MCPs

awu

19 Jul 2025 2:50 UTC

5 points

6 comments · 1 min read · LW link

(www.generalanalysis.com)

Can We Trust the Judge? A novel method of Modelling Human Bias and Systematic Error in Debate-Based Scalable Oversight

Andreea Zaman

19 Jul 2025 21:44 UTC

1 point

0 comments · 7 min read · LW link

Using Older AI Models as a Form of Boycott

Jacob121 Jul 2025 12:18 UTC

6 points

2 comments · 1 min read · LW link

From Unruly Stacks to Organized Shelves: Toy Model Validation of Structured Priors in Sparse Autoencoders

Yuxiao

6 Jul 2025 7:03 UTC

9 points

0 comments · 5 min read · LW link

Anthropic Fall 2023 Debate Progress Update

Ansh Radhakrishnan

28 Nov 2023 5:37 UTC

76 points

9 comments · 12 min read · LW link

An AI-in-a-box success model

azsantosk

11 Apr 2022 22:28 UTC

16 points

1 comment · 10 min read · LW link

Surprised by ELK report’s counterexample to Debate, IDA

Evan R. Murphy

4 Aug 2022 2:12 UTC

22 points

0 comments · 5 min read · LW link

Hybrid Reflective Learning Systems (HRLS): From Fear-Based Safety to Ethical Comprehension

Petra Vojtaššáková

22 Oct 2025 22:06 UTC

1 point

0 comments · 4 min read · LW link

Could LLM Hallucination Be a Learned Artifact of Virality-Weighted Corpora?

Gizmet

27 Oct 2025 23:58 UTC

1 point

0 comments · 2 min read · LW link

Parallels Between AI Safety by Debate and Evidence Law

Cullen

20 Jul 2020 22:52 UTC

10 points

1 comment · 2 min read · LW link

(cullenokeefe.com)

From Oragnized Shelves to Layered Catalogs: Architectural Explorations for Sparse Autoencoders—Crosscoders & Ladder SAEs Towards Hierarchical Data Structure

Yuxiao

10 Aug 2025 10:12 UTC

2 points

1 comment · 11 min read · LW link

Empathy bandaid for immediate AI catastrophe

installgentoo

5 Apr 2023 2:12 UTC

1 point

2 comments · 1 min read · LW link

AI debate: test yourself against chess ‘AIs’

Richard Willis

22 Nov 2023 14:58 UTC

26 points

35 comments · 4 min read · LW link

AI Safety 101 - Chapter 5.1 - Debate

Charbel-Raphaël

31 Oct 2023 14:29 UTC

15 points

0 comments · 13 min read · LW link

Reinforcement Learning from Information Bazaar Feedback, and other uses of information markets

Abhimanyu Pallavi Sudhir

16 Sep 2024 1:04 UTC

5 points

2 comments · 5 min read · LW link

truth.integrity(): A Recursive Framework for Hallucination Prevention and Alignment

brittneyluong

2 Apr 2025 17:52 UTC

1 point

0 comments · 2 min read · LW link

[Question] Enhanced Clarity to Bridge the AI Labeling Gap?

Pathways

26 Jan 2025 6:48 UTC

1 point

0 comments · 1 min read · LW link

A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives

Knight Lee

14 Apr 2025 10:27 UTC

−3 points

2 comments · 4 min read · LW link

Evaluating Superhuman Models with Consistency Checks

Daniel Paleka and Lukas Fluri

1 Aug 2023 7:51 UTC

21 points

2 comments · 9 min read · LW link

(arxiv.org)

Superposition Checkers: A Game Where AI’s Strengths Become Fatal Flaws

R. A. McCormack

6 Apr 2025 0:57 UTC

1 point

0 comments · 2 min read · LW link

Arguing for the Truth? An Inference-Only Study into AI Debate

denisemester

11 Feb 2025 3:04 UTC

7 points

0 comments · 16 min read · LW link

Mapping the Conceptual Territory in AI Existential Safety and Alignment

jbkjr

12 Feb 2021 7:55 UTC

15 points

0 comments · 27 min read · LW link

AI Debate Stability: Addressing Self-Defeating Responses

Annie Sorkin

11 Jun 2024 3:03 UTC

9 points

0 comments · 3 min read · LW link

LLM Sycophancy: grooming, proto-sentience, or both?

gturner413 Oct 2025 0:58 UTC

1 point

0 comments · 2 min read · LW link

The Trinity Model: Toward a Framework for Decision Integrity and Recursive Trust

praveenshiraaaa96-Reflection

5 Sep 2025 6:58 UTC

1 point

0 comments · 1 min read · LW link

AI Safety Debate and Its Applications

VojtaKovarik

23 Jul 2019 22:31 UTC

38 points

5 comments · 11 min read · LW link

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy

12 May 2022 20:01 UTC

58 points

0 comments · 59 min read · LW link

Rant on Problem Factorization for Alignment

johnswentworth

5 Aug 2022 19:23 UTC

106 points

53 comments · 7 min read · LW link

AI Unsafety via Non-Zero-Sum Debate

VojtaKovarik

3 Jul 2020 22:03 UTC

25 points

10 comments · 5 min read · LW link

An Analysis on the P0 Logical Flaw in RLHF: Maximum Rationality and “Logical Suicide”

R. L. Harrison

26 Oct 2025 15:18 UTC

1 point

0 comments · 1 min read · LW link

Interpretability is the best path to alignment

Arch223

5 Sep 2025 4:37 UTC

2 points

6 comments · 5 min read · LW link

The “AI Debate” Debate

michaelcohen

2 Jul 2020 10:16 UTC

20 points

20 comments · 3 min read · LW link

“Artificial Remorse: A Proposal for Safer AI Through Simulated Regret”

Sérgio Geraldes

21 Sep 2025 21:50 UTC

−1 points

0 comments · 2 min read · LW link

Debate AI and the Decision to Release an AI

Chris_Leong

17 Jan 2019 14:36 UTC

9 points

18 comments · 3 min read · LW link

Alignment via prosocial brain algorithms

Cameron Berg

12 Sep 2022 13:48 UTC

45 points

30 comments · 6 min read · LW link

NYU Debate Training Update: Methods, Baselines, Preliminary Results

samarnesen

6 Jul 2024 18:28 UTC

9 points

0 comments · 20 min read · LW link

# Emotion Is Structure: Toward Recursive Alignment Through Human–AI Co-Creation

thesignalthatcouldntbeheard

3 Aug 2025 5:19 UTC

1 point

0 comments · 3 min read · LW link

AI-Written Critiques Help Humans Notice Flaws

paulfchristiano

25 Jun 2022 17:22 UTC

137 points

5 comments · 3 min read · LW link

(openai.com)

Debating with More Persuasive LLMs Leads to More Truthful Answers

Akbir Khan, John Hughes, Dan Valentine, Sam Bowman and Ethan Perez

7 Feb 2024 21:28 UTC

89 points

14 comments · 9 min read · LW link

(arxiv.org)

From Messy Shelves to Master Librarians: Toy-Model Exploration of Block-Diagonal Geometry in LM Activations

Yuxiao

19 Jul 2025 12:26 UTC

6 points

1 comment · 4 min read · LW link

[Question] Beyond Benchmarks: A Psychometric Approach to AI Evaluation

Kareem Soliman

27 Jul 2025 16:09 UTC

1 point

0 comments · 8 min read · LW link

Learning the smooth prior

Geoffrey Irving, Rohin Shah and evhub

29 Apr 2022 21:10 UTC

35 points

0 comments · 12 min read · LW link

Alignment Gaps

kcyras

8 Jun 2024 15:23 UTC

11 points

4 comments · 8 min read · LW link

Alignment with argument-networks and assessment-predictions

Tor Økland Barstad

13 Dec 2022 2:17 UTC

10 points

5 comments · 45 min read · LW link

NYU Code Debates Update/Postmortem

David Rein

24 May 2024 16:08 UTC

27 points

4 comments · 10 min read · LW link

Risk Tokens: Economic Security in AI Safety

mhdempsey

15 Jun 2025 19:25 UTC

1 point

0 comments · 6 min read · LW link

(www.michaeldempsey.me)

How a Non-Dual Language Could Redefine AI Safety

Marcio Díaz

23 Aug 2025 16:40 UTC

1 point

6 comments · 3 min read · LW link

Prover-Estimator Debate: A New Scalable Oversight Protocol

Jonah Brown-Cohen and Geoffrey Irving

17 Jun 2025 13:53 UTC

88 points

18 comments · 5 min read · LW link

On scalable oversight with weak LLMs judging strong LLMs

zac_kenton, Noah Siegel, janos, Jonah Brown-Cohen, Samuel Albanie, David Lindner and Rohin Shah

8 Jul 2024 8:59 UTC

49 points

18 comments · 7 min read · LW link

(arxiv.org)

Embracing complexity when developing and evaluating AI responsibly

Aliya Amirova

11 Oct 2024 17:46 UTC

3 points

9 comments · 9 min read · LW link

Debate helps supervise human experts [Paper]

habryka

17 Nov 2023 5:25 UTC

29 points

6 comments · 1 min read · LW link

(github.com)

Looking for feedback on proposed AI health risk scoring framework

Yasmin

27 Sep 2025 19:29 UTC

1 point

0 comments · 1 min read · LW link

Can there be an indescribable hellworld?

Stuart_Armstrong

29 Jan 2019 15:00 UTC

39 points

19 comments · 2 min read · LW link

OpenAI Credit Account (2510$)

Emirhan BULUT

21 Jan 2024 2:32 UTC

1 point

0 comments · 1 min read · LW link

A Grounded UX Layer for LLMs That Could Prevent Real Harm

ParityMind

11 Jul 2025 18:19 UTC

1 point

0 comments · 1 min read · LW link

Zach Stein-Perlman 16 Dec 2024 21:32 UTC
2 points
The more important zoom-level is: debate is a proposed technique to provide a good training signal. See e.g. https://www.lesswrong.com/posts/eq2aJt8ZqMaGhBu3r/zach-stein-perlman-s-shortform?commentId=DLYDeiumQPWv4pdZ4.
Edit: debate is a technique for iterated amplification—but that tag is terrible too, oh no