Debate (AI safety technique)

Tag

Last edit: 6 Feb 2023 0:35 UTC by plex

Debate is a proposed technique for allowing human evaluators to get correct and helpful answers from experts, even if the evaluators are not themselves experts or able to fully verify the answers.^[1] The technique was suggested as part of an approach to building advanced AI systems that are aligned with human values, and to safely applying machine learning techniques to problems that are high-stakes but not well-defined (such as advancing science or increasing a company’s revenue).^[2]^[3]

Briefly thinking through some analogs of debate

Eli Tyre

11 Sep 2022 12:02 UTC

20 points

3 comments, 4 min read, LW link

Writeup: Progress on AI Safety via Debate

Beth Barnes and paulfchristiano

5 Feb 2020 21:04 UTC

100 points

18 comments, 33 min read, LW link

A guide to Iterated Amplification & Debate

Rafael Harth

15 Nov 2020 17:14 UTC

75 points

12 comments, 15 min read, LW link

Debate update: Obfuscated arguments problem

Beth Barnes

23 Dec 2020 3:24 UTC

135 points

24 comments, 16 min read, LW link

Thoughts on AI Safety via Debate

Vaniver

9 May 2018 19:46 UTC

35 points

12 comments, 6 min read, LW link

[Question] How should AI debate be judged?

abramdemski

15 Jul 2020 22:20 UTC

49 points

26 comments, 6 min read, LW link

AI Safety via Debate

ESRogs

5 May 2018 2:11 UTC

27 points

14 comments, 1 min read, LW link

(blog.openai.com)

Optimal play in human-judged Debate usually won’t answer your question

Joe_Collman

27 Jan 2021 7:34 UTC

33 points

12 comments, 12 min read, LW link

An overview of 11 proposals for building safe advanced AI

evhub

29 May 2020 20:38 UTC

211 points

36 comments, 38 min read, LW link, 2 reviews

A Small Negative Result on Debate

Sam Bowman

12 Apr 2022 18:19 UTC

42 points

11 comments, 1 min read, LW link

The limits of AI safety via debate

Marius Hobbhahn

10 May 2022 13:33 UTC

29 points

7 comments, 10 min read, LW link

Splitting Debate up into Two Subsystems

Nandi

3 Jul 2020 20:11 UTC

13 points

5 comments, 4 min read, LW link

AXRP Episode 16 - Preparing for Debate AI with Geoffrey Irving

DanielFilan

1 Jul 2022 22:20 UTC

20 points

0 comments, 37 min read, LW link

Three mental images from thinking about AGI debate & corrigibility

Steven Byrnes

3 Aug 2020 14:29 UTC

55 points

35 comments, 4 min read, LW link

Synthesizing amplification and debate

evhub

5 Feb 2020 22:53 UTC

33 points

10 comments, 4 min read, LW link

Looking for adversarial collaborators to test our Debate protocol

Beth Barnes

19 Aug 2020 3:15 UTC

52 points

5 comments, 1 min read, LW link

Thoughts on “AI safety via debate”

Gordon Seidoh Worley

10 May 2018 0:44 UTC

12 points

4 comments, 5 min read, LW link

Take 9: No, RLHF/IDA/debate doesn’t solve outer alignment.

Charlie Steiner

12 Dec 2022 11:51 UTC

33 points

14 comments, 2 min read, LW link

Clarifying Factored Cognition

Rafael Harth

13 Dec 2020 20:02 UTC

23 points

2 comments, 3 min read, LW link

Why I’m not working on {debate, RRM, ELK, natural abstractions}

Steven Byrnes

10 Feb 2023 19:22 UTC

71 points

19 comments, 9 min read, LW link

[New LW Feature] “Debates”

Ruby, RobertM, GPT-4 and Claude+

1 Apr 2023 7:00 UTC

113 points

34 comments, 1 min read, LW link

Comparing AI Alignment Approaches to Minimize False Positive Risk

Gordon Seidoh Worley

30 Jun 2020 19:34 UTC

5 points

0 comments, 9 min read, LW link

Idealized Factored Cognition

Rafael Harth

30 Nov 2020 18:49 UTC

34 points

6 comments, 11 min read, LW link

Traversing a Cognition Space

Rafael Harth

7 Dec 2020 18:32 UTC

17 points

5 comments, 12 min read, LW link

Deception Chess: Game #2

Zane

29 Nov 2023 2:43 UTC

29 points

17 comments, 2 min read, LW link

Imitative Generalisation (AKA ‘Learning the Prior’)

Beth Barnes

10 Jan 2021 0:30 UTC

107 points

15 comments, 11 min read, LW link, 1 review

Why I’m excited about Debate

Richard_Ngo

15 Jan 2021 23:37 UTC

75 points

12 comments, 7 min read, LW link

FC final: Can Factored Cognition schemes scale?

Rafael Harth

24 Jan 2021 22:18 UTC

17 points

0 comments, 17 min read, LW link

AXRP Episode 6 - Debate and Imitative Generalization with Beth Barnes

DanielFilan

8 Apr 2021 21:20 UTC

26 points

3 comments, 60 min read, LW link

My Overview of the AI Alignment Landscape: A Bird’s Eye View

Neel Nanda

15 Dec 2021 23:44 UTC

127 points

9 comments, 15 min read, LW link

Alignment via prosocial brain algorithms

Cameron Berg

12 Sep 2022 13:48 UTC

44 points

28 comments, 6 min read, LW link

The “AI Debate” Debate

michaelcohen

2 Jul 2020 10:16 UTC

20 points

20 comments, 3 min read, LW link

AI Unsafety via Non-Zero-Sum Debate

VojtaKovarik

3 Jul 2020 22:03 UTC

25 points

10 comments, 5 min read, LW link

Questions about Value Lock-in, Paternalism, and Empowerment

Sam F. Brown

16 Nov 2022 15:33 UTC

13 points

2 comments, 12 min read, LW link

(sambrown.eu)

Notes on OpenAI’s alignment plan

Alex Flint

8 Dec 2022 19:13 UTC

40 points

5 comments, 7 min read, LW link

Alignment with argument-networks and assessment-predictions

Tor Økland Barstad

13 Dec 2022 2:17 UTC

10 points

5 comments, 45 min read, LW link

Anthropic Fall 2023 Debate Progress Update

Ansh Radhakrishnan

28 Nov 2023 5:37 UTC

74 points

9 comments, 12 min read, LW link

OpenAI Credit Account (2510$)

Emirhan BULUT

21 Jan 2024 2:32 UTC

1 point

0 comments, 1 min read, LW link

Debating with More Persuasive LLMs Leads to More Truthful Answers

Akbir Khan, John Hughes, Dan Valentine, Sam Bowman and Ethan Perez

7 Feb 2024 21:28 UTC

87 points

14 comments, 9 min read, LW link

(arxiv.org)

Alignment Gaps

kcyras

8 Jun 2024 15:23 UTC

10 points

3 comments, 8 min read, LW link

AI Debate Stability: Addressing Self-Defeating Responses

Annie Sorkin

11 Jun 2024 3:03 UTC

9 points

0 comments, 3 min read, LW link

NYU Code Debates Update/Postmortem

David Rein

24 May 2024 16:08 UTC

26 points

4 comments, 10 min read, LW link

On scalable oversight with weak LLMs judging strong LLMs

zac_kenton, Noah Siegel, janos, Jonah Brown-Cohen, Samuel Albanie, David Lindner and Rohin Shah

8 Jul 2024 8:59 UTC

48 points

18 comments, 7 min read, LW link

(arxiv.org)

NYU Debate Training Update: Methods, Baselines, Preliminary Results

samarnesen

6 Jul 2024 18:28 UTC

9 points

0 comments, 20 min read, LW link

Control Vectors as Dispositional Traits

Gianluca Calcagni

23 Jun 2024 21:34 UTC

3 points

0 comments, 11 min read, LW link

Mapping the Conceptual Territory in AI Existential Safety and Alignment

jbkjr

12 Feb 2021 7:55 UTC

15 points

0 comments, 27 min read, LW link

Debate Minus Factored Cognition

abramdemski

29 Dec 2020 22:59 UTC

37 points

42 comments, 11 min read, LW link

Can there be an indescribable hellworld?

Stuart_Armstrong

29 Jan 2019 15:00 UTC

39 points

19 comments, 2 min read, LW link

Empathy bandaid for immediate AI catastrophe

installgentoo

5 Apr 2023 2:12 UTC

1 point

2 comments, 1 min read, LW link

Debate helps supervise human experts [Paper]

habryka

17 Nov 2023 5:25 UTC

29 points

6 comments, 1 min read, LW link

(github.com)

AI debate: test yourself against chess ‘AIs’

Richard Willis

22 Nov 2023 14:58 UTC

26 points

35 comments, 4 min read, LW link

Parallels Between AI Safety by Debate and Evidence Law

Cullen

20 Jul 2020 22:52 UTC

10 points

1 comment, 2 min read, LW link

(cullenokeefe.com)

AI Safety Debate and Its Applications

VojtaKovarik

23 Jul 2019 22:31 UTC

38 points

5 comments, 12 min read, LW link

New paper: (When) is Truth-telling Favored in AI debate?

VojtaKovarik

26 Dec 2019 19:59 UTC

32 points

7 comments, 5 min read, LW link

(medium.com)

Problems with AI debate

Stuart_Armstrong

26 Aug 2019 19:21 UTC

21 points

3 comments, 5 min read, LW link

Evaluating Superhuman Models with Consistency Checks

Daniel Paleka and Lukas Fluri

1 Aug 2023 7:51 UTC

21 points

2 comments, 9 min read, LW link

(arxiv.org)

AI Safety 101 - Chapter 5.1 - Debate

Charbel-Raphaël

31 Oct 2023 14:29 UTC

14 points

0 comments, 13 min read, LW link

An AI-in-a-box success model

azsantosk

11 Apr 2022 22:28 UTC

16 points

1 comment, 10 min read, LW link

Learning the smooth prior

Geoffrey Irving, Rohin Shah and evhub

29 Apr 2022 21:10 UTC

35 points

0 comments, 12 min read, LW link

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy

12 May 2022 20:01 UTC

53 points

0 comments, 59 min read, LW link

AI-Written Critiques Help Humans Notice Flaws

paulfchristiano

25 Jun 2022 17:22 UTC

137 points

5 comments, 3 min read, LW link

(openai.com)

Surprised by ELK report’s counterexample to Debate, IDA

Evan R. Murphy

4 Aug 2022 2:12 UTC

18 points

0 comments, 5 min read, LW link

Rant on Problem Factorization for Alignment

johnswentworth

5 Aug 2022 19:23 UTC

90 points

51 comments, 6 min read, LW link

Debate AI and the Decision to Release an AI

Chris_Leong

17 Jan 2019 14:36 UTC

9 points

18 comments, 3 min read, LW link
