
Debate (AI safety technique)

Last edit: 6 Feb 2023 0:35 UTC by plex

Debate is a proposed technique for allowing human evaluators to get correct and helpful answers from experts, even if the evaluator is not an expert themselves and cannot fully verify the answers.[1] The technique was suggested as part of an approach to building advanced AI systems that are aligned with human values, and to safely applying machine learning techniques to problems that have high stakes but are not well-defined (such as advancing science or increasing a company's revenue).[2][3]

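The core protocol described above can be sketched in a few lines: two expert debaters alternate arguments about a question, each seeing the transcript so far, and a non-expert judge picks a winner. This is a minimal illustration only, not any specific implementation; `run_debate` and the toy debaters/judge below are hypothetical stand-ins for trained models and a human evaluator.

```python
def run_debate(question, debater_a, debater_b, judge, rounds=3):
    """Run a fixed-round debate: the two debaters alternate statements,
    each seeing the full transcript so far; the judge then picks a winner."""
    transcript = []
    for _ in range(rounds):
        transcript.append(("A", debater_a(question, transcript)))
        transcript.append(("B", debater_b(question, transcript)))
    return judge(question, transcript)  # returns "A" or "B"

# Toy usage: canned arguments, and a naive judge that favors the side
# whose final statement is more detailed (a stand-in for persuasiveness).
winner = run_debate(
    "Is 1013 prime?",
    lambda q, t: "Yes: 1013 has no divisor between 2 and 31 (sqrt is ~31.8).",
    lambda q, t: "No.",
    lambda q, t: "A" if len(t[-2][1]) > len(t[-1][1]) else "B",
)
```

The safety-relevant claim is not about any particular judge heuristic, but about the hoped-for property that in such games it is easier to argue for true answers than for false ones; several posts below examine when that property holds or fails.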

Briefly thinking through some analogs of debate

Eli Tyre, 11 Sep 2022 12:02 UTC
20 points
3 comments, 4 min read, LW link

Writeup: Progress on AI Safety via Debate

5 Feb 2020 21:04 UTC
100 points
18 comments, 33 min read, LW link

A guide to Iterated Amplification & Debate

Rafael Harth, 15 Nov 2020 17:14 UTC
72 points
9 comments, 15 min read, LW link

Debate update: Obfuscated arguments problem

Beth Barnes, 23 Dec 2020 3:24 UTC
135 points
24 comments, 16 min read, LW link

Thoughts on AI Safety via Debate

Vaniver, 9 May 2018 19:46 UTC
35 points
12 comments, 6 min read, LW link

[Question] How should AI debate be judged?

abramdemski, 15 Jul 2020 22:20 UTC
49 points
26 comments, 6 min read, LW link

AI Safety via Debate

ESRogs, 5 May 2018 2:11 UTC
27 points
14 comments, 1 min read, LW link
(blog.openai.com)

Optimal play in human-judged Debate usually won't answer your question

Joe_Collman, 27 Jan 2021 7:34 UTC
33 points
12 comments, 12 min read, LW link

An overview of 11 proposals for building safe advanced AI

evhub, 29 May 2020 20:38 UTC
205 points
36 comments, 38 min read, LW link, 2 reviews

A Small Negative Result on Debate

Sam Bowman, 12 Apr 2022 18:19 UTC
42 points
11 comments, 1 min read, LW link

The limits of AI safety via debate

Marius Hobbhahn, 10 May 2022 13:33 UTC
29 points
7 comments, 10 min read, LW link

Splitting Debate up into Two Subsystems

Nandi, 3 Jul 2020 20:11 UTC
13 points
5 comments, 4 min read, LW link

AXRP Episode 16 - Preparing for Debate AI with Geoffrey Irving

DanielFilan, 1 Jul 2022 22:20 UTC
17 points
0 comments, 37 min read, LW link

Three mental images from thinking about AGI debate & corrigibility

Steven Byrnes, 3 Aug 2020 14:29 UTC
55 points
35 comments, 4 min read, LW link

Synthesizing amplification and debate

evhub, 5 Feb 2020 22:53 UTC
33 points
10 comments, 4 min read, LW link

Looking for adversarial collaborators to test our Debate protocol

Beth Barnes, 19 Aug 2020 3:15 UTC
52 points
5 comments, 1 min read, LW link

Thoughts on “AI safety via debate”

Gordon Seidoh Worley, 10 May 2018 0:44 UTC
12 points
4 comments, 5 min read, LW link

Take 9: No, RLHF/IDA/debate doesn't solve outer alignment.

Charlie Steiner, 12 Dec 2022 11:51 UTC
33 points
14 comments, 2 min read, LW link

Clarifying Factored Cognition

Rafael Harth, 13 Dec 2020 20:02 UTC
23 points
2 comments, 3 min read, LW link

Why I'm not working on {debate, RRM, ELK, natural abstractions}

Steven Byrnes, 10 Feb 2023 19:22 UTC
70 points
18 comments, 9 min read, LW link

[New LW Feature] “Debates”

1 Apr 2023 7:00 UTC
113 points
34 comments, 1 min read, LW link

Comparing AI Alignment Approaches to Minimize False Positive Risk

Gordon Seidoh Worley, 30 Jun 2020 19:34 UTC
5 points
0 comments, 9 min read, LW link

Idealized Factored Cognition

Rafael Harth, 30 Nov 2020 18:49 UTC
34 points
6 comments, 11 min read, LW link

Traversing a Cognition Space

Rafael Harth, 7 Dec 2020 18:32 UTC
17 points
5 comments, 12 min read, LW link

Deception Chess: Game #2

Zane, 29 Nov 2023 2:43 UTC
29 points
17 comments, 2 min read, LW link

Imitative Generalisation (AKA ‘Learning the Prior’)

Beth Barnes, 10 Jan 2021 0:30 UTC
103 points
15 comments, 12 min read, LW link, 1 review

Why I'm excited about Debate

Richard_Ngo, 15 Jan 2021 23:37 UTC
75 points
12 comments, 7 min read, LW link

FC final: Can Factored Cognition schemes scale?

Rafael Harth, 24 Jan 2021 22:18 UTC
17 points
0 comments, 17 min read, LW link

AXRP Episode 6 - Debate and Imitative Generalization with Beth Barnes

DanielFilan, 8 Apr 2021 21:20 UTC
26 points
3 comments, 60 min read, LW link

My Overview of the AI Alignment Landscape: A Bird's Eye View

Neel Nanda, 15 Dec 2021 23:44 UTC
127 points
9 comments, 15 min read, LW link

Anthropic Fall 2023 Debate Progress Update

Ansh Radhakrishnan, 28 Nov 2023 5:37 UTC
72 points
9 comments, 12 min read, LW link

OpenAI Credit Account (2510$)

Emirhan BULUT, 21 Jan 2024 2:32 UTC
1 point
0 comments, 1 min read, LW link

Debating with More Persuasive LLMs Leads to More Truthful Answers

7 Feb 2024 21:28 UTC
86 points
13 comments, 9 min read, LW link
(arxiv.org)

Mapping the Conceptual Territory in AI Existential Safety and Alignment

jbkjr, 12 Feb 2021 7:55 UTC
15 points
0 comments, 26 min read, LW link

Debate Minus Factored Cognition

abramdemski, 29 Dec 2020 22:59 UTC
37 points
42 comments, 11 min read, LW link

Can there be an indescribable hellworld?

Stuart_Armstrong, 29 Jan 2019 15:00 UTC
39 points
19 comments, 2 min read, LW link

Empathy bandaid for immediate AI catastrophe

installgentoo, 5 Apr 2023 2:12 UTC
1 point
2 comments, 1 min read, LW link

Debate helps supervise human experts [Paper]

habryka, 17 Nov 2023 5:25 UTC
29 points
6 comments, 1 min read, LW link
(github.com)

AI debate: test yourself against chess ‘AIs’

Richard Willis, 22 Nov 2023 14:58 UTC
26 points
35 comments, 4 min read, LW link

Parallels Between AI Safety by Debate and Evidence Law

Cullen, 20 Jul 2020 22:52 UTC
10 points
1 comment, 2 min read, LW link
(cullenokeefe.com)

AI Safety Debate and Its Applications

VojtaKovarik, 23 Jul 2019 22:31 UTC
38 points
5 comments, 12 min read, LW link

New paper: (When) is Truth-telling Favored in AI debate?

VojtaKovarik, 26 Dec 2019 19:59 UTC
32 points
7 comments, 5 min read, LW link
(medium.com)

Problems with AI debate

Stuart_Armstrong, 26 Aug 2019 19:21 UTC
21 points
3 comments, 5 min read, LW link

Evaluating Superhuman Models with Consistency Checks

1 Aug 2023 7:51 UTC
15 points
2 comments, 9 min read, LW link
(arxiv.org)

AI Safety 101 - Chapter 5.1 - Debate

Charbel-Raphaël, 31 Oct 2023 14:29 UTC
10 points
0 comments, 13 min read, LW link

An AI-in-a-box success model

azsantosk, 11 Apr 2022 22:28 UTC
16 points
1 comment, 10 min read, LW link

Learning the smooth prior

29 Apr 2022 21:10 UTC
35 points
0 comments, 12 min read, LW link

Interpretability's Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy, 12 May 2022 20:01 UTC
53 points
0 comments, 59 min read, LW link

AI-Written Critiques Help Humans Notice Flaws

paulfchristiano, 25 Jun 2022 17:22 UTC
137 points
5 comments, 3 min read, LW link
(openai.com)

Surprised by ELK report's counterexample to Debate, IDA

Evan R. Murphy, 4 Aug 2022 2:12 UTC
18 points
0 comments, 5 min read, LW link

Rant on Problem Factorization for Alignment

johnswentworth, 5 Aug 2022 19:23 UTC
87 points
51 comments, 6 min read, LW link

Debate AI and the Decision to Release an AI

Chris_Leong, 17 Jan 2019 14:36 UTC
9 points
18 comments, 3 min read, LW link

Alignment via prosocial brain algorithms

Cameron Berg, 12 Sep 2022 13:48 UTC
42 points
28 comments, 6 min read, LW link

The “AI Debate” Debate

michaelcohen, 2 Jul 2020 10:16 UTC
20 points
20 comments, 3 min read, LW link

AI Unsafety via Non-Zero-Sum Debate

VojtaKovarik, 3 Jul 2020 22:03 UTC
25 points
10 comments, 5 min read, LW link

Questions about Value Lock-in, Paternalism, and Empowerment

Sam F. Brown, 16 Nov 2022 15:33 UTC
13 points
2 comments, 12 min read, LW link
(sambrown.eu)

Notes on OpenAI's alignment plan

Alex Flint, 8 Dec 2022 19:13 UTC
40 points
5 comments, 7 min read, LW link

Alignment with argument-networks and assessment-predictions

Tor Økland Barstad, 13 Dec 2022 2:17 UTC
10 points
5 comments, 45 min read, LW link