
AI-Assisted Alignment

Last edit: 25 Jan 2024 5:18 UTC by habryka

AI-Assisted Alignment is a cluster of alignment plans in which AI itself plays a significant role in alignment research. This ranges from weak tool AI that assists human researchers to more advanced AGI that conducts original alignment research.

There has been a lot of debate about how practical this alignment approach is.

Other search terms for this tag: AI aligning AI

W2SG: Introduction

Maria Kapros · 10 Mar 2024 16:25 UTC · −1 points · 2 comments · 10 min read · LW link

A Review of Weak to Strong Generalization [AI Safety Camp]

sevdeawesome · 7 Mar 2024 17:16 UTC · 9 points · 0 comments · 9 min read · LW link

Alignment in Thought Chains

Faust Nemesis · 4 Mar 2024 19:24 UTC · 1 point · 0 comments · 2 min read · LW link

Paper review: “The Unreasonable Effectiveness of Easy Training Data for Hard Tasks”

Vassil Tashev · 29 Feb 2024 18:44 UTC · 11 points · 0 comments · 4 min read · LW link

[Question] Can we get an AI to do our alignment homework for us?

Chris_Leong · 26 Feb 2024 7:56 UTC · 53 points · 33 comments · 1 min read · LW link

Requirements for a Basin of Attraction to Alignment

RogerDearnaley · 14 Feb 2024 7:10 UTC · 20 points · 6 comments · 31 min read · LW link

The Ideal Speech Situation as a Tool for AI Ethical Reflection: A Framework for Alignment

kenneth myers · 9 Feb 2024 18:40 UTC · 6 points · 12 comments · 3 min read · LW link

How to Control an LLM’s Behavior (why my P(DOOM) went down)

RogerDearnaley · 28 Nov 2023 19:56 UTC · 64 points · 30 comments · 11 min read · LW link

AISC project: How promising is automating alignment research? (literature review)

Bogdan Ionut Cirstea · 28 Nov 2023 14:47 UTC · 4 points · 1 comment · 1 min read · LW link (docs.google.com)

1. A Sense of Fairness: Deconfusing Ethics

RogerDearnaley · 17 Nov 2023 20:55 UTC · 15 points · 8 comments · 15 min read · LW link

[Question] Shouldn’t we ‘Just’ Superimitate Low-Res Uploads?

lukemarks · 3 Nov 2023 7:42 UTC · 15 points · 2 comments · 2 min read · LW link

Could We Automate AI Alignment Research?

Stephen McAleese · 10 Aug 2023 12:17 UTC · 27 points · 10 comments · 21 min read · LW link

[Question] Have you ever considered taking the ‘Turing Test’ yourself?

Super AGI · 27 Jul 2023 3:48 UTC · 2 points · 6 comments · 1 min read · LW link

Robustness of Model-Graded Evaluations and Automated Interpretability

15 Jul 2023 19:12 UTC · 43 points · 5 comments · 9 min read · LW link

How I Learned To Stop Worrying And Love The Shoggoth

Peter Merel · 12 Jul 2023 17:47 UTC · 10 points · 9 comments · 5 min read · LW link

OpenAI Launches Superalignment Taskforce

Zvi · 11 Jul 2023 13:00 UTC · 149 points · 40 comments · 49 min read · LW link (thezvi.wordpress.com)

Internal independent review for language model agent alignment

Seth Herd · 7 Jul 2023 6:54 UTC · 53 points · 26 comments · 11 min read · LW link

[Linkpost] Introducing Superalignment

beren · 5 Jul 2023 18:23 UTC · 173 points · 68 comments · 1 min read · LW link (openai.com)

Philosophical Cyborg (Part 2)...or, The Good Successor

ukc10014 · 21 Jun 2023 15:43 UTC · 21 points · 1 comment · 31 min read · LW link

A potentially high impact differential technological development area

Noosphere89 · 8 Jun 2023 14:33 UTC · 5 points · 2 comments · 2 min read · LW link

An LLM-based “exemplary actor”

Roman Leventov · 29 May 2023 11:12 UTC · 16 points · 0 comments · 12 min read · LW link

Proposed Alignment Technique: OSNR (Output Sanitization via Noising and Reconstruction) for Safer Usage of Potentially Misaligned AGI

sudo · 29 May 2023 1:35 UTC · 14 points · 9 comments · 6 min read · LW link

Some thoughts on automating alignment research

Lukas Finnveden · 26 May 2023 1:50 UTC · 30 points · 4 comments · 6 min read · LW link

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)

RogerDearnaley · 25 May 2023 9:26 UTC · 32 points · 3 comments · 15 min read · LW link

Misaligned AGI Death Match

Nate Reinar Windwood · 14 May 2023 18:00 UTC · 1 point · 0 comments · 1 min read · LW link

Annotated reply to Bengio’s “AI Scientists: Safe and Useful AI?”

Roman Leventov · 8 May 2023 21:26 UTC · 18 points · 2 comments · 7 min read · LW link (yoshuabengio.org)

How to express this system for ethically aligned AGI as a Mathematical formula?

Oliver Siegel · 19 Apr 2023 20:13 UTC · −1 points · 0 comments · 1 min read · LW link

Davidad’s Bold Plan for Alignment: An In-Depth Explanation

19 Apr 2023 16:09 UTC · 154 points · 29 comments · 21 min read · LW link

Scientism vs. people

Roman Leventov · 18 Apr 2023 17:28 UTC · 4 points · 4 comments · 11 min read · LW link

Capabilities and alignment of LLM cognitive architectures

Seth Herd · 18 Apr 2023 16:29 UTC · 80 points · 18 comments · 20 min read · LW link

Agentized LLMs will change the alignment landscape

Seth Herd · 9 Apr 2023 2:29 UTC · 153 points · 95 comments · 3 min read · LW link

[Question] Daisy-chaining epsilon-step verifiers

Decaeneus · 6 Apr 2023 2:07 UTC · 2 points · 1 comment · 1 min read · LW link

Introducing AlignmentSearch: An AI Alignment-Informed Conversional Agent

1 Apr 2023 16:39 UTC · 79 points · 14 comments · 4 min read · LW link

AI-assisted alignment proposals require specific decomposition of capabilities

RobertM · 30 Mar 2023 21:31 UTC · 16 points · 2 comments · 6 min read · LW link

“Unintentional AI safety research”: Why not systematically mine AI technical research for safety purposes?

ghostwheel · 29 Mar 2023 15:56 UTC · 27 points · 3 comments · 6 min read · LW link

We have to Upgrade

Jed McCaleb · 23 Mar 2023 17:53 UTC · 126 points · 35 comments · 2 min read · LW link

Exploring the Precautionary Principle in AI Development: Historical Analogies and Lessons Learned

Christopher King · 21 Mar 2023 3:53 UTC · −1 points · 2 comments · 9 min read · LW link

“Carefully Bootstrapped Alignment” is organizationally hard

Raemon · 17 Mar 2023 18:00 UTC · 258 points · 22 comments · 11 min read · LW link

Discussion with Nate Soares on a key alignment difficulty

HoldenKarnofsky · 13 Mar 2023 21:20 UTC · 250 points · 38 comments · 22 min read · LW link

Why Not Just Outsource Alignment Research To An AI?

johnswentworth · 9 Mar 2023 21:49 UTC · 126 points · 47 comments · 9 min read · LW link

Project “MIRI as a Service”

RomanS · 8 Mar 2023 19:22 UTC · 42 points · 4 comments · 1 min read · LW link

Introducing AI Alignment Inc., a California public benefit corporation...

TherapistAI · 7 Mar 2023 18:47 UTC · 1 point · 4 comments · 1 min read · LW link

Why Not Just… Build Weak AI Tools For AI Alignment Research?

johnswentworth · 5 Mar 2023 0:12 UTC · 156 points · 17 comments · 6 min read · LW link

Curiosity as a Solution to AGI Alignment

Harsha G. · 26 Feb 2023 23:36 UTC · 7 points · 7 comments · 3 min read · LW link

Cyborg Periods: There will be multiple AI transitions

22 Feb 2023 16:09 UTC · 103 points · 9 comments · 6 min read · LW link

Cyborgism

10 Feb 2023 14:47 UTC · 333 points · 45 comments · 35 min read · LW link

Research Direction: Be the AGI you want to see in the world

5 Feb 2023 7:15 UTC · 43 points · 0 comments · 7 min read · LW link

Eli Lifland on Navigating the AI Alignment Landscape

ozziegooen · 1 Feb 2023 21:17 UTC · 9 points · 1 comment · 31 min read · LW link (quri.substack.com)

Model-driven feedback could amplify alignment failures

aogara · 30 Jan 2023 0:00 UTC · 21 points · 1 comment · 2 min read · LW link

Reflections on Deception & Generality in Scalable Oversight (Another OpenAI Alignment Review)

Shoshannah Tekofsky · 28 Jan 2023 5:26 UTC · 53 points · 7 comments · 7 min read · LW link

[Question] What specific thing would you do with AI Alignment Research Assistant GPT?

quetzal_rainbow · 8 Jan 2023 19:24 UTC · 45 points · 9 comments · 1 min read · LW link

[Linkpost] Jan Leike on three kinds of alignment taxes

Akash · 6 Jan 2023 23:57 UTC · 27 points · 2 comments · 3 min read · LW link (aligned.substack.com)

My thoughts on OpenAI’s alignment plan

Akash · 30 Dec 2022 19:33 UTC · 55 points · 3 comments · 20 min read · LW link

Results from a survey on tool use and workflows in alignment research

19 Dec 2022 15:19 UTC · 79 points · 2 comments · 19 min read · LW link

Alignment with argument-networks and assessment-predictions

Tor Økland Barstad · 13 Dec 2022 2:17 UTC · 10 points · 5 comments · 45 min read · LW link

[Link] Why I’m optimistic about OpenAI’s alignment approach

janleike · 5 Dec 2022 22:51 UTC · 98 points · 15 comments · 1 min read · LW link (aligned.substack.com)

Research request (alignment strategy): Deep dive on “making AI solve alignment for us”

JanB · 1 Dec 2022 14:55 UTC · 16 points · 3 comments · 1 min read · LW link

Provably Honest—A First Step

Srijanak De · 5 Nov 2022 19:18 UTC · 10 points · 2 comments · 8 min read · LW link

Infinite Possibility Space and the Shutdown Problem

magfrump · 18 Oct 2022 5:37 UTC · 6 points · 0 comments · 2 min read · LW link (www.magfrump.net)

How should DeepMind’s Chinchilla revise our AI forecasts?

Cleo Nardo · 15 Sep 2022 17:54 UTC · 35 points · 12 comments · 13 min read · LW link

AI-assisted list of ten concrete alignment things to do right now

lukehmiles · 7 Sep 2022 8:38 UTC · 8 points · 5 comments · 4 min read · LW link

Sufficiently many Godzillas as an alignment strategy

142857 · 28 Aug 2022 0:08 UTC · 8 points · 3 comments · 1 min read · LW link

Beliefs and Disagreements about Automating Alignment Research

Ian McKenzie · 24 Aug 2022 18:37 UTC · 107 points · 4 comments · 7 min read · LW link

[Question] Would you ask a genie to give you the solution to alignment?

sudo · 24 Aug 2022 1:29 UTC · 6 points · 1 comment · 1 min read · LW link

Discussion on utilizing AI for alignment

elifland · 23 Aug 2022 2:36 UTC · 16 points · 3 comments · 1 min read · LW link (www.foxy-scout.com)

Human Mimicry Mainly Works When We’re Already Close

johnswentworth · 17 Aug 2022 18:41 UTC · 80 points · 16 comments · 5 min read · LW link

Conditioning Generative Models for Alignment

Jozdien · 18 Jul 2022 7:11 UTC · 58 points · 8 comments · 20 min read · LW link

Making it harder for an AGI to “trick” us, with STVs

Tor Økland Barstad · 9 Jul 2022 14:42 UTC · 15 points · 5 comments · 22 min read · LW link

Getting from an unaligned AGI to an aligned AGI?

Tor Økland Barstad · 21 Jun 2022 12:36 UTC · 13 points · 7 comments · 9 min read · LW link

Godzilla Strategies

johnswentworth · 11 Jun 2022 15:44 UTC · 145 points · 71 comments · 3 min read · LW link

Prize for Alignment Research Tasks

29 Apr 2022 8:57 UTC · 64 points · 38 comments · 10 min read · LW link

[Link] A minimal viable product for alignment

janleike · 6 Apr 2022 15:38 UTC · 53 points · 38 comments · 1 min read · LW link

A survey of tool use and workflows in alignment research

23 Mar 2022 23:44 UTC · 45 points · 4 comments · 1 min read · LW link

Ngo and Yudkowsky on alignment difficulty

15 Nov 2021 20:31 UTC · 250 points · 148 comments · 99 min read · LW link · 1 review