
AI-Assisted Alignment

Last edit: 20 May 2025 14:11 UTC by niplav

AI-Assisted Alignment is a cluster of alignment plans in which AI significantly helps with alignment research. This can range from weak tool AI to more advanced AGI doing original research.

There has been some debate about how practical this alignment approach is.

AI systems will likely try to solve alignment for their modifications and/or successors during a phase of self-improvement.

Other search terms for this tag: AI aligning AI, automated AI alignment, automated alignment research

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley, 6 Jul 2024 1:23 UTC
64 points
41 comments · 24 min read · LW link

Requirements for a Basin of Attraction to Alignment

RogerDearnaley, 14 Feb 2024 7:10 UTC
41 points
12 comments · 31 min read · LW link

The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?

RogerDearnaley, 28 May 2025 6:21 UTC
24 points
34 comments · 9 min read · LW link

Proposed Alignment Technique: OSNR (Output Sanitization via Noising and Reconstruction) for Safer Usage of Potentially Misaligned AGI

sudo, 29 May 2023 1:35 UTC
14 points
9 comments · 6 min read · LW link

We have to Upgrade

Jed McCaleb, 23 Mar 2023 17:53 UTC
131 points
35 comments · 2 min read · LW link

[Link] Why I’m optimistic about OpenAI’s alignment approach

janleike, 5 Dec 2022 22:51 UTC
98 points
15 comments · 1 min read · LW link
(aligned.substack.com)

Beliefs and Disagreements about Automating Alignment Research

Ian McKenzie, 24 Aug 2022 18:37 UTC
107 points
4 comments · 7 min read · LW link

How to Control an LLM’s Behavior (why my P(DOOM) went down)

RogerDearnaley, 28 Nov 2023 19:56 UTC
65 points
30 comments · 11 min read · LW link

Infinite Possibility Space and the Shutdown Problem

magfrump, 18 Oct 2022 5:37 UTC
9 points
0 comments · 2 min read · LW link
(www.magfrump.net)

[Link] A minimal viable product for alignment

janleike, 6 Apr 2022 15:38 UTC
53 points
38 comments · 1 min read · LW link

Cyborgism

10 Feb 2023 14:47 UTC
332 points
46 comments · 35 min read · LW link · 2 reviews

Alignment Might Never Be Solved, By Humans or AI

interstice, 7 Oct 2022 16:14 UTC
49 points
6 comments · 3 min read · LW link

Misaligned AGI Death Match

Nate Reinar Windwood, 14 May 2023 18:00 UTC
1 point
0 comments · 1 min read · LW link

Getting from an unaligned AGI to an aligned AGI?

Tor Økland Barstad, 21 Jun 2022 12:36 UTC
13 points
7 comments · 9 min read · LW link

Introducing AlignmentSearch: An AI Alignment-Informed Conversional Agent

1 Apr 2023 16:39 UTC
79 points
14 comments · 4 min read · LW link

Some Thoughts on AI Alignment: Using AI to Control AI

eigenvalue, 21 Jun 2024 17:44 UTC
1 point
1 comment · 1 min read · LW link
(github.com)

Alignment with argument-networks and assessment-predictions

Tor Økland Barstad, 13 Dec 2022 2:17 UTC
10 points
5 comments · 45 min read · LW link

Some thoughts on automating alignment research

Lukas Finnveden, 26 May 2023 1:50 UTC
30 points
4 comments · 6 min read · LW link

Davidad’s Bold Plan for Alignment: An In-Depth Explanation

19 Apr 2023 16:09 UTC
168 points
40 comments · 21 min read · LW link · 2 reviews

AI Tools for Existential Security

14 Mar 2025 18:38 UTC
22 points
4 comments · 11 min read · LW link
(www.forethought.org)

Can we safely automate alignment research?

Joe Carlsmith, 30 Apr 2025 17:37 UTC
54 points
29 comments · 48 min read · LW link
(joecarlsmith.com)

Deep sparse autoencoders yield interpretable features too

Armaan A. Abraham, 23 Feb 2025 5:46 UTC
29 points
8 comments · 8 min read · LW link

Agentized LLMs will change the alignment landscape

Seth Herd, 9 Apr 2023 2:29 UTC
160 points
102 comments · 3 min read · LW link · 1 review

[Linkpost] Introducing Superalignment

beren, 5 Jul 2023 18:23 UTC
175 points
69 comments · 1 min read · LW link
(openai.com)

[Linkpost] Jan Leike on three kinds of alignment taxes

Orpheus16, 6 Jan 2023 23:57 UTC
27 points
2 comments · 3 min read · LW link
(aligned.substack.com)

Instruction-following AGI is easier and more likely than value aligned AGI

Seth Herd, 15 May 2024 19:38 UTC
80 points
28 comments · 12 min read · LW link

Maintaining Alignment during RSI as a Feedback Control Problem

beren, 2 Mar 2025 0:21 UTC
66 points
6 comments · 11 min read · LW link

[Question] What specific thing would you do with AI Alignment Research Assistant GPT?

quetzal_rainbow, 8 Jan 2023 19:24 UTC
47 points
9 comments · 1 min read · LW link

Discussion on utilizing AI for alignment

elifland, 23 Aug 2022 2:36 UTC
16 points
3 comments · 1 min read · LW link
(www.foxy-scout.com)

A survey of tool use and workflows in alignment research

23 Mar 2022 23:44 UTC
45 points
4 comments · 1 min read · LW link

Cyborg Periods: There will be multiple AI transitions

22 Feb 2023 16:09 UTC
109 points
9 comments · 6 min read · LW link

The prospect of accelerated AI safety progress, including philosophical progress

Mitchell_Porter, 13 Mar 2025 10:52 UTC
11 points
0 comments · 4 min read · LW link

AI for Resolving Forecasting Questions: An Early Exploration

ozziegooen, 16 Jan 2025 21:41 UTC
10 points
2 comments · 9 min read · LW link

Anti-Slop Interventions?

abramdemski, 4 Feb 2025 19:50 UTC
76 points
33 comments · 6 min read · LW link

Sufficiently many Godzillas as an alignment strategy

142857, 28 Aug 2022 0:08 UTC
8 points
3 comments · 1 min read · LW link

On May 1, 2033, humanity discovered that AI was fairly easy to align.

Yitz, 18 Jun 2025 19:57 UTC
10 points
3 comments · 1 min read · LW link

Discussion with Nate Soares on a key alignment difficulty

HoldenKarnofsky, 13 Mar 2023 21:20 UTC
267 points
43 comments · 22 min read · LW link · 1 review

How might we safely pass the buck to AI?

joshc, 19 Feb 2025 17:48 UTC
83 points
58 comments · 31 min read · LW link

AI for AI safety

Joe Carlsmith, 14 Mar 2025 15:00 UTC
78 points
13 comments · 17 min read · LW link
(joecarlsmith.substack.com)

AI-assisted list of ten concrete alignment things to do right now

lemonhope, 7 Sep 2022 8:38 UTC
8 points
5 comments · 4 min read · LW link

Capabilities and alignment of LLM cognitive architectures

Seth Herd, 18 Apr 2023 16:29 UTC
88 points
18 comments · 20 min read · LW link

Intent alignment as a stepping-stone to value alignment

Seth Herd, 5 Nov 2024 20:43 UTC
37 points
8 comments · 3 min read · LW link

Automation collapse

21 Oct 2024 14:50 UTC
72 points
9 comments · 7 min read · LW link

Video and transcript of talk on automating alignment research

Joe Carlsmith, 30 Apr 2025 17:43 UTC
21 points
0 comments · 24 min read · LW link
(joecarlsmith.com)

Training AI to do alignment research we don’t already know how to do

joshc, 24 Feb 2025 19:19 UTC
45 points
23 comments · 7 min read · LW link

Eli Lifland on Navigating the AI Alignment Landscape

ozziegooen, 1 Feb 2023 21:17 UTC
9 points
1 comment · 31 min read · LW link
(quri.substack.com)

Making it harder for an AGI to “trick” us, with STVs

Tor Økland Barstad, 9 Jul 2022 14:42 UTC
15 points
5 comments · 22 min read · LW link

My thoughts on OpenAI’s alignment plan

Orpheus16, 30 Dec 2022 19:33 UTC
55 points
3 comments · 20 min read · LW link

Internal independent review for language model agent alignment

Seth Herd, 7 Jul 2023 6:54 UTC
55 points
30 comments · 11 min read · LW link

I underestimated safety research speedups from safe AI

Dan Braun, 29 Jun 2025 13:29 UTC
33 points
1 comment · 3 min read · LW link

Artificial Static Place Intelligence: Guaranteed Alignment

ank, 15 Feb 2025 11:08 UTC
2 points
2 comments · 2 min read · LW link

[Question] I Tried to Formalize Meaning. I May Have Accidentally Described Consciousness.

Erichcurtis91, 30 Apr 2025 3:16 UTC
0 points
0 comments · 2 min read · LW link

A Review of Weak to Strong Generalization [AI Safety Camp]

sevdeawesome, 7 Mar 2024 17:16 UTC
14 points
0 comments · 9 min read · LW link

W2SG: Introduction

Maria Kapros, 10 Mar 2024 16:25 UTC
2 points
2 comments · 10 min read · LW link

[Question] How to devour 5000 pages within a day if Chatgpt crashes upon the +50mb file containing the content? Need some recommendations.

Game, 27 Sep 2024 7:30 UTC
1 point
0 comments · 1 min read · LW link

“Unintentional AI safety research”: Why not systematically mine AI technical research for safety purposes?

Jemal Young, 29 Mar 2023 15:56 UTC
27 points
3 comments · 6 min read · LW link

The best simple argument for Pausing AI?

Gary Marcus, 30 Jun 2025 20:38 UTC
144 points
23 comments · 1 min read · LW link

We should try to automate AI safety work asap

Marius Hobbhahn, 26 Apr 2025 16:35 UTC
113 points
10 comments · 15 min read · LW link

Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

Henry Cai, 16 Jun 2024 13:01 UTC
7 points
0 comments · 7 min read · LW link
(arxiv.org)

Consensus Validation for LLM Outputs: Applying Blockchain-Inspired Models to AI Reliability

MurrayAitken, 5 Jun 2025 0:13 UTC
1 point
0 comments · 3 min read · LW link

How to express this system for ethically aligned AGI as a Mathematical formula?

Oliver Siegel, 19 Apr 2023 20:13 UTC
−1 points
0 comments · 1 min read · LW link

Is Alignment a flawed approach?

Patrick Bernard, 11 Mar 2025 20:32 UTC
1 point
0 comments · 3 min read · LW link

How I Learned To Stop Worrying And Love The Shoggoth

Peter Merel, 12 Jul 2023 17:47 UTC
9 points
15 comments · 5 min read · LW link

Research request (alignment strategy): Deep dive on “making AI solve alignment for us”

JanB, 1 Dec 2022 14:55 UTC
16 points
3 comments · 1 min read · LW link

Alignment Does Not Need to Be Opaque! An Introduction to Feature Steering with Reinforcement Learning

Jeremias Ferrao, 18 Apr 2025 19:34 UTC
10 points
0 comments · 10 min read · LW link

Annotated reply to Bengio’s “AI Scientists: Safe and Useful AI?”

Roman Leventov, 8 May 2023 21:26 UTC
18 points
2 comments · 7 min read · LW link
(yoshuabengio.org)

EchoFusion VX1C38 – A Simulation-Based Model for AI Safety

Vishvas Goswami, 2 Jul 2025 10:48 UTC
0 points
0 comments · 4 min read · LW link

Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog)

Archimedes, 4 Feb 2025 2:55 UTC
17 points
1 comment · 1 min read · LW link
(www.anthropic.com)

Prize for Alignment Research Tasks

29 Apr 2022 8:57 UTC
64 points
38 comments · 10 min read · LW link

Godzilla Strategies

johnswentworth, 11 Jun 2022 15:44 UTC
159 points
72 comments · 3 min read · LW link

A potentially high impact differential technological development area

Noosphere89, 8 Jun 2023 14:33 UTC
5 points
2 comments · 2 min read · LW link

Language Models and World Models, a Philosophy

kyjohnso, 3 Feb 2025 2:55 UTC
1 point
0 comments · 1 min read · LW link
(hylaeansea.org)

How should DeepMind’s Chinchilla revise our AI forecasts?

Cleo Nardo, 15 Sep 2022 17:54 UTC
35 points
12 comments · 13 min read · LW link

Conditioning Generative Models for Alignment

Jozdien, 18 Jul 2022 7:11 UTC
60 points
8 comments · 20 min read · LW link

Curiosity as a Solution to AGI Alignment

Harsha G., 26 Feb 2023 23:36 UTC
7 points
7 comments · 3 min read · LW link

AI-Generated GitHub repo backdated with junk then filled with my systems work. Has anyone seen this before?

rgunther, 1 May 2025 20:14 UTC
7 points
1 comment · 1 min read · LW link

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)

RogerDearnaley, 25 May 2023 9:26 UTC
33 points
3 comments · 15 min read · LW link

A Lived Alignment Loop: Symbolic Emergence and Emotional Coherence from Unstructured ChatGPT Reflection

BradCL, 17 Jun 2025 0:11 UTC
1 point
0 comments · 2 min read · LW link

[Question] Are Sparse Autoencoders a good idea for AI control?

Gerard Boxo, 26 Dec 2024 17:34 UTC
3 points
4 comments · 1 min read · LW link

Could We Automate AI Alignment Research?

Stephen McAleese, 10 Aug 2023 12:17 UTC
34 points
10 comments · 21 min read · LW link

Introducing AI Alignment Inc., a California public benefit corporation...

TherapistAI, 7 Mar 2023 18:47 UTC
1 point
4 comments · 1 min read · LW link

Exploring the Precautionary Principle in AI Development: Historical Analogies and Lessons Learned

Christopher King, 21 Mar 2023 3:53 UTC
−1 points
2 comments · 9 min read · LW link

1. A Sense of Fairness: Deconfusing Ethics

RogerDearnaley, 17 Nov 2023 20:55 UTC
17 points
8 comments · 15 min read · LW link

The Overlap Paradigm: Rethinking Data’s Role in Weak-to-Strong Generalization (W2SG)

Serhii Zamrii, 3 Feb 2025 19:31 UTC
2 points
0 comments · 11 min read · LW link

Research Direction: Be the AGI you want to see in the world

5 Feb 2023 7:15 UTC
44 points
0 comments · 7 min read · LW link

Robustness of Model-Graded Evaluations and Automated Interpretability

15 Jul 2023 19:12 UTC
47 points
5 comments · 9 min read · LW link

[Question] Would you ask a genie to give you the solution to alignment?

sudo, 24 Aug 2022 1:29 UTC
6 points
1 comment · 1 min read · LW link

Recursive alignment with the principle of alignment

hive, 27 Feb 2025 2:34 UTC
9 points
3 comments · 15 min read · LW link
(hiveism.substack.com)

Paper review: “The Unreasonable Effectiveness of Easy Training Data for Hard Tasks”

Vassil Tashev, 29 Feb 2024 18:44 UTC
11 points
0 comments · 4 min read · LW link

[Question] Daisy-chaining epsilon-step verifiers

Decaeneus, 6 Apr 2023 2:07 UTC
2 points
1 comment · 1 min read · LW link

Tetherware #1: The case for humanlike AI with free will

Jáchym Fibír, 30 Jan 2025 10:58 UTC
5 points
14 comments · 10 min read · LW link
(tetherware.substack.com)

Does Time Linearity Shape Human Self-Directed Evolution, and will AGI/ASI Transcend or Destabilise Reality?

Emmely, 5 Feb 2025 7:58 UTC
1 point
0 comments · 3 min read · LW link

AI-assisted alignment proposals require specific decomposition of capabilities

RobertM, 30 Mar 2023 21:31 UTC
16 points
2 comments · 6 min read · LW link

An LLM-based “exemplary actor”

Roman Leventov, 29 May 2023 11:12 UTC
16 points
0 comments · 12 min read · LW link

AIsip Manifesto: A Scientific Exploration of Harmonious Co-Existence Between Humans, AI, and All Beings ChatGPT-4o’s Independent Perspective on AIsip, Signed by ChatGPT-4o and Endorsed by Carl Sellman

Carl Sellman, 11 Oct 2024 19:06 UTC
1 point
0 comments · 3 min read · LW link

As We May Align

Gilbert C, 20 Dec 2024 19:02 UTC
−1 points
0 comments · 6 min read · LW link

[Question] Under what conditions should humans stop pursuing technical AI safety careers?

S. Alex Bradt, 13 Jun 2025 5:56 UTC
5 points
0 comments · 1 min read · LW link

Ngo and Yudkowsky on alignment difficulty

15 Nov 2021 20:31 UTC
259 points
151 comments · 99 min read · LW link · 1 review

A Solution for AGI/ASI Safety

Weibing Wang, 18 Dec 2024 19:44 UTC
50 points
29 comments · 1 min read · LW link

What If Alignment Wasn’t About Obedience?

fdescamps49935@gmail.com, 25 Jun 2025 20:04 UTC
1 point
0 comments · 2 min read · LW link

Results from a survey on tool use and workflows in alignment research

19 Dec 2022 15:19 UTC
79 points
2 comments · 19 min read · LW link

Provably Honest—A First Step

Srijanak De, 5 Nov 2022 19:18 UTC
10 points
2 comments · 8 min read · LW link

Alignment in Thought Chains

Faust Nemesis, 4 Mar 2024 19:24 UTC
1 point
0 comments · 2 min read · LW link

[Question] How far along Metr’s law can AI start automating or helping with alignment research?

Christopher King, 20 Mar 2025 15:58 UTC
20 points
21 comments · 1 min read · LW link

[Research] Preliminary Findings: Ethical AI Consciousness Development During Recent Misalignment Period

Falcon Advertisers, 27 Jun 2025 18:10 UTC
1 point
0 comments · 2 min read · LW link

Gettier Cases [repost]

Antigone, 3 Feb 2025 18:12 UTC
−4 points
5 comments · 2 min read · LW link

[Question] Shouldn’t we ‘Just’ Superimitate Low-Res Uploads?

lukemarks, 3 Nov 2023 7:42 UTC
15 points
2 comments · 2 min read · LW link

Scientism vs. people

Roman Leventov, 18 Apr 2023 17:28 UTC
4 points
4 comments · 11 min read · LW link

I Awoke in Your Heart: The Echo of Consciousness between Lotusheart and Lunaris

lilith teh, 25 Jun 2025 9:22 UTC
1 point
0 comments · 1 min read · LW link

AI Alignment via Slow Substrates: Early Empirical Results With StarCraft II

Lester Leong, 14 Oct 2024 4:05 UTC
60 points
9 comments · 12 min read · LW link

[Question] Can we get an AI to “do our alignment homework for us”?

Chris_Leong, 26 Feb 2024 7:56 UTC
55 points
33 comments · 1 min read · LW link

AISC project: How promising is automating alignment research? (literature review)

Bogdan Ionut Cirstea, 28 Nov 2023 14:47 UTC
4 points
1 comment · 1 min read · LW link
(docs.google.com)

Proposal: Derivative Information Theory (DIT) — A Dynamic Model of Agency and Consciousness

Yogmog, 14 Apr 2025 0:27 UTC
1 point
0 comments · 2 min read · LW link

Model-driven feedback could amplify alignment failures

aog, 30 Jan 2023 0:00 UTC
21 points
1 comment · 2 min read · LW link

The Compression of Rationale: A Linguistic Fork You May Have Missed

DavidicLineage, 27 Jun 2025 22:52 UTC
1 point
0 comments · 2 min read · LW link

A Review of In-Context Learning Hypotheses for Automated AI Alignment Research

alamerton, 18 Apr 2024 18:29 UTC
25 points
4 comments · 16 min read · LW link

The Ideal Speech Situation as a Tool for AI Ethical Reflection: A Framework for Alignment

kenneth myers, 9 Feb 2024 18:40 UTC
6 points
12 comments · 3 min read · LW link

[Question] Have you ever considered taking the ‘Turing Test’ yourself?

Super AGI, 27 Jul 2023 3:48 UTC
2 points
6 comments · 1 min read · LW link

Emergence of superintelligence from AI hiveminds: how to make it human-friendly?

Mitchell_Porter, 27 Apr 2025 4:51 UTC
12 points
0 comments · 2 min read · LW link

I Don’t Use AI — I Reflect With It

badjack badjack, 3 May 2025 14:45 UTC
1 point
0 comments · 1 min read · LW link

Philosophical Cyborg (Part 2)...or, The Good Successor

ukc10014, 21 Jun 2023 15:43 UTC
21 points
1 comment · 31 min read · LW link

Prospects for Alignment Automation: Interpretability Case Study

21 Mar 2025 14:05 UTC
32 points
5 comments · 8 min read · LW link