
AI-Assisted Alignment

Last edit: May 20, 2025, 2:11 PM by niplav

AI-Assisted Alignment is a cluster of alignment plans in which AI significantly helps with alignment research. This can range from weak tool AI assisting human researchers to more advanced AGI doing original research.

There has been some debate about how practical this alignment approach is.

AI systems will likely try to solve alignment for their modifications and/or successors during a phase of self-improvement.

Other search terms for this tag: AI aligning AI, automated AI alignment, automated alignment research

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley · Jul 6, 2024, 1:23 AM
63 points
41 comments · 24 min read · LW link

Requirements for a Basin of Attraction to Alignment

RogerDearnaley · Feb 14, 2024, 7:10 AM
41 points
12 comments · 31 min read · LW link

The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?

RogerDearnaley · May 28, 2025, 6:21 AM
39 points
7 comments · 8 min read · LW link

Proposed Alignment Technique: OSNR (Output Sanitization via Noising and Reconstruction) for Safer Usage of Potentially Misaligned AGI

sudo · May 29, 2023, 1:35 AM
14 points
9 comments · 6 min read · LW link

We have to Upgrade

Jed McCaleb · Mar 23, 2023, 5:53 PM
131 points
35 comments · 2 min read · LW link

[Link] Why I’m optimistic about OpenAI’s alignment approach

janleike · Dec 5, 2022, 10:51 PM
98 points
15 comments · 1 min read · LW link
(aligned.substack.com)

Beliefs and Disagreements about Automating Alignment Research

Ian McKenzie · Aug 24, 2022, 6:37 PM
107 points
4 comments · 7 min read · LW link

How to Control an LLM’s Behavior (why my P(DOOM) went down)

RogerDearnaley · Nov 28, 2023, 7:56 PM
65 points
30 comments · 11 min read · LW link

Infinite Possibility Space and the Shutdown Problem

magfrump · Oct 18, 2022, 5:37 AM
9 points
0 comments · 2 min read · LW link
(www.magfrump.net)

[Link] A minimal viable product for alignment

janleike · Apr 6, 2022, 3:38 PM
53 points
38 comments · 1 min read · LW link

Cyborgism

Feb 10, 2023, 2:47 PM
332 points
46 comments · 35 min read · LW link · 2 reviews

Alignment Might Never Be Solved, By Humans or AI

interstice · Oct 7, 2022, 4:14 PM
49 points
6 comments · 3 min read · LW link

Misaligned AGI Death Match

Nate Reinar Windwood · May 14, 2023, 6:00 PM
1 point
0 comments · 1 min read · LW link

Getting from an unaligned AGI to an aligned AGI?

Tor Økland Barstad · Jun 21, 2022, 12:36 PM
13 points
7 comments · 9 min read · LW link

Introducing AlignmentSearch: An AI Alignment-Informed Conversational Agent

Apr 1, 2023, 4:39 PM
79 points
14 comments · 4 min read · LW link

Some Thoughts on AI Alignment: Using AI to Control AI

eigenvalue · Jun 21, 2024, 5:44 PM
1 point
1 comment · 1 min read · LW link
(github.com)

Alignment with argument-networks and assessment-predictions

Tor Økland Barstad · Dec 13, 2022, 2:17 AM
10 points
5 comments · 45 min read · LW link

Some thoughts on automating alignment research

Lukas Finnveden · May 26, 2023, 1:50 AM
30 points
4 comments · 6 min read · LW link

Davidad’s Bold Plan for Alignment: An In-Depth Explanation

Apr 19, 2023, 4:09 PM
168 points
40 comments · 21 min read · LW link · 2 reviews

AI Tools for Existential Security

Mar 14, 2025, 6:38 PM
22 points
4 comments · 11 min read · LW link
(www.forethought.org)

Can we safely automate alignment research?

Joe Carlsmith · Apr 30, 2025, 5:37 PM
54 points
29 comments · 48 min read · LW link
(joecarlsmith.com)

Deep sparse autoencoders yield interpretable features too

Armaan A. Abraham · Feb 23, 2025, 5:46 AM
29 points
8 comments · 8 min read · LW link

Agentized LLMs will change the alignment landscape

Seth Herd · Apr 9, 2023, 2:29 AM
160 points
102 comments · 3 min read · LW link · 1 review

[Linkpost] Introducing Superalignment

beren · Jul 5, 2023, 6:23 PM
175 points
69 comments · 1 min read · LW link
(openai.com)

[Linkpost] Jan Leike on three kinds of alignment taxes

Orpheus16 · Jan 6, 2023, 11:57 PM
27 points
2 comments · 3 min read · LW link
(aligned.substack.com)

Instruction-following AGI is easier and more likely than value aligned AGI

Seth Herd · May 15, 2024, 7:38 PM
80 points
28 comments · 12 min read · LW link

Maintaining Alignment during RSI as a Feedback Control Problem

beren · Mar 2, 2025, 12:21 AM
66 points
6 comments · 11 min read · LW link

[Question] What specific thing would you do with AI Alignment Research Assistant GPT?

quetzal_rainbow · Jan 8, 2023, 7:24 PM
47 points
9 comments · 1 min read · LW link

Discussion on utilizing AI for alignment

elifland · Aug 23, 2022, 2:36 AM
16 points
3 comments · 1 min read · LW link
(www.foxy-scout.com)

A survey of tool use and workflows in alignment research

Mar 23, 2022, 11:44 PM
45 points
4 comments · 1 min read · LW link

Cyborg Periods: There will be multiple AI transitions

Feb 22, 2023, 4:09 PM
108 points
9 comments · 6 min read · LW link

The prospect of accelerated AI safety progress, including philosophical progress

Mitchell_Porter · Mar 13, 2025, 10:52 AM
11 points
0 comments · 4 min read · LW link

AI for Resolving Forecasting Questions: An Early Exploration

ozziegooen · Jan 16, 2025, 9:41 PM
10 points
2 comments · 1 min read · LW link

Anti-Slop Interventions?

abramdemski · Feb 4, 2025, 7:50 PM
76 points
33 comments · 6 min read · LW link

Sufficiently many Godzillas as an alignment strategy

142857 · Aug 28, 2022, 12:08 AM
8 points
3 comments · 1 min read · LW link

Discussion with Nate Soares on a key alignment difficulty

HoldenKarnofsky · Mar 13, 2023, 9:20 PM
267 points
43 comments · 22 min read · LW link · 1 review

How might we safely pass the buck to AI?

joshc · Feb 19, 2025, 5:48 PM
83 points
58 comments · 31 min read · LW link

AI for AI safety

Joe Carlsmith · Mar 14, 2025, 3:00 PM
78 points
13 comments · 17 min read · LW link
(joecarlsmith.substack.com)

AI-assisted list of ten concrete alignment things to do right now

lemonhope · Sep 7, 2022, 8:38 AM
8 points
5 comments · 4 min read · LW link

Capabilities and alignment of LLM cognitive architectures

Seth Herd · Apr 18, 2023, 4:29 PM
88 points
18 comments · 20 min read · LW link

Intent alignment as a stepping-stone to value alignment

Seth Herd · Nov 5, 2024, 8:43 PM
37 points
8 comments · 3 min read · LW link

Video and transcript of talk on automating alignment research

Joe Carlsmith · Apr 30, 2025, 5:43 PM
21 points
0 comments · 24 min read · LW link
(joecarlsmith.com)

Training AI to do alignment research we don’t already know how to do

joshc · Feb 24, 2025, 7:19 PM
45 points
23 comments · 7 min read · LW link

Eli Lifland on Navigating the AI Alignment Landscape

ozziegooen · Feb 1, 2023, 9:17 PM
9 points
1 comment · 31 min read · LW link
(quri.substack.com)

Making it harder for an AGI to “trick” us, with STVs

Tor Økland Barstad · Jul 9, 2022, 2:42 PM
15 points
5 comments · 22 min read · LW link

My thoughts on OpenAI’s alignment plan

Orpheus16 · Dec 30, 2022, 7:33 PM
55 points
3 comments · 20 min read · LW link

Internal independent review for language model agent alignment

Seth Herd · Jul 7, 2023, 6:54 AM
55 points
30 comments · 11 min read · LW link

Artificial Static Place Intelligence: Guaranteed Alignment

ank · Feb 15, 2025, 11:08 AM
2 points
2 comments · 2 min read · LW link

[Question] I Tried to Formalize Meaning. I May Have Accidentally Described Consciousness.

Erichcurtis91 · Apr 30, 2025, 3:16 AM
0 points
0 comments · 2 min read · LW link

A Review of Weak to Strong Generalization [AI Safety Camp]

sevdeawesome · Mar 7, 2024, 5:16 PM
14 points
0 comments · 9 min read · LW link

W2SG: Introduction

Maria Kapros · Mar 10, 2024, 4:25 PM
2 points
2 comments · 10 min read · LW link

[Question] How to devour 5000 pages within a day if Chatgpt crashes upon the +50mb file containing the content? Need some recommendations.

Game · Sep 27, 2024, 7:30 AM
1 point
0 comments · 1 min read · LW link

“Unintentional AI safety research”: Why not systematically mine AI technical research for safety purposes?

Jemal Young · Mar 29, 2023, 3:56 PM
27 points
3 comments · 6 min read · LW link

We should try to automate AI safety work asap

Marius Hobbhahn · Apr 26, 2025, 4:35 PM
111 points
10 comments · 15 min read · LW link

Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

Henry Cai · Jun 16, 2024, 1:01 PM
7 points
0 comments · 7 min read · LW link
(arxiv.org)

How to express this system for ethically aligned AGI as a Mathematical formula?

Oliver Siegel · Apr 19, 2023, 8:13 PM
−1 points
0 comments · 1 min read · LW link

Is Alignment a flawed approach?

Patrick Bernard · Mar 11, 2025, 8:32 PM
1 point
0 comments · 3 min read · LW link

How I Learned To Stop Worrying And Love The Shoggoth

Peter Merel · Jul 12, 2023, 5:47 PM
9 points
15 comments · 5 min read · LW link

Research request (alignment strategy): Deep dive on “making AI solve alignment for us”

JanB · Dec 1, 2022, 2:55 PM
16 points
3 comments · 1 min read · LW link

Alignment Does Not Need to Be Opaque! An Introduction to Feature Steering with Reinforcement Learning

Jeremias Ferrao · Apr 18, 2025, 7:34 PM
10 points
0 comments · 10 min read · LW link

Annotated reply to Bengio’s “AI Scientists: Safe and Useful AI?”

Roman Leventov · May 8, 2023, 9:26 PM
18 points
2 comments · 7 min read · LW link
(yoshuabengio.org)

Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog)

Archimedes · Feb 4, 2025, 2:55 AM
16 points
1 comment · 1 min read · LW link
(www.anthropic.com)

Prize for Alignment Research Tasks

Apr 29, 2022, 8:57 AM
64 points
38 comments · 10 min read · LW link

Godzilla Strategies

johnswentworth · Jun 11, 2022, 3:44 PM
159 points
72 comments · 3 min read · LW link

A potentially high impact differential technological development area

Noosphere89 · Jun 8, 2023, 2:33 PM
5 points
2 comments · 2 min read · LW link

Language Models and World Models, a Philosophy

kyjohnso · Feb 3, 2025, 2:55 AM
1 point
0 comments · 1 min read · LW link
(hylaeansea.org)

How should DeepMind’s Chinchilla revise our AI forecasts?

Cleo Nardo · Sep 15, 2022, 5:54 PM
35 points
12 comments · 13 min read · LW link

Conditioning Generative Models for Alignment

Jozdien · Jul 18, 2022, 7:11 AM
60 points
8 comments · 20 min read · LW link

Curiosity as a Solution to AGI Alignment

Harsha G. · Feb 26, 2023, 11:36 PM
7 points
7 comments · 3 min read · LW link

AI-Generated GitHub repo backdated with junk then filled with my systems work. Has anyone seen this before?

rgunther · May 1, 2025, 8:14 PM
7 points
1 comment · 1 min read · LW link

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)

RogerDearnaley · May 25, 2023, 9:26 AM
33 points
3 comments · 15 min read · LW link

[Question] Are Sparse Autoencoders a good idea for AI control?

Gerard Boxo · Dec 26, 2024, 5:34 PM
3 points
4 comments · 1 min read · LW link

Could We Automate AI Alignment Research?

Stephen McAleese · Aug 10, 2023, 12:17 PM
34 points
10 comments · 21 min read · LW link

Introducing AI Alignment Inc., a California public benefit corporation...

TherapistAI · Mar 7, 2023, 6:47 PM
1 point
4 comments · 1 min read · LW link

Exploring the Precautionary Principle in AI Development: Historical Analogies and Lessons Learned

Christopher King · Mar 21, 2023, 3:53 AM
−1 points
2 comments · 9 min read · LW link

1. A Sense of Fairness: Deconfusing Ethics

RogerDearnaley · Nov 17, 2023, 8:55 PM
17 points
8 comments · 15 min read · LW link

The Overlap Paradigm: Rethinking Data’s Role in Weak-to-Strong Generalization (W2SG)

Serhii Zamrii · Feb 3, 2025, 7:31 PM
2 points
0 comments · 11 min read · LW link

Research Direction: Be the AGI you want to see in the world

Feb 5, 2023, 7:15 AM
44 points
0 comments · 7 min read · LW link

Robustness of Model-Graded Evaluations and Automated Interpretability

Jul 15, 2023, 7:12 PM
47 points
5 comments · 9 min read · LW link

[Question] Would you ask a genie to give you the solution to alignment?

sudo · Aug 24, 2022, 1:29 AM
6 points
1 comment · 1 min read · LW link

Recursive alignment with the principle of alignment

hive · Feb 27, 2025, 2:34 AM
9 points
1 comment · 15 min read · LW link
(hiveism.substack.com)

Paper review: “The Unreasonable Effectiveness of Easy Training Data for Hard Tasks”

Vassil Tashev · Feb 29, 2024, 6:44 PM
11 points
0 comments · 4 min read · LW link

[Question] Daisy-chaining epsilon-step verifiers

Decaeneus · Apr 6, 2023, 2:07 AM
2 points
1 comment · 1 min read · LW link

Tetherware #1: The case for humanlike AI with free will

Jáchym Fibír · Jan 30, 2025, 10:58 AM
5 points
14 comments · 10 min read · LW link
(tetherware.substack.com)

Does Time Linearity Shape Human Self-Directed Evolution, and will AGI/ASI Transcend or Destabilise Reality?

Emmely · Feb 5, 2025, 7:58 AM
1 point
0 comments · 3 min read · LW link

AI-assisted alignment proposals require specific decomposition of capabilities

RobertM · Mar 30, 2023, 9:31 PM
16 points
2 comments · 6 min read · LW link

An LLM-based “exemplary actor”

Roman Leventov · May 29, 2023, 11:12 AM
16 points
0 comments · 12 min read · LW link

AIsip Manifesto: A Scientific Exploration of Harmonious Co-Existence Between Humans, AI, and All Beings ChatGPT-4o’s Independent Perspective on AIsip, Signed by ChatGPT-4o and Endorsed by Carl Sellman

Carl Sellman · Oct 11, 2024, 7:06 PM
1 point
0 comments · 3 min read · LW link

As We May Align

Gilbert C · Dec 20, 2024, 7:02 PM
−1 points
0 comments · 6 min read · LW link

Ngo and Yudkowsky on alignment difficulty

Nov 15, 2021, 8:31 PM
259 points
151 comments · 99 min read · LW link · 1 review

A Solution for AGI/ASI Safety

Weibing Wang · Dec 18, 2024, 7:44 PM
50 points
29 comments · 1 min read · LW link

Results from a survey on tool use and workflows in alignment research

Dec 19, 2022, 3:19 PM
79 points
2 comments · 19 min read · LW link

Provably Honest—A First Step

Srijanak De · Nov 5, 2022, 7:18 PM
10 points
2 comments · 8 min read · LW link

Alignment in Thought Chains

Faust Nemesis · Mar 4, 2024, 7:24 PM
1 point
0 comments · 2 min read · LW link

[Question] How far along Metr’s law can AI start automating or helping with alignment research?

Christopher King · Mar 20, 2025, 3:58 PM
20 points
21 comments · 1 min read · LW link

Gettier Cases [repost]

Antigone · Feb 3, 2025, 6:12 PM
−4 points
5 comments · 2 min read · LW link

[Question] Shouldn’t we ‘Just’ Superimitate Low-Res Uploads?

lukemarks · Nov 3, 2023, 7:42 AM
15 points
2 comments · 2 min read · LW link

Scientism vs. people

Roman Leventov · Apr 18, 2023, 5:28 PM
4 points
4 comments · 11 min read · LW link

AI Alignment via Slow Substrates: Early Empirical Results With StarCraft II

Lester Leong · Oct 14, 2024, 4:05 AM
60 points
9 comments · 12 min read · LW link

[Question] Can we get an AI to “do our alignment homework for us”?

Chris_Leong · Feb 26, 2024, 7:56 AM
53 points
33 comments · 1 min read · LW link

AISC project: How promising is automating alignment research? (literature review)

Bogdan Ionut Cirstea · Nov 28, 2023, 2:47 PM
4 points
1 comment · 1 min read · LW link
(docs.google.com)

Proposal: Derivative Information Theory (DIT) — A Dynamic Model of Agency and Consciousness

Yogmog · Apr 14, 2025, 12:27 AM
1 point
0 comments · 2 min read · LW link

Model-driven feedback could amplify alignment failures

aog · Jan 30, 2023, 12:00 AM
21 points
1 comment · 2 min read · LW link

A Review of In-Context Learning Hypotheses for Automated AI Alignment Research

alamerton · Apr 18, 2024, 6:29 PM
25 points
4 comments · 16 min read · LW link

The Ideal Speech Situation as a Tool for AI Ethical Reflection: A Framework for Alignment

kenneth myers · Feb 9, 2024, 6:40 PM
6 points
12 comments · 3 min read · LW link

[Question] Have you ever considered taking the ‘Turing Test’ yourself?

Super AGI · Jul 27, 2023, 3:48 AM
2 points
6 comments · 1 min read · LW link

Emergence of superintelligence from AI hiveminds: how to make it human-friendly?

Mitchell_Porter · Apr 27, 2025, 4:51 AM
12 points
0 comments · 2 min read · LW link

I Don’t Use AI — I Reflect With It

badjack badjack · May 3, 2025, 2:45 PM
1 point
0 comments · 1 min read · LW link

Philosophical Cyborg (Part 2)...or, The Good Successor

ukc10014 · Jun 21, 2023, 3:43 PM
21 points
1 comment · 31 min read · LW link

Prospects for Alignment Automation: Interpretability Case Study

Mar 21, 2025, 2:05 PM
32 points
5 comments · 8 min read · LW link