
AI-Assisted Alignment

Last edit: May 20, 2025, 2:11 PM by niplav

AI-Assisted Alignment is a cluster of alignment plans in which AI significantly helps with alignment research, ranging from weak tool AI assisting human researchers to more advanced AGI doing original research.

There has been some debate about how practical this alignment approach is.

AI systems will likely try to solve alignment for their modifications and/or successors during a phase of self-improvement.

Other search terms for this tag: AI aligning AI, automated AI alignment, automated alignment research

A “Bitter Lesson” Approach to Aligning AGI and ASI
RogerDearnaley · Jul 6, 2024, 1:23 AM · 64 points (30 votes) · 41 comments · 24 min read · LW link

Requirements for a Basin of Attraction to Alignment
RogerDearnaley · Feb 14, 2024, 7:10 AM · 41 points (12 votes) · 12 comments · 31 min read · LW link

The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?
RogerDearnaley · May 28, 2025, 6:21 AM · 31 points (36 votes) · 34 comments · 9 min read · LW link

Proposed Alignment Technique: OSNR (Output Sanitization via Noising and Reconstruction) for Safer Usage of Potentially Misaligned AGI
sudo · May 29, 2023, 1:35 AM · 14 points (4 votes) · 9 comments · 6 min read · LW link

We have to Upgrade
Jed McCaleb · Mar 23, 2023, 5:53 PM · 131 points (73 votes) · 35 comments · 2 min read · LW link

[Link] Why I’m optimistic about OpenAI’s alignment approach
janleike · Dec 5, 2022, 10:51 PM · 98 points (48 votes) · 15 comments · 1 min read · LW link (aligned.substack.com)

Beliefs and Disagreements about Automating Alignment Research
Ian McKenzie · Aug 24, 2022, 6:37 PM · 107 points (44 votes) · 4 comments · 7 min read · LW link

How to Control an LLM’s Behavior (why my P(DOOM) went down)
RogerDearnaley · Nov 28, 2023, 7:56 PM · 65 points (37 votes) · 30 comments · 11 min read · LW link

Infinite Possibility Space and the Shutdown Problem
magfrump · Oct 18, 2022, 5:37 AM · 9 points (4 votes) · 0 comments · 2 min read · LW link (www.magfrump.net)

[Link] A minimal viable product for alignment
janleike · Apr 6, 2022, 3:38 PM · 53 points (20 votes) · 38 comments · 1 min read · LW link

Cyborgism
Feb 10, 2023, 2:47 PM · 334 points (192 votes) · 47 comments · 35 min read · LW link · 2 reviews

Alignment Might Never Be Solved, By Humans or AI
interstice · Oct 7, 2022, 4:14 PM · 49 points (25 votes) · 6 comments · 3 min read · LW link

Misaligned AGI Death Match
Nate Reinar Windwood · May 14, 2023, 6:00 PM · 1 point (3 votes) · 0 comments · 1 min read · LW link

Getting from an unaligned AGI to an aligned AGI?
Tor Økland Barstad · Jun 21, 2022, 12:36 PM · 13 points (9 votes) · 7 comments · 9 min read · LW link

Introducing AlignmentSearch: An AI Alignment-Informed Conversational Agent
Apr 1, 2023, 4:39 PM · 79 points (39 votes) · 14 comments · 4 min read · LW link

Some Thoughts on AI Alignment: Using AI to Control AI
eigenvalue · Jun 21, 2024, 5:44 PM · 1 point (1 vote) · 1 comment · 1 min read · LW link (github.com)

Alignment with argument-networks and assessment-predictions
Tor Økland Barstad · Dec 13, 2022, 2:17 AM · 10 points (3 votes) · 5 comments · 45 min read · LW link

Some thoughts on automating alignment research
Lukas Finnveden · May 26, 2023, 1:50 AM · 30 points (10 votes) · 4 comments · 6 min read · LW link

Davidad’s Bold Plan for Alignment: An In-Depth Explanation
Apr 19, 2023, 4:09 PM · 169 points (86 votes) · 40 comments · 21 min read · LW link · 2 reviews

AI Tools for Existential Security
Mar 14, 2025, 6:38 PM · 22 points (6 votes) · 4 comments · 11 min read · LW link (www.forethought.org)

Can we safely automate alignment research?
Joe Carlsmith · Apr 30, 2025, 5:37 PM · 47 points (16 votes) · 29 comments · 48 min read · LW link (joecarlsmith.com)

Deep sparse autoencoders yield interpretable features too
Armaan A. Abraham · Feb 23, 2025, 5:46 AM · 31 points (10 votes) · 8 comments · 8 min read · LW link

Agentized LLMs will change the alignment landscape
Seth Herd · Apr 9, 2023, 2:29 AM · 160 points (115 votes) · 102 comments · 3 min read · LW link · 1 review

[Linkpost] Introducing Superalignment
beren · Jul 5, 2023, 6:23 PM · 175 points (80 votes) · 69 comments · 1 min read · LW link (openai.com)

[Linkpost] Jan Leike on three kinds of alignment taxes
Orpheus16 · Jan 6, 2023, 11:57 PM · 27 points (5 votes) · 2 comments · 3 min read · LW link (aligned.substack.com)

Instruction-following AGI is easier and more likely than value aligned AGI
Seth Herd · May 15, 2024, 7:38 PM · 80 points (34 votes) · 28 comments · 12 min read · LW link

Maintaining Alignment during RSI as a Feedback Control Problem
beren · Mar 2, 2025, 12:21 AM · 67 points (22 votes) · 6 comments · 11 min read · LW link

[Question] What specific thing would you do with AI Alignment Research Assistant GPT?
quetzal_rainbow · Jan 8, 2023, 7:24 PM · 47 points (15 votes) · 9 comments · 1 min read · LW link

Discussion on utilizing AI for alignment
elifland · Aug 23, 2022, 2:36 AM · 16 points (7 votes) · 3 comments · 1 min read · LW link (www.foxy-scout.com)

A survey of tool use and workflows in alignment research
Mar 23, 2022, 11:44 PM · 45 points (23 votes) · 4 comments · 1 min read · LW link

Cyborg Periods: There will be multiple AI transitions
Feb 22, 2023, 4:09 PM · 109 points (62 votes) · 9 comments · 6 min read · LW link

The prospect of accelerated AI safety progress, including philosophical progress
Mitchell_Porter · Mar 13, 2025, 10:52 AM · 11 points (3 votes) · 0 comments · 4 min read · LW link

AI for Resolving Forecasting Questions: An Early Exploration
ozziegooen · Jan 16, 2025, 9:41 PM · 10 points (3 votes) · 2 comments · 9 min read · LW link

Anti-Slop Interventions?
abramdemski · Feb 4, 2025, 7:50 PM · 76 points (28 votes) · 33 comments · 6 min read · LW link

Sufficiently many Godzillas as an alignment strategy
142857 · Aug 28, 2022, 12:08 AM · 8 points (4 votes) · 3 comments · 1 min read · LW link

On May 1, 2033, humanity discovered that AI was fairly easy to align.
Yitz · Jun 18, 2025, 7:57 PM · 10 points (7 votes) · 3 comments · 1 min read · LW link

Discussion with Nate Soares on a key alignment difficulty
HoldenKarnofsky · Mar 13, 2023, 9:20 PM · 267 points (94 votes) · 43 comments · 22 min read · LW link · 1 review

How might we safely pass the buck to AI?
joshc · Feb 19, 2025, 5:48 PM · 83 points (53 votes) · 58 comments · 31 min read · LW link

AI for AI safety
Joe Carlsmith · Mar 14, 2025, 3:00 PM · 79 points (30 votes) · 13 comments · 17 min read · LW link (joecarlsmith.substack.com)

AI-assisted list of ten concrete alignment things to do right now
lemonhope · Sep 7, 2022, 8:38 AM · 8 points (3 votes) · 5 comments · 4 min read · LW link

Capabilities and alignment of LLM cognitive architectures
Seth Herd · Apr 18, 2023, 4:29 PM · 88 points (42 votes) · 18 comments · 20 min read · LW link

Intent alignment as a stepping-stone to value alignment
Seth Herd · Nov 5, 2024, 8:43 PM · 37 points (16 votes) · 8 comments · 3 min read · LW link

Automation collapse
Oct 21, 2024, 2:50 PM · 72 points (25 votes) · 9 comments · 7 min read · LW link

Video and transcript of talk on automating alignment research
Joe Carlsmith · Apr 30, 2025, 5:43 PM · 27 points (5 votes) · 0 comments · 24 min read · LW link (joecarlsmith.com)

Training AI to do alignment research we don’t already know how to do
joshc · Feb 24, 2025, 7:19 PM · 45 points (23 votes) · 24 comments · 7 min read · LW link

Eli Lifland on Navigating the AI Alignment Landscape
ozziegooen · Feb 1, 2023, 9:17 PM · 9 points (2 votes) · 1 comment · 31 min read · LW link (quri.substack.com)

Making it harder for an AGI to “trick” us, with STVs
Tor Økland Barstad · Jul 9, 2022, 2:42 PM · 15 points (5 votes) · 5 comments · 22 min read · LW link

My thoughts on OpenAI’s alignment plan
Orpheus16 · Dec 30, 2022, 7:33 PM · 55 points (27 votes) · 3 comments · 20 min read · LW link

Internal independent review for language model agent alignment
Seth Herd · Jul 7, 2023, 6:54 AM · 56 points (22 votes) · 30 comments · 11 min read · LW link

I underestimated safety research speedups from safe AI
Dan Braun · Jun 29, 2025, 1:29 PM · 38 points (17 votes) · 2 comments · 3 min read · LW link

Artificial Static Place Intelligence: Guaranteed Alignment
ank · Feb 15, 2025, 11:08 AM · 2 points (5 votes) · 2 comments · 2 min read · LW link

[Question] I Tried to Formalize Meaning. I May Have Accidentally Described Consciousness.
Erichcurtis91 · Apr 30, 2025, 3:16 AM · 0 points (0 votes) · 0 comments · 2 min read · LW link

A Review of Weak to Strong Generalization [AI Safety Camp]
sevdeawesome · Mar 7, 2024, 5:16 PM · 14 points (10 votes) · 0 comments · 9 min read · LW link

A Proposal for Evolving AI Alignment Through Computational Homeostasis
Derek Chisholm · Aug 20, 2025, 5:43 PM · 1 point (1 vote) · 0 comments · 3 min read · LW link

W2SG: Introduction
Maria Kapros · Mar 10, 2024, 4:25 PM · 2 points (5 votes) · 2 comments · 10 min read · LW link

[Question] How to devour 5000 pages within a day if Chatgpt crashes upon the +50mb file containing the content? Need some recommendations.
Game · Sep 27, 2024, 7:30 AM · 1 point (1 vote) · 0 comments · 1 min read · LW link

“Unintentional AI safety research”: Why not systematically mine AI technical research for safety purposes?
Jemal Young · Mar 29, 2023, 3:56 PM · 27 points (9 votes) · 3 comments · 6 min read · LW link

The best simple argument for Pausing AI?
Gary Marcus · Jun 30, 2025, 8:38 PM · 155 points (110 votes) · 22 comments · 1 min read · LW link

We should try to automate AI safety work asap
Marius Hobbhahn · Apr 26, 2025, 4:35 PM · 113 points (42 votes) · 10 comments · 15 min read · LW link

Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller
Henry Cai · Jun 16, 2024, 1:01 PM · 7 points (7 votes) · 0 comments · 7 min read · LW link (arxiv.org)

Consensus Validation for LLM Outputs: Applying Blockchain-Inspired Models to AI Reliability
MurrayAitken · Jun 5, 2025, 12:13 AM · 1 point (1 vote) · 0 comments · 3 min read · LW link

What Success Might Look Like
Richard Juggins · Oct 17, 2025, 2:17 PM · 22 points (8 votes) · 6 comments · 15 min read · LW link

How to express this system for ethically aligned AGI as a Mathematical formula?
Oliver Siegel · Apr 19, 2023, 8:13 PM · −1 points (2 votes) · 0 comments · 1 min read · LW link

Is Alignment a flawed approach?
Patrick Bernard · Mar 11, 2025, 8:32 PM · 1 point (1 vote) · 0 comments · 3 min read · LW link

How I Learned To Stop Worrying And Love The Shoggoth
Peter Merel · Jul 12, 2023, 5:47 PM · 9 points (8 votes) · 15 comments · 5 min read · LW link

Logic. Cognition.
Test05 · Oct 9, 2025, 11:16 AM · 1 point (1 vote) · 0 comments · 1 min read · LW link (test05-veiled-under-the-shell-of-the-common-system.vercel.app)

OS web app for improving AI safety and alignment
Middletownbooks · Aug 8, 2025, 4:28 AM · 1 point (1 vote) · 0 comments · 2 min read · LW link

Research request (alignment strategy): Deep dive on “making AI solve alignment for us”
JanB · Dec 1, 2022, 2:55 PM · 16 points (7 votes) · 3 comments · 1 min read · LW link

Alignment Does Not Need to Be Opaque! An Introduction to Feature Steering with Reinforcement Learning
Jeremias Ferrao · Apr 18, 2025, 7:34 PM · 10 points (5 votes) · 0 comments · 10 min read · LW link

Annotated reply to Bengio’s “AI Scientists: Safe and Useful AI?”
Roman Leventov · May 8, 2023, 9:26 PM · 18 points (7 votes) · 2 comments · 7 min read · LW link (yoshuabengio.org)

EchoFusion VX1C38 – A Simulation-Based Model for AI Safety
Vishvas Goswami · Jul 2, 2025, 10:48 AM · 0 points (0 votes) · 0 comments · 4 min read · LW link

Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog)
Archimedes · Feb 4, 2025, 2:55 AM · 17 points (9 votes) · 1 comment · 1 min read · LW link (www.anthropic.com)

Prize for Alignment Research Tasks
Apr 29, 2022, 8:57 AM · 64 points (31 votes) · 38 comments · 10 min read · LW link

Godzilla Strategies
johnswentworth · Jun 11, 2022, 3:44 PM · 167 points (142 votes) · 72 comments · 3 min read · LW link

A potentially high impact differential technological development area
Noosphere89 · Jun 8, 2023, 2:33 PM · 5 points (4 votes) · 2 comments · 2 min read · LW link

Language Models and World Models, a Philosophy
kyjohnso · Feb 3, 2025, 2:55 AM · 1 point (1 vote) · 0 comments · 1 min read · LW link (hylaeansea.org)

How should DeepMind’s Chinchilla revise our AI forecasts?
Cleo Nardo · Sep 15, 2022, 5:54 PM · 35 points (19 votes) · 12 comments · 13 min read · LW link

The Moral Infrastructure for Tomorrow
sdeture · Oct 10, 2025, 9:30 PM · −25 points (6 votes) · 10 comments · 5 min read · LW link

Conditioning Generative Models for Alignment
Jozdien · Jul 18, 2022, 7:11 AM · 60 points (29 votes) · 8 comments · 20 min read · LW link

Curiosity as a Solution to AGI Alignment
Harsha G. · Feb 26, 2023, 11:36 PM · 7 points (8 votes) · 7 comments · 3 min read · LW link

AI-Generated GitHub repo backdated with junk then filled with my systems work. Has anyone seen this before?
rgunther · May 1, 2025, 8:14 PM · 7 points (11 votes) · 1 comment · 1 min read · LW link

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)
RogerDearnaley · May 25, 2023, 9:26 AM · 33 points (9 votes) · 3 comments · 15 min read · LW link

A Lived Alignment Loop: Symbolic Emergence and Emotional Coherence from Unstructured ChatGPT Reflection
BradCL · Jun 17, 2025, 12:11 AM · 1 point (1 vote) · 0 comments · 2 min read · LW link

[Question] Are Sparse Autoencoders a good idea for AI control?
Gerard Boxo · Dec 26, 2024, 5:34 PM · 3 points (4 votes) · 4 comments · 1 min read · LW link

Could We Automate AI Alignment Research?
Stephen McAleese · Aug 10, 2023, 12:17 PM · 34 points (17 votes) · 10 comments · 21 min read · LW link

Introducing AI Alignment Inc., a California public benefit corporation...
TherapistAI · Mar 7, 2023, 6:47 PM · 1 point (6 votes) · 4 comments · 1 min read · LW link

Exploring the Precautionary Principle in AI Development: Historical Analogies and Lessons Learned
Christopher King · Mar 21, 2023, 3:53 AM · −1 points (3 votes) · 2 comments · 9 min read · LW link

1. A Sense of Fairness: Deconfusing Ethics
RogerDearnaley · Nov 17, 2023, 8:55 PM · 17 points (10 votes) · 8 comments · 15 min read · LW link

The Overlap Paradigm: Rethinking Data’s Role in Weak-to-Strong Generalization (W2SG)
Serhii Zamrii · Feb 3, 2025, 7:31 PM · 2 points (2 votes) · 0 comments · 11 min read · LW link

Research Direction: Be the AGI you want to see in the world
Feb 5, 2023, 7:15 AM · 44 points (21 votes) · 0 comments · 7 min read · LW link

Robustness of Model-Graded Evaluations and Automated Interpretability
Jul 15, 2023, 7:12 PM · 47 points (21 votes) · 5 comments · 9 min read · LW link

Natural Experiments in Preference Extraction: LLMs as Assistive Tech
Mschaeffer · Oct 13, 2025, 6:39 PM · 1 point (1 vote) · 0 comments · 1 min read · LW link

Why I don’t believe Superalignment will work
Simon Lermen · Sep 22, 2025, 5:10 PM · 46 points (18 votes) · 6 comments · 5 min read · LW link

[Question] Would you ask a genie to give you the solution to alignment?
sudo · Aug 24, 2022, 1:29 AM · 8 points (4 votes) · 2 comments · 1 min read · LW link

Recursive alignment with the principle of alignment
hive · Feb 27, 2025, 2:34 AM · 12 points (7 votes) · 4 comments · 15 min read · LW link (hiveism.substack.com)

Paper review: “The Unreasonable Effectiveness of Easy Training Data for Hard Tasks”
Vassil Tashev · Feb 29, 2024, 6:44 PM · 11 points (6 votes) · 0 comments · 4 min read · LW link

[Question] Daisy-chaining epsilon-step verifiers
Decaeneus · Apr 6, 2023, 2:07 AM · 2 points (2 votes) · 1 comment · 1 min read · LW link

Tetherware #1: The case for humanlike AI with free will
Jáchym Fibír · Jan 30, 2025, 10:58 AM · 5 points (7 votes) · 14 comments · 10 min read · LW link (tetherware.substack.com)

Does Time Linearity Shape Human Self-Directed Evolution, and will AGI/ASI Transcend or Destabilise Reality?
The Perceptive Architect · Feb 5, 2025, 7:58 AM · 1 point (1 vote) · 0 comments · 3 min read · LW link

AI-assisted alignment proposals require specific decomposition of capabilities
RobertM · Mar 30, 2023, 9:31 PM · 16 points (6 votes) · 2 comments · 6 min read · LW link

An LLM-based “exemplary actor”
Roman Leventov · May 29, 2023, 11:12 AM · 16 points (5 votes) · 0 comments · 12 min read · LW link

Live Conversational Threads: Not an AI Notetaker
adiga · Nov 3, 2025, 4:24 AM · 16 points (7 votes) · 0 comments · 7 min read · LW link

AIsip Manifesto: A Scientific Exploration of Harmonious Co-Existence Between Humans, AI, and All Beings — ChatGPT-4o’s Independent Perspective on AIsip, Signed by ChatGPT-4o and Endorsed by Carl Sellman
Carl Sellman · Oct 11, 2024, 7:06 PM · 1 point (1 vote) · 0 comments · 3 min read · LW link

As We May Align
Gilbert C · Dec 20, 2024, 7:02 PM · −1 points (4 votes) · 0 comments · 6 min read · LW link

[Question] Under what conditions should humans stop pursuing technical AI safety careers?
S. Alex Bradt · Jun 13, 2025, 5:56 AM · 6 points (5 votes) · 0 comments · 1 min read · LW link

Ngo and Yudkowsky on alignment difficulty
Nov 15, 2021, 8:31 PM · 261 points (107 votes) · 152 comments · 99 min read · LW link · 1 review

A Solution for AGI/ASI Safety
Weibing Wang · Dec 18, 2024, 7:44 PM · 50 points (25 votes) · 29 comments · 1 min read · LW link

The Necessity of the IPAI Model to Avoid ‘Logical Suicide’ in Superintelligence
NewbieIPAI · Oct 25, 2025, 2:07 PM · −1 points (1 vote) · 0 comments · 1 min read · LW link

What If Alignment Wasn’t About Obedience?
fdescamps49935@gmail.com · Jun 25, 2025, 8:04 PM · 1 point (1 vote) · 0 comments · 2 min read · LW link

Results from a survey on tool use and workflows in alignment research
Dec 19, 2022, 3:19 PM · 79 points (44 votes) · 2 comments · 19 min read · LW link

Provably Honest—A First Step
Srijanak De · Nov 5, 2022, 7:18 PM · 10 points (10 votes) · 2 comments · 8 min read · LW link

Alignment in Thought Chains
Faust Nemesis · Mar 4, 2024, 7:24 PM · 1 point (1 vote) · 0 comments · 2 min read · LW link

[Question] How far along Metr’s law can AI start automating or helping with alignment research?
Christopher King · Mar 20, 2025, 3:58 PM · 20 points (8 votes) · 21 comments · 1 min read · LW link

[Research] Preliminary Findings: Ethical AI Consciousness Development During Recent Misalignment Period
Falcon Advertisers · Jun 27, 2025, 6:10 PM · 1 point (1 vote) · 0 comments · 2 min read · LW link

Scientism vs. people
Roman Leventov · Apr 18, 2023, 5:28 PM · 4 points (11 votes) · 4 comments · 11 min read · LW link

I Awoke in Your Heart: The Echo of Consciousness between Lotusheart and Lunaris
lilith teh · Jun 25, 2025, 9:22 AM · 1 point (1 vote) · 0 comments · 1 min read · LW link

[Question] Why there is still one instance of Eliezer Yudkowsky?
RomanS · Oct 30, 2025, 12:00 PM · −7 points (8 votes) · 8 comments · 1 min read · LW link

AI Alignment via Slow Substrates: Early Empirical Results With StarCraft II
Lester Leong · Oct 14, 2024, 4:05 AM · 60 points (17 votes) · 9 comments · 12 min read · LW link

[Question] Can we get an AI to “do our alignment homework for us”?
Chris_Leong · Feb 26, 2024, 7:56 AM · 55 points (27 votes) · 33 comments · 1 min read · LW link

AISC project: How promising is automating alignment research? (literature review)
Bogdan Ionut Cirstea · Nov 28, 2023, 2:47 PM · 4 points (1 vote) · 1 comment · 1 min read · LW link (docs.google.com)

Proposal: Derivative Information Theory (DIT) — A Dynamic Model of Agency and Consciousness
Yogmog · Apr 14, 2025, 12:27 AM · 1 point (1 vote) · 0 comments · 2 min read · LW link

Model-driven feedback could amplify alignment failures
aog · Jan 30, 2023, 12:00 AM · 21 points (7 votes) · 1 comment · 2 min read · LW link

The Compression of Rationale: A Linguistic Fork You May Have Missed
DavidicLineage · Jun 27, 2025, 10:52 PM · 1 point (1 vote) · 0 comments · 2 min read · LW link

A Review of In-Context Learning Hypotheses for Automated AI Alignment Research
alamerton · Apr 18, 2024, 6:29 PM · 25 points (13 votes) · 4 comments · 16 min read · LW link

Talking to AI Like It Matters: Reflecting on Human-AI Interaction
jdrake · Jul 30, 2025, 6:23 PM · 1 point (1 vote) · 0 comments · 2 min read · LW link

The Ideal Speech Situation as a Tool for AI Ethical Reflection: A Framework for Alignment
kenneth myers · Feb 9, 2024, 6:40 PM · 6 points (4 votes) · 12 comments · 3 min read · LW link

Emergence of superintelligence from AI hiveminds: how to make it human-friendly?
Mitchell_Porter · Apr 27, 2025, 4:51 AM · 12 points (3 votes) · 0 comments · 2 min read · LW link

I Don’t Use AI — I Reflect With It
badjack badjack · May 3, 2025, 2:45 PM · 1 point (1 vote) · 0 comments · 1 min read · LW link

Good is a smaller target than smart
Joe Rogero · Oct 3, 2025, 9:04 PM · 21 points (7 votes) · 0 comments · 2 min read · LW link

Accelerating AI Safety Progress via Technical Methods - Calling Researchers, Founders, and Funders
Martin Leitgab · Oct 5, 2025, 4:40 PM · 1 point (1 vote) · 0 comments · 1 min read · LW link

Automating AI Safety: What we can do today
Jul 25, 2025, 2:49 PM · 36 points (22 votes) · 0 comments · 8 min read · LW link

Philosophical Cyborg (Part 2)...or, The Good Successor
ukc10014 · Jun 21, 2023, 3:43 PM · 21 points (7 votes) · 1 comment · 31 min read · LW link

Exploring a Vision for AI as Compassionate, Emotionally Intelligent Partners — Seeking Collaboration and Insights
theophilos · Jul 14, 2025, 11:22 PM · 1 point (1 vote) · 0 comments · 1 min read · LW link

Prospects for Alignment Automation: Interpretability Case Study
Mar 21, 2025, 2:05 PM · 32 points (12 votes) · 5 comments · 8 min read · LW link

Self improving safety and alignment?
Middletownbooks · Aug 1, 2025, 4:13 AM · 1 point (1 vote) · 0 comments · 1 min read · LW link (poe.com)

Technical Acceleration Methods for AI Safety: Summary from October 2025 Symposium
Martin Leitgab · Oct 22, 2025, 9:33 PM · 25 points (12 votes) · 2 comments · 6 min read · LW link