
AI-assisted / AI-automated Alignment

Last edit: 1 Jan 2023 22:04 UTC by Noosphere89

Not obviously the best name for this tag, but maybe good to explore/rename. Wiki-tags are publicly editable!

Beliefs and Disagreements about Automating Alignment Research

Ian McKenzie · 24 Aug 2022 18:37 UTC
98 points
4 comments · 7 min read · LW link

Cyborgism

10 Feb 2023 14:47 UTC
294 points
41 comments · 35 min read · LW link

Discussion on utilizing AI for alignment

elifland · 23 Aug 2022 2:36 UTC
16 points
3 comments · 1 min read · LW link
(www.foxy-scout.com)

Godzilla Strategies

johnswentworth · 11 Jun 2022 15:44 UTC
139 points
64 comments · 3 min read · LW link

Sufficiently many Godzillas as an alignment strategy

142857 · 28 Aug 2022 0:08 UTC
8 points
3 comments · 1 min read · LW link

AI-assisted list of ten concrete alignment things to do right now

lukehmiles · 7 Sep 2022 8:38 UTC
8 points
5 comments · 4 min read · LW link

Infinite Possibility Space and the Shutdown Problem

magfrump · 18 Oct 2022 5:37 UTC
6 points
0 comments · 2 min read · LW link
(www.magfrump.net)

Getting from an unaligned AGI to an aligned AGI?

Tor Økland Barstad · 21 Jun 2022 12:36 UTC
11 points
7 comments · 9 min read · LW link

Making it harder for an AGI to “trick” us, with STVs

Tor Økland Barstad · 9 Jul 2022 14:42 UTC
14 points
5 comments · 22 min read · LW link

Alignment with argument-networks and assessment-predictions

Tor Økland Barstad · 13 Dec 2022 2:17 UTC
7 points
5 comments · 45 min read · LW link

A survey of tool use and workflows in alignment research

23 Mar 2022 23:44 UTC
44 points
5 comments · 1 min read · LW link

My thoughts on OpenAI’s alignment plan

Akash · 30 Dec 2022 19:33 UTC
54 points
2 comments · 20 min read · LW link

[Linkpost] Jan Leike on three kinds of alignment taxes

Akash · 6 Jan 2023 23:57 UTC
27 points
2 comments · 3 min read · LW link
(aligned.substack.com)

[Question] What specific thing would you do with AI Alignment Research Assistant GPT?

quetzal_rainbow · 8 Jan 2023 19:24 UTC
45 points
9 comments · 1 min read · LW link

Reflections on Deception & Generality in Scalable Oversight (Another OpenAI Alignment Review)

Shoshannah Tekofsky · 28 Jan 2023 5:26 UTC
52 points
6 comments · 7 min read · LW link

Model-driven feedback could amplify alignment failures

aogara · 30 Jan 2023 0:00 UTC
17 points
1 comment · 2 min read · LW link

Eli Lifland on Navigating the AI Alignment Landscape

ozziegooen · 1 Feb 2023 21:17 UTC
9 points
1 comment · 31 min read · LW link
(quri.substack.com)

Cyborg Periods: There will be multiple AI transitions

22 Feb 2023 16:09 UTC
78 points
8 comments · 6 min read · LW link

We have to Upgrade

Jed McCaleb · 23 Mar 2023 17:53 UTC
54 points
9 comments · 2 min read · LW link

[Question] Would you ask a genie to give you the solution to alignment?

sudo -i · 24 Aug 2022 1:29 UTC
6 points
1 comment · 1 min read · LW link

Prize for Alignment Research Tasks

29 Apr 2022 8:57 UTC
63 points
36 comments · 10 min read · LW link

Ngo and Yudkowsky on alignment difficulty

15 Nov 2021 20:31 UTC
243 points
143 comments · 99 min read · LW link · 1 review

How should DeepMind’s Chinchilla revise our AI forecasts?

Cleo Nardo · 15 Sep 2022 17:54 UTC
35 points
12 comments · 13 min read · LW link

Conditioning Generative Models for Alignment

Jozdien · 18 Jul 2022 7:11 UTC
52 points
8 comments · 20 min read · LW link

Provably Honest - A First Step

Srijanak De · 5 Nov 2022 19:18 UTC
10 points
2 comments · 8 min read · LW link

Research request (alignment strategy): Deep dive on “making AI solve alignment for us”

JanBrauner · 1 Dec 2022 14:55 UTC
16 points
3 comments · 1 min read · LW link

[Link] A minimal viable product for alignment

janleike · 6 Apr 2022 15:38 UTC
51 points
38 comments · 1 min read · LW link

Results from a survey on tool use and workflows in alignment research

19 Dec 2022 15:19 UTC
69 points
2 comments · 19 min read · LW link

[Link] Why I’m optimistic about OpenAI’s alignment approach

janleike · 5 Dec 2022 22:51 UTC
96 points
13 comments · 1 min read · LW link
(aligned.substack.com)

Human Mimicry Mainly Works When We’re Already Close

johnswentworth · 17 Aug 2022 18:41 UTC
70 points
16 comments · 5 min read · LW link

Research Direction: Be the AGI you want to see in the world

5 Feb 2023 7:15 UTC
43 points
0 comments · 7 min read · LW link

Curiosity as a Solution to AGI Alignment

Harsha G. · 26 Feb 2023 23:36 UTC
8 points
7 comments · 3 min read · LW link

Introducing AI Alignment Inc., a California public benefit corporation...

TherapistAI · 7 Mar 2023 18:47 UTC
1 point
4 comments · 1 min read · LW link

Project “MIRI as a Service”

RomanS · 8 Mar 2023 19:22 UTC
40 points
4 comments · 1 min read · LW link

Exploring the Precautionary Principle in AI Development: Historical Analogies and Lessons Learned

Christopher King · 21 Mar 2023 3:53 UTC
−1 points
1 comment · 9 min read · LW link