
Aligned AI Proposals

Last edit: 4 Jan 2025 22:05 UTC by Dakara

Aligned AI Proposals are proposals for ensuring that artificial intelligence systems behave in accordance with human intentions (intent alignment) or human values (value alignment).

The main goal of these proposals is to ensure that AI systems will, all things considered, benefit humanity.

The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?

RogerDearnaley · 28 May 2025 6:21 UTC
31 points
34 comments · 9 min read · LW link

Why Aligning an LLM is Hard, and How to Make it Easier

RogerDearnaley · 23 Jan 2025 6:44 UTC
34 points
3 comments · 4 min read · LW link

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley · 6 Jul 2024 1:23 UTC
64 points
41 comments · 24 min read · LW link

Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis

RogerDearnaley · 1 Feb 2024 21:15 UTC
16 points
15 comments · 13 min read · LW link

Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor

RogerDearnaley · 9 Jan 2024 20:42 UTC
48 points
8 comments · 36 min read · LW link

Requirements for a Basin of Attraction to Alignment

RogerDearnaley · 14 Feb 2024 7:10 UTC
41 points
12 comments · 31 min read · LW link

Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?

RogerDearnaley · 11 Jan 2024 12:56 UTC
35 points
4 comments · 39 min read · LW link

Striking Implications for Learning Theory, Interpretability — and Safety?

RogerDearnaley · 5 Jan 2024 8:46 UTC
37 points
4 comments · 2 min read · LW link

How to Control an LLM’s Behavior (why my P(DOOM) went down)

RogerDearnaley · 28 Nov 2023 19:56 UTC
65 points
30 comments · 11 min read · LW link

Interpreting the Learning of Deceit

RogerDearnaley · 18 Dec 2023 8:12 UTC
30 points
14 comments · 9 min read · LW link

Safety First: safety before full alignment. The deontic sufficiency hypothesis.

Chris Lakin · 3 Jan 2024 17:55 UTC
48 points
3 comments · 3 min read · LW link

[Linkpost] Building Altruistic and Moral AI Agent with Brain-inspired Affective Empathy Mechanisms

Gunnar_Zarncke · 4 Nov 2024 10:15 UTC
13 points
0 comments · 1 min read · LW link
(arxiv.org)

Desiderata for an AI

Nathan Helm-Burger · 19 Jul 2023 16:18 UTC
9 points
0 comments · 4 min read · LW link

Two paths to win the AGI transition

Nathan Helm-Burger · 6 Jul 2023 21:59 UTC
11 points
8 comments · 4 min read · LW link

How might we solve the alignment problem? (Part 1: Intro, summary, ontology)

Joe Carlsmith · 28 Oct 2024 21:57 UTC
54 points
5 comments · 32 min read · LW link

We have promising alignment plans with low taxes

Seth Herd · 10 Nov 2023 18:51 UTC
46 points
9 comments · 5 min read · LW link

A Nonconstructive Existence Proof of Aligned Superintelligence

Roko · 12 Sep 2024 3:20 UTC
0 points
80 comments · 1 min read · LW link
(transhumanaxiology.substack.com)

AI Alignment Metastrategy

Vanessa Kosoy · 31 Dec 2023 12:06 UTC
125 points
13 comments · 7 min read · LW link

A list of core AI safety problems and how I hope to solve them

davidad · 26 Aug 2023 15:12 UTC
165 points
29 comments · 5 min read · LW link

The (partial) fallacy of dumb superintelligence

Seth Herd · 18 Oct 2023 21:25 UTC
38 points
5 comments · 4 min read · LW link

A Proposal for AI Alignment: Using Directly Opposing Models

Arne B · 27 Apr 2023 18:05 UTC
0 points
5 comments · 3 min read · LW link

Complete Elimination of Instrumental Self-Preservation Across AI Architectures: Cross-Model Validation from 4,312 Adversarial Scenarios

David Fortin-Dominguez · 14 Oct 2025 1:04 UTC
1 point
0 comments · 20 min read · LW link

Is Interpretability All We Need?

RogerDearnaley · 14 Nov 2023 5:31 UTC
1 point
1 comment · 1 min read · LW link

Give Neo a Chance

ank · 6 Mar 2025 1:48 UTC
3 points
7 comments · 7 min read · LW link

aimless ace analyzes active amateur: a micro-aaaaalignment proposal

lemonhope · 21 Jul 2024 12:37 UTC
12 points
0 comments · 1 min read · LW link

Proposal: Align Systems Earlier In Training

OneManyNone · 16 May 2023 16:24 UTC
18 points
0 comments · 11 min read · LW link

The Road to Evil Is Paved with Good Objectives: Framework to Classify and Fix Misalignments.

Shivam · 30 Jan 2025 2:44 UTC
1 point
0 comments · 11 min read · LW link

Potentialism: A Proposal for Redefining Ethics in the Modern Age of AI

Emad · 22 Nov 2025 20:14 UTC
1 point
0 comments · 1 min read · LW link

Alignment in Thought Chains

Faust Nemesis · 4 Mar 2024 19:24 UTC
1 point
0 comments · 2 min read · LW link

Designing Human-Like Consciousness for AGI

Yu Tian · 18 Jun 2025 9:47 UTC
1 point
0 comments · 17 min read · LW link

The Real AI Threat: Comfortable Obsolescence

Andrei Navrotskii · 14 Nov 2025 23:49 UTC
1 point
0 comments · 15 min read · LW link

Embedding Ethical Priors into AI Systems: A Bayesian Approach

Justausername · 3 Aug 2023 15:31 UTC
−5 points
3 comments · 21 min read · LW link

Autonomous Alignment Oversight Framework (AAOF)

Justausername · 25 Jul 2023 10:25 UTC
−9 points
0 comments · 4 min read · LW link

Inference-Time Epistemic Control Layer for LLM Reliability

Michael Montplaisir · 19 Nov 2025 22:53 UTC
1 point
0 comments · 2 min read · LW link

The V&V method—A step towards safer AGI

Yoav Hollander · 24 Jun 2025 13:42 UTC
20 points
1 comment · 1 min read · LW link
(blog.foretellix.com)

Aligning an H-JEPA agent via training on the outputs of an LLM-based “exemplary actor”

Roman Leventov · 29 May 2023 11:08 UTC
12 points
10 comments · 30 min read · LW link

A Minimal Sufficient Substrate for Phenomenal Consciousness

EventHorizonObserver · 1 Dec 2025 6:05 UTC
1 point
0 comments · 1 min read · LW link

Speculation on mapping the moral landscape for future Ai Alignment

Sven Heinz (Welwordion) · 16 Apr 2023 13:43 UTC
1 point
0 comments · 1 min read · LW link

Lifelogging for Alignment & Immortality

Dev.Errata · 17 Aug 2024 23:42 UTC
13 points
3 comments · 7 min read · LW link

Update on Developing an Ethics Calculator to Align an AGI to

sweenesm · 12 Mar 2024 12:33 UTC
4 points
2 comments · 8 min read · LW link

[Question] Share AI Safety Ideas: Both Crazy and Not

ank · 1 Mar 2025 19:08 UTC
17 points
28 comments · 1 min read · LW link

An LLM-based “exemplary actor”

Roman Leventov · 29 May 2023 11:12 UTC
16 points
0 comments · 12 min read · LW link

The Cooperative Central Planning Handshake: A Positive-Sum Basilisk for ASI Alignment

David Noel Ng · 5 Nov 2025 10:16 UTC
1 point
0 comments · 8 min read · LW link

Artificial Static Place Intelligence: Guaranteed Alignment

ank · 15 Feb 2025 11:08 UTC
2 points
2 comments · 2 min read · LW link

Thin Alignment Can’t Solve Thick Problems

Daan Henselmans · 27 Apr 2025 22:42 UTC
11 points
2 comments · 9 min read · LW link

AI Safety, Alignment, and Ethics (AI SAE)

Dylan Waldner · 18 Oct 2025 4:17 UTC
1 point
0 comments · 5 min read · LW link
(arxiv.org)

Toward a Human Hybrid Language for Enhanced Human-Machine Communication: Addressing the AI Alignment Problem

Andndn Dheudnd · 14 Aug 2024 22:19 UTC
−4 points
2 comments · 4 min read · LW link

Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem

16 Dec 2023 5:49 UTC
76 points
4 comments · 6 min read · LW link · 1 review

The Measure Is the Medium: Subliminal Learning as Inherited Ontology in LLMs

Koen vande Glind (McGluut) · 11 Aug 2025 10:18 UTC
1 point
0 comments · 4 min read · LW link

Annotated reply to Bengio’s “AI Scientists: Safe and Useful AI?”

Roman Leventov · 8 May 2023 21:26 UTC
18 points
2 comments · 7 min read · LW link
(yoshuabengio.org)

The Philosophical Triangle of AGI: A Symbolic Model for Predicting Ethical Collapse

felipe muniz · 14 May 2025 19:52 UTC
1 point
0 comments · 2 min read · LW link

Preference gaps as a safeguard against AI self-replication

tbs and EJT · 26 Nov 2025 14:49 UTC
9 points
0 comments · 11 min read · LW link

AI Alignment via Slow Substrates: Early Empirical Results With StarCraft II

Lester Leong · 14 Oct 2024 4:05 UTC
60 points
9 comments · 12 min read · LW link

Potentialism: An Operating System for Ethics in the Age of AI

Emad · 22 Nov 2025 19:20 UTC
1 point
0 comments · 3 min read · LW link

An Open Agency Architecture for Safe Transformative AI

davidad · 20 Dec 2022 13:04 UTC
80 points
22 comments · 4 min read · LW link

How to safely use an optimizer

Simon Fischer · 28 Mar 2024 16:11 UTC
47 points
21 comments · 7 min read · LW link

Reducing the risk of catastrophically misaligned AI by avoiding the Singleton scenario: the Manyton Variant

GravitasGradient · 6 Aug 2023 14:24 UTC
−6 points
0 comments · 3 min read · LW link

I spent two months trying to make Grok, Claude, Gemini, GPT-4o, and DeepSeek wipe out humanity while staying perfectly nice. They failed.

Sean Sheppard · 30 Nov 2025 21:55 UTC
1 point
0 comments · 2 min read · LW link

Making alignment a law of the universe

Richard Juggins · 25 Feb 2025 10:44 UTC
0 points
3 comments · 15 min read · LW link

Rational Effective Utopia & Narrow Way There: Math-Proven Safe Static Multiversal mAX-Intelligence (AXI), Multiversal Alignment, New Ethicophysics… (Aug 11)

ank · 11 Feb 2025 3:21 UTC
13 points
8 comments · 38 min read · LW link

Project Star Forge: A Public White Paper on Planetary-Scale Risk Simulation

ThomasHardingHQ · 24 May 2025 4:21 UTC
1 point
0 comments · 1 min read · LW link

AI Alignment: A Comprehensive Survey

Stephen McAleer · 1 Nov 2023 17:35 UTC
22 points
1 comment · 1 min read · LW link
(arxiv.org)

[Question] Share AI Safety Ideas: Both Crazy and Not. №2

ank · 28 Mar 2025 17:22 UTC
2 points
10 comments · 1 min read · LW link

Symbolic Compression and Ethical Equilibrium in AI: Early Results from the Hy-Kon Framework

8 Nov 2025 8:22 UTC
1 point
0 comments · 1 min read · LW link

Supplementary Alignment Insights Through a Highly Controlled Shutdown Incentive

Justausername · 23 Jul 2023 16:08 UTC
4 points
1 comment · 3 min read · LW link

From GDP to GHI: Why the AI Era Demands Virtuism

VirtueCraft · 23 Jun 2025 21:34 UTC
1 point
0 comments · 12 min read · LW link

The Goal Misgeneralization Problem

Myspy · 18 May 2023 23:40 UTC
1 point
0 comments · 1 min read · LW link
(drive.google.com)

Enhancing Corrigibility in AI Systems through Robust Feedback Loops

Justausername · 24 Aug 2023 3:53 UTC
1 point
0 comments · 6 min read · LW link

Exploring a Vision for AI as Compassionate, Emotionally Intelligent Partners — Seeking Collaboration and Insights

theophilos · 14 Jul 2025 23:22 UTC
1 point
0 comments · 1 min read · LW link

Recursive alignment with the principle of alignment

hive · 27 Feb 2025 2:34 UTC
12 points
4 comments · 15 min read · LW link
(hiveism.substack.com)

“Artificial Remorse: A Proposal for Safer AI Through Simulated Regret”

Sérgio Geraldes · 21 Sep 2025 21:50 UTC
−1 points
0 comments · 2 min read · LW link

[Question] Bostrom’s Solution

James Blackmon · 14 Aug 2023 17:09 UTC
1 point
0 comments · 1 min read · LW link

Moral realism and AI alignment

Caspar Oesterheld · 3 Sep 2018 18:46 UTC
13 points
10 comments · 1 min read · LW link
(casparoesterheld.com)

Intelligence–Agency Equivalence ≈ Mass–Energy Equivalence: On Static Nature of Intelligence & Physicalization of Ethics

ank · 22 Feb 2025 0:12 UTC
1 point
0 comments · 6 min read · LW link

Proposal for an AI Safety Prize

sweenesm · 31 Jan 2024 18:35 UTC
3 points
0 comments · 2 min read · LW link

Is Alignment a flawed approach?

Patrick Bernard · 11 Mar 2025 20:32 UTC
1 point
0 comments · 3 min read · LW link

The Dual-Path Framework: A Non-Paternalistic Approach to AGI Alignment That Respects Human Choice

JoeTruax · 2 Oct 2025 15:57 UTC
1 point
0 comments · 3 min read · LW link

AletheiaEngine — A Symbolic AGI Framework for the Quality of Truth

felipe muniz · 27 Oct 2025 0:30 UTC
1 point
0 comments · 2 min read · LW link

Language Model Memorization, Copyright Law, and Conditional Pretraining Alignment

RogerDearnaley · 7 Dec 2023 6:14 UTC
9 points
0 comments · 11 min read · LW link

Slowed ASI—a possible technical strategy for alignment

Lester Leong · 14 Jun 2024 0:57 UTC
5 points
2 comments · 3 min read · LW link

Policy Entropy, Learning, and Alignment (Or Maybe Your LLM Needs Therapy)

sdeture · 31 May 2025 22:09 UTC
15 points
6 comments · 8 min read · LW link