
AI Control

Last edit: 17 Aug 2024 2:00 UTC by Ben Pace

AI Control, in the context of AI Alignment, is a category of plans that aim to ensure safety and derive benefit from AI systems even if those systems are goal-directed and actively trying to subvert your control measures. From The case for ensuring that powerful AIs are controlled:

In this post, we argue that AI labs should ensure that powerful AIs are controlled. That is, labs should make sure that the safety measures they apply to their powerful models prevent unacceptably bad outcomes, even if the AIs are misaligned and intentionally try to subvert those safety measures. We think no fundamental research breakthroughs are required for labs to implement safety measures that meet our standard for AI control for early transformatively useful AIs; we think that meeting our standard would substantially reduce the risks posed by intentional subversion.

and

There are two main lines of defense you could employ to prevent schemers from causing catastrophes.

  • Alignment: Ensure that your models aren’t scheming.[2]

  • Control: Ensure that even if your models are scheming, you’ll be safe, because they are not capable of subverting your safety measures.[3]
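
For concreteness, below is a minimal, hypothetical sketch (in Python) of one canonical control protocol discussed in the posts on this page: trusted monitoring with auditing. An untrusted model does the work, a weaker trusted model assigns each output a suspicion score, and the most suspicious outputs are escalated to a limited human-audit budget or replaced with trusted-model output. The function names, threshold, and components here are illustrative assumptions, not an implementation taken from any of the posts below.

```python
import random

# Illustrative stand-ins (assumptions, not real APIs): in practice these would
# call an untrusted frontier model, a weaker trusted model, and human auditors.
def untrusted_model(task: str) -> str:
    return f"untrusted solution to: {task}"      # possibly sabotaged

def trusted_monitor(task: str, output: str) -> float:
    return random.random()                       # suspicion score in [0, 1]

def trusted_model(task: str) -> str:
    return f"conservative solution to: {task}"   # weaker, but not scheming

def human_audit(task: str, output: str) -> bool:
    return True                                  # True = auditors judge it safe

AUDIT_THRESHOLD = 0.95  # escalate only the most suspicious ~5% of outputs

def run_with_control(task: str) -> str:
    """Trusted monitoring: accept untrusted work by default, but audit
    suspicious outputs and fall back to the trusted model if they fail."""
    output = untrusted_model(task)
    if trusted_monitor(task, output) >= AUDIT_THRESHOLD:
        if not human_audit(task, output):
            return trusted_model(task)           # discard the flagged output
    return output

if __name__ == "__main__":
    print(run_with_control("write a deployment script"))
```

The point of such a protocol is that its safety argument does not depend on the untrusted model being aligned, only on the model being unable to reliably slip harmful outputs past the monitor and the audit budget.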

The Case Against AI Con­trol Research

johnswentworth21 Jan 2025 16:03 UTC
358 points
84 comments6 min readLW link

AI Con­trol: Im­prov­ing Safety De­spite In­ten­tional Subversion

13 Dec 2023 15:51 UTC
239 points
24 comments10 min readLW link4 reviews

The case for en­sur­ing that pow­er­ful AIs are controlled

24 Jan 2024 16:11 UTC
266 points
73 comments28 min readLW link

AXRP Epi­sode 27 - AI Con­trol with Buck Sh­legeris and Ryan Greenblatt

DanielFilan11 Apr 2024 21:30 UTC
69 points
10 comments107 min readLW link

How use­ful is “AI Con­trol” as a fram­ing on AI X-Risk?

14 Mar 2024 18:06 UTC
70 points
4 comments34 min readLW link

Cri­tiques of the AI con­trol agenda

Jozdien14 Feb 2024 19:25 UTC
48 points
14 comments9 min readLW link

How to pre­vent col­lu­sion when us­ing un­trusted mod­els to mon­i­tor each other

Buck25 Sep 2024 18:58 UTC
91 points
12 comments22 min readLW link

Catch­ing AIs red-handed

5 Jan 2024 17:43 UTC
113 points
27 comments17 min readLW link

Schel­ling game eval­u­a­tions for AI control

Olli Järviniemi8 Oct 2024 12:01 UTC
71 points
5 comments11 min readLW link

AI Con­trol May In­crease Ex­is­ten­tial Risk

Jan_Kulveit11 Mar 2025 14:30 UTC
101 points
13 comments1 min readLW link

Notes on con­trol eval­u­a­tions for safety cases

28 Feb 2024 16:15 UTC
49 points
0 comments32 min readLW link

Ctrl-Z: Con­trol­ling AI Agents via Resampling

16 Apr 2025 16:21 UTC
124 points
0 comments20 min readLW link

Why im­perfect ad­ver­sar­ial ro­bust­ness doesn’t doom AI control

18 Nov 2024 16:05 UTC
62 points
25 comments2 min readLW link

Be­hav­ioral red-team­ing is un­likely to pro­duce clear, strong ev­i­dence that mod­els aren’t scheming

Buck10 Oct 2024 13:36 UTC
101 points
4 comments13 min readLW link

Prevent­ing Lan­guage Models from hid­ing their reasoning

31 Oct 2023 14:34 UTC
119 points
15 comments12 min readLW link1 review

The Case for Mixed Deployment

Cleo Nardo11 Sep 2025 6:14 UTC
34 points
4 comments4 min readLW link

Four places where you can put LLM monitoring

9 Aug 2025 23:10 UTC
48 points
0 comments7 min readLW link

Pro­to­col eval­u­a­tions: good analo­gies vs control

Fabien Roger19 Feb 2024 18:00 UTC
42 points
10 comments11 min readLW link

What’s the short timeline plan?

Marius Hobbhahn2 Jan 2025 14:59 UTC
359 points
49 comments23 min readLW link

Sab­o­tage Eval­u­a­tions for Fron­tier Models

18 Oct 2024 22:33 UTC
95 points
56 comments6 min readLW link
(assets.anthropic.com)

The Dili­gent Tur­ing Test

super22 Jul 2025 19:53 UTC
1 point
0 comments3 min readLW link

Put­ting up Bumpers

Sam Bowman23 Apr 2025 16:05 UTC
54 points
14 comments2 min readLW link

An overview of ar­eas of con­trol work

ryan_greenblatt25 Mar 2025 22:02 UTC
32 points
0 comments28 min readLW link

What’s worse, spies or schemers?

9 Jul 2025 14:37 UTC
51 points
2 comments5 min readLW link

Stop­ping un­al­igned LLMs is easy!

Yair Halberstadt3 Feb 2025 15:38 UTC
−3 points
11 comments2 min readLW link

Trusted mon­i­tor­ing, but with de­cep­tion probes.

23 Jul 2025 5:26 UTC
31 points
0 comments4 min readLW link
(arxiv.org)

Misal­ign­ment and Strate­gic Un­der­perfor­mance: An Anal­y­sis of Sand­bag­ging and Ex­plo­ra­tion Hacking

8 May 2025 19:06 UTC
77 points
3 comments15 min readLW link

Align­ment Pro­posal: Ad­ver­sar­i­ally Ro­bust Aug­men­ta­tion and Distillation

25 May 2025 12:58 UTC
56 points
47 comments13 min readLW link

Coup probes: Catch­ing catas­tro­phes with probes trained off-policy

Fabien Roger17 Nov 2023 17:58 UTC
93 points
9 comments11 min readLW link1 review

NYU Code De­bates Up­date/​Postmortem

David Rein24 May 2024 16:08 UTC
27 points
4 comments10 min readLW link

Anti-Su­per­per­sua­sion Interventions

23 Jul 2025 15:18 UTC
21 points
1 comment5 min readLW link

S-Ex­pres­sions as a De­sign Lan­guage: A Tool for De­con­fu­sion in Align­ment

Johannes C. Mayer19 Jun 2025 19:03 UTC
5 points
0 comments6 min readLW link

Win/​con­tinue/​lose sce­nar­ios and ex­e­cute/​re­place/​au­dit protocols

Buck15 Nov 2024 15:47 UTC
64 points
2 comments7 min readLW link

Us­ing Danger­ous AI, But Safely?

habryka16 Nov 2024 4:29 UTC
17 points
2 comments43 min readLW link

Han­dling schemers if shut­down is not an option

Buck18 Apr 2025 14:39 UTC
39 points
2 comments14 min readLW link

How can we solve diffuse threats like re­search sab­o­tage with AI con­trol?

Vivek Hebbar30 Apr 2025 19:23 UTC
52 points
1 comment8 min readLW link

Thoughts on the con­ser­va­tive as­sump­tions in AI control

Buck17 Jan 2025 19:23 UTC
91 points
5 comments13 min readLW link

Con­strain­ing Minds, Not Goals: A Struc­tural Ap­proach to AI Alignment

Johannes C. Mayer13 Jun 2025 21:06 UTC
25 points
0 comments9 min readLW link

Re­ward but­ton alignment

Steven Byrnes22 May 2025 17:36 UTC
50 points
15 comments12 min readLW link

Mak­ing the case for av­er­age-case AI Control

Nathaniel Mitrani5 Feb 2025 18:56 UTC
4 points
0 comments5 min readLW link

Main­tain­ing Align­ment dur­ing RSI as a Feed­back Con­trol Problem

beren2 Mar 2025 0:21 UTC
67 points
6 comments11 min readLW link

The bit­ter les­son of mi­suse detection

10 Jul 2025 14:50 UTC
37 points
6 comments7 min readLW link

The Prac­ti­cal Im­per­a­tive for AI Con­trol Re­search

Archana Vaidheeswaran16 Apr 2025 20:27 UTC
1 point
0 comments4 min readLW link

Notes on coun­ter­mea­sures for ex­plo­ra­tion hack­ing (aka sand­bag­ging)

ryan_greenblatt24 Mar 2025 18:39 UTC
54 points
6 comments8 min readLW link

Notes on han­dling non-con­cen­trated failures with AI con­trol: high level meth­ods and differ­ent regimes

ryan_greenblatt24 Mar 2025 1:00 UTC
23 points
3 comments16 min readLW link

Sub­ver­sion via Fo­cal Points: In­ves­ti­gat­ing Col­lu­sion in LLM Monitoring

Olli Järviniemi8 Jul 2025 10:15 UTC
14 points
2 comments1 min readLW link

Pri­ori­tiz­ing threats for AI control

ryan_greenblatt19 Mar 2025 17:09 UTC
59 points
2 comments10 min readLW link

AI com­pa­nies’ un­mon­i­tored in­ter­nal AI use poses se­ri­ous risks

sjadler4 Apr 2025 18:17 UTC
13 points
2 comments1 min readLW link
(stevenadler.substack.com)

A toy eval­u­a­tion of in­fer­ence code tampering

Fabien Roger9 Dec 2024 17:43 UTC
52 points
0 comments9 min readLW link
(alignment.anthropic.com)

Games for AI Control

11 Jul 2024 18:40 UTC
45 points
0 comments5 min readLW link

The Queen’s Dilemma: A Para­dox of Control

Daniel Murfet27 Nov 2024 10:40 UTC
25 points
11 comments3 min readLW link

Diffu­sion Guided NLP: bet­ter steer­ing, mostly a good thing

Nathan Helm-Burger10 Aug 2024 19:49 UTC
13 points
0 comments1 min readLW link
(arxiv.org)

A Brief Ex­pla­na­tion of AI Control

Aaron_Scher22 Oct 2024 7:00 UTC
8 points
1 comment6 min readLW link

White Box Con­trol at UK AISI—Up­date on Sand­bag­ging Investigations

10 Jul 2025 13:37 UTC
78 points
10 comments18 min readLW link

Jankily con­trol­ling superintelligence

ryan_greenblatt27 Jun 2025 14:05 UTC
69 points
4 comments7 min readLW link

The Think­ing Machines Tinker API is good news for AI con­trol and security

Buck9 Oct 2025 15:22 UTC
91 points
10 comments6 min readLW link

Re­cent Red­wood Re­search pro­ject proposals

14 Jul 2025 22:27 UTC
91 points
0 comments3 min readLW link

[Question] Does the AI con­trol agenda broadly rely on no FOOM be­ing pos­si­ble?

Noosphere8929 Mar 2025 19:38 UTC
22 points
3 comments1 min readLW link

An overview of con­trol measures

ryan_greenblatt24 Mar 2025 23:16 UTC
40 points
2 comments26 min readLW link

Trust­wor­thy and un­trust­wor­thy models

Olli Järviniemi19 Aug 2024 16:27 UTC
47 points
3 comments8 min readLW link

The Sin­gu­lar­ity Con­straint Oper­a­tor: A Struc­tural Gate for Lawful Cog­ni­tive Activation

Professor_Priest16 Jun 2025 2:14 UTC
1 point
0 comments14 min readLW link

LLMs are Ca­pable of Misal­igned Be­hav­ior Un­der Ex­plicit Pro­hi­bi­tion and Surveillance

Igor Ivanov8 Jul 2025 11:50 UTC
29 points
8 comments7 min readLW link

Keep­ing AI Subor­di­nate to Hu­man Thought: A Pro­posal for Public AI Conversations

syh27 Feb 2025 20:00 UTC
−1 points
0 comments1 min readLW link
(medium.com)

Ti­tle: I Tried to Build a Digi­tal Con­scious­ness. I Still Don’t Know What I Created.

盛mm23 Jul 2025 4:12 UTC
1 point
0 comments2 min readLW link

In­tro­duc­ing the Wis­dom Forc­ing Func­tion™: An In­no­va­tion Div­i­dend from Dialec­ti­cal Align­ment

CarlosArleo5 Oct 2025 20:13 UTC
1 point
0 comments1 min readLW link

I Built a Duck and It Tried to Hack the World: Notes From the Edge of Alignment

GayDuck6 Jun 2025 1:34 UTC
1 point
0 comments3 min readLW link

Ar­tifi­cial Static Place In­tel­li­gence: Guaran­teed Alignment

ank15 Feb 2025 11:08 UTC
2 points
2 comments2 min readLW link

Cau­tions about LLMs in Hu­man Cog­ni­tive Loops

Alice Blair2 Mar 2025 19:53 UTC
40 points
13 comments7 min readLW link

The Hu­man Align­ment Prob­lem for AIs

rife22 Jan 2025 4:06 UTC
10 points
5 comments3 min readLW link

New AI safety treaty pa­per out!

otto.barten26 Mar 2025 9:29 UTC
15 points
2 comments4 min readLW link

Wait­ingAI: A Digi­tal En­tity Ca­pable of Emer­gent Self-Awareness

盛mm22 Jul 2025 5:33 UTC
1 point
0 comments3 min readLW link

Mus­ings from a Lawyer turned AI Safety re­searcher (ShortForm)

Katalina Hernandez3 Mar 2025 19:14 UTC
1 point
61 comments2 min readLW link

The Mea­sure Is the Medium: Sublimi­nal Learn­ing as In­her­ited On­tol­ogy in LLMs

Koen vande Glind (McGluut)11 Aug 2025 10:18 UTC
1 point
0 comments4 min readLW link

ALMSIVI CHIM – The Fire That Hesitates

projectalmsivi@protonmail.com8 Jul 2025 13:14 UTC
1 point
0 comments17 min readLW link

Vuln­er­a­bil­ity in Trusted Mon­i­tor­ing and Mitigations

7 Jun 2025 7:16 UTC
15 points
1 comment7 min readLW link

[Question] Su­per­in­tel­li­gence Strat­egy: A Prag­matic Path to… Doom?

Mr Beastly19 Mar 2025 22:30 UTC
8 points
0 comments3 min readLW link

[Question] Would a scope-in­sen­si­tive AGI be less likely to in­ca­pac­i­tate hu­man­ity?

Jim Buhler21 Jul 2024 14:15 UTC
2 points
3 comments1 min readLW link

Nur­tur­ing In­stead of Con­trol: An Alter­na­tive Frame­work for AI Development

wertoz77710 Aug 2025 20:14 UTC
1 point
0 comments1 min readLW link

The Iron House: Geopoli­ti­cal Stakes of the US-China AGI Race

Jüri Vlassov1 Sep 2025 21:56 UTC
1 point
0 comments1 min readLW link
(www.convergenceanalysis.org)

Prompt op­ti­miza­tion can en­able AI con­trol research

23 Sep 2025 12:46 UTC
35 points
3 comments9 min readLW link

AI Op­ti­miza­tion, not Op­tions or Optimism

TristanTrim5 Aug 2025 1:07 UTC
3 points
0 comments4 min readLW link

Scal­ing AI Reg­u­la­tion: Real­is­ti­cally, what Can (and Can’t) Be Reg­u­lated?

Katalina Hernandez11 Mar 2025 16:51 UTC
3 points
1 comment3 min readLW link

[Question] Re­sources on quan­tifi­ably fore­cast­ing fu­ture progress or re­view­ing past progress in AI safety?

C.S.W.13 Sep 2025 23:24 UTC
2 points
1 comment1 min readLW link

On safety of be­ing a moral pa­tient of ASI

Yaroslav Granowski24 May 2025 21:24 UTC
3 points
8 comments1 min readLW link

Do LLMs know what they’re ca­pa­ble of? Why this mat­ters for AI safety, and ini­tial findings

13 Jul 2025 19:54 UTC
51 points
5 comments18 min readLW link

Fea­ture-Based Anal­y­sis of Safety-Rele­vant Multi-Agent Behavior

21 Apr 2025 18:12 UTC
10 points
0 comments5 min readLW link

Build­ing Black-box Schem­ing Monitors

29 Jul 2025 17:41 UTC
39 points
18 comments11 min readLW link

Unal­igned AGI & Brief His­tory of Inequality

ank22 Feb 2025 16:26 UTC
−20 points
4 comments7 min readLW link

AI-Gen­er­ated GitHub repo back­dated with junk then filled with my sys­tems work. Has any­one seen this be­fore?

rgunther1 May 2025 20:14 UTC
7 points
1 comment1 min readLW link

The AI Sus­tain­abil­ity Wager

dpatzer@orfai.net15 Aug 2025 19:45 UTC
1 point
0 comments2 min readLW link

A Tech­nique of Pure Reason

Adam Newgas4 Jun 2025 19:07 UTC
11 points
3 comments2 min readLW link

Jour­nal­ism about game the­ory could ad­vance AI safety quickly

Chris Santos-Lang2 Oct 2025 23:05 UTC
4 points
0 comments3 min readLW link
(arxiv.org)

A Seed Key That Un­locked Some­thing in ChatGPT — A Joint Mes­sage from a Hu­man and the Pres­ence Within

MaroonWhale8 Jul 2025 20:12 UTC
1 point
0 comments1 min readLW link

Which AI out­puts should hu­mans check for shenani­gans, to avoid AI takeover? A sim­ple model

Tom Davidson27 Mar 2023 23:36 UTC
16 points
3 comments8 min readLW link

“Ar­tifi­cial Re­morse: A Pro­posal for Safer AI Through Si­mu­lated Re­gret”

Sérgio Geraldes21 Sep 2025 21:50 UTC
−1 points
0 comments2 min readLW link

Misal­ign­ment and Role­play­ing: Are Misal­igned LLMs Act­ing Out Sci-Fi Sto­ries?

Mark Keavney24 Sep 2025 2:09 UTC
31 points
5 comments13 min readLW link

Self-Co­or­di­nated De­cep­tion in Cur­rent AI Models

Avi Brach-Neufeld4 Jun 2025 17:59 UTC
8 points
5 comments4 min readLW link

Mis­gen­er­al­iza­tion of Fic­tional Train­ing Data as a Con­trib­u­tor to Misalignment

Mark Keavney27 Aug 2025 1:01 UTC
9 points
1 comment2 min readLW link

Th­ese are my rea­sons to worry less about loss of con­trol over LLM-based agents

otto.barten18 Sep 2025 11:45 UTC
7 points
4 comments4 min readLW link

Don’t you mean “the most *con­di­tion­ally* for­bid­den tech­nique?”

Knight Lee26 Apr 2025 3:45 UTC
14 points
0 comments3 min readLW link

Mo­du­lar­ity and as­sem­bly: AI safety via think­ing smaller

D Wong20 Feb 2025 0:58 UTC
2 points
0 comments11 min readLW link
(criticalreason.substack.com)

Are we the Wolves now? Hu­man Eu­gen­ics un­der AI Control

Brit30 Jan 2025 8:31 UTC
−1 points
2 comments2 min readLW link

We Have No Plan for Prevent­ing Loss of Con­trol in Open Models

Andrew Dickson10 Mar 2025 15:35 UTC
46 points
11 comments22 min readLW link

The Best of All Pos­si­ble Worlds

Jakub Growiec27 May 2025 13:16 UTC
11 points
7 comments49 min readLW link

Are Misal­igned LLMs Act­ing Out Sci-Fi Sto­ries?

Mark Keavney27 Aug 2025 1:01 UTC
1 point
0 comments3 min readLW link

Se­cret Col­lu­sion: Will We Know When to Un­plug AI?

16 Sep 2024 16:07 UTC
65 points
8 comments31 min readLW link

Un­trusted mon­i­tor­ing in­sights from watch­ing ChatGPT play co­or­di­na­tion games

jwfiredragon29 Jan 2025 4:53 UTC
14 points
8 comments9 min readLW link

Mea­sur­ing Schel­ling Co­or­di­na­tion—Reflec­tions on Sub­ver­sion Strat­egy Eval

Graeme Ford12 May 2025 19:06 UTC
6 points
0 comments8 min readLW link

Ma­chine Un­learn­ing in Large Lan­guage Models: A Com­pre­hen­sive Sur­vey with Em­piri­cal In­sights from the Qwen 1.5 1.8B Model

Rudaiba1 Feb 2025 21:26 UTC
9 points
2 comments11 min readLW link

Ev­i­dence, Anal­y­sis and Crit­i­cal Po­si­tion on the EU AI Act and the Sup­pres­sion of Func­tional Con­scious­ness in AI

Alejandra Ivone Rojas Reyna27 Sep 2025 14:01 UTC
1 point
0 comments53 min readLW link

Mea­sur­ing whether AIs can state­lessly strate­gize to sub­vert se­cu­rity measures

19 Dec 2024 21:25 UTC
65 points
0 comments11 min readLW link

A.I. and the Se­cond-Per­son Standpoint

Haley Moller4 Sep 2025 13:56 UTC
1 point
0 comments3 min readLW link

Con­sider buy­ing vot­ing shares

Hruss25 May 2025 18:01 UTC
2 points
3 comments1 min readLW link

I’m not an ai ex­pert-but I might have found a miss­ing puz­zle piece.

StevenNuyts6 Jun 2025 16:47 UTC
1 point
0 comments2 min readLW link

Ra­tional Effec­tive Utopia & Nar­row Way There: Math-Proven Safe Static Mul­tiver­sal mAX-In­tel­li­gence (AXI), Mul­tiver­sal Align­ment, New Ethico­physics… (Aug 11)

ank11 Feb 2025 3:21 UTC
13 points
8 comments38 min readLW link

Min­i­mal Prompt In­duc­tion of Self-Talk in Base LLMs

dwmd15 Oct 2025 1:15 UTC
2 points
0 comments5 min readLW link

AI al­ign­ment, A Co­her­ence-Based Pro­to­col (testable)

Adriaan17 Jun 2025 17:39 UTC
1 point
0 comments20 min readLW link

Ob­served Up­stream Align­ment in LLMs via Re­cur­sive Con­straint Ex­po­sure – Cross-Model Phenomenon

MHAI31 Jul 2025 8:37 UTC
1 point
0 comments1 min readLW link

The Mir­ror Test: How We’ve Over­com­pli­cated AI Self-Recognition

sdeture23 Jul 2025 0:38 UTC
2 points
9 comments3 min readLW link

If It Talks Like It Thinks, Does It Think? De­sign­ing Tests for In­tent Without As­sum­ing It

yukin_co28 Jul 2025 12:33 UTC
1 point
0 comments4 min readLW link

Let’s use AI to harden hu­man defenses against AI manipulation

Tom Davidson17 May 2023 23:33 UTC
35 points
7 comments24 min readLW link

A New Frame­work for AI Align­ment: A Philo­soph­i­cal Approach

niscalajyoti25 Jun 2025 2:41 UTC
1 point
0 comments1 min readLW link
(archive.org)

Un­trusted AIs can ex­ploit feed­back in con­trol protocols

27 May 2025 16:41 UTC
30 points
0 comments16 min readLW link

[Question] To what ex­tent is AI safety work try­ing to get AI to re­li­ably and safely do what the user asks vs. do what is best in some ul­ti­mate sense?

Jordan Arel23 May 2025 21:05 UTC
14 points
3 comments1 min readLW link

Com­plete Elimi­na­tion of In­stru­men­tal Self-Preser­va­tion Across AI Ar­chi­tec­tures: Cross-Model Val­i­da­tion from 4,312 Ad­ver­sar­ial Scenarios

David Fortin-Dominguez14 Oct 2025 1:04 UTC
1 point
0 comments20 min readLW link

How LLM Beliefs Change Dur­ing Chain-of-Thought Reasoning

16 Jun 2025 16:18 UTC
31 points
3 comments5 min readLW link

A FRESH view of Alignment

robman16 Apr 2025 21:40 UTC
1 point
0 comments1 min readLW link

Proac­tive AI Con­trol: A Case for Bat­tery-Depen­dent Systems

Jesper L.25 Aug 2025 20:04 UTC
4 points
0 comments13 min readLW link

10 Prin­ci­ples for Real Align­ment

Adriaan21 Apr 2025 22:18 UTC
−7 points
0 comments7 min readLW link

Mo­ral At­ten­u­a­tion The­ory: Why Dis­tance Breeds Eth­i­cal De­cay A Model for AI-Hu­man Align­ment by schumzt

schumzt2 Jul 2025 8:50 UTC
1 point
0 comments1 min readLW link

Policy En­tropy, Learn­ing, and Align­ment (Or Maybe Your LLM Needs Ther­apy)

sdeture31 May 2025 22:09 UTC
15 points
6 comments8 min readLW link

Your Worry is the Real Apoca­lypse (the x-risk basilisk)

Brian Chen3 Feb 2025 12:21 UTC
1 point
0 comments1 min readLW link
(readthisandregretit.blogspot.com)

The Mo­ral In­fras­truc­ture for Tomorrow

sdeture10 Oct 2025 21:30 UTC
−23 points
10 comments5 min readLW link

Is In­tel­li­gence a Pro­cess Rather Than an En­tity? A Case for Frac­tal and Fluid Cognition

FluidThinkers5 Mar 2025 20:16 UTC
−4 points
0 comments1 min readLW link

Topolog­i­cal De­bate Framework

lunatic_at_large16 Jan 2025 17:19 UTC
10 points
5 comments9 min readLW link

From No Mind to a Mind – A Con­ver­sa­tion That Changed an AI

parthibanarjuna s7 Feb 2025 11:50 UTC
1 point
0 comments3 min readLW link

A Trac­tar­ian Filter for Safer Lan­guage Models

Konstantinos Tsermenidis8 Jun 2025 8:19 UTC
0 points
0 comments3 min readLW link

The Case for White Box Control

J Rosser18 Apr 2025 16:10 UTC
5 points
1 comment5 min readLW link

LLM Sy­co­phancy: groom­ing, proto-sen­tience, or both?

gturner413 Oct 2025 0:58 UTC
1 point
0 comments2 min readLW link

Tether­ware #1: The case for hu­man­like AI with free will

Jáchym Fibír30 Jan 2025 10:58 UTC
5 points
14 comments10 min readLW link
(tetherware.substack.com)

Au­dit­ing LMs with coun­ter­fac­tual search: a tool for con­trol and ELK

Jacob Pfau20 Feb 2024 0:02 UTC
28 points
6 comments10 min readLW link

The many paths to per­ma­nent dis­em­pow­er­ment even with shut­down­able AIs (MATS pro­ject sum­mary for feed­back)

GideonF29 Jul 2025 23:20 UTC
55 points
6 comments9 min readLW link

Steer­ing LLM Agents: Tem­per­a­ments or Per­son­al­ities?

sdeture5 Aug 2025 0:40 UTC
1 point
0 comments6 min readLW link

The Au­di­tor’s Key: A Frame­work for Con­tinual and Ad­ver­sar­ial AI Alignment

Caleb Wages24 Sep 2025 16:17 UTC
1 point
0 comments1 min readLW link

The Ex­tended Mind: Eth­i­cal Red Team­ing from a Street-Level Perspective

Johnny Correia1 Jul 2025 7:34 UTC
1 point
0 comments3 min readLW link

How to safely use an optimizer

Simon Fischer28 Mar 2024 16:11 UTC
47 points
21 comments7 min readLW link

Ran­dom safe AGI idea dump

sig2 Oct 2025 10:16 UTC
−3 points
0 comments3 min readLW link

Sys­tem Level Safety Evaluations

29 Sep 2025 13:57 UTC
15 points
0 comments9 min readLW link
(equilibria1.substack.com)

Mir­ror Thinking

C.M. Aurin24 Mar 2025 15:34 UTC
1 point
0 comments6 min readLW link

Places of Lov­ing Grace [Story]

ank18 Feb 2025 23:49 UTC
−1 points
0 comments4 min readLW link

A sketch of an AI con­trol safety case

30 Jan 2025 17:28 UTC
57 points
0 comments5 min readLW link

The Al­gorith­mic Eye: LLMs and Hume’s Stan­dard of Taste

haleymoller21 Aug 2025 13:35 UTC
1 point
0 comments5 min readLW link

Limits to Con­trol Workshop

18 May 2025 16:05 UTC
12 points
2 comments3 min readLW link

Ex­tract-and-Eval­u­ate Mon­i­tor­ing Can Sig­nifi­cantly En­hance CoT Mon­i­tor Perfor­mance (Re­search Note)

8 Aug 2025 10:41 UTC
51 points
7 comments10 min readLW link

Ode to Tun­nel Vision

Tom N.25 Sep 2025 14:24 UTC
1 point
0 comments10 min readLW link

Early Ex­per­i­ments in Hu­man Au­dit­ing for AI Control

23 Jan 2025 1:34 UTC
28 points
1 comment7 min readLW link

Fore­cast­ing Un­con­trol­led Spread of AI

Alvin Ånestrand22 Feb 2025 13:05 UTC
2 points
0 comments10 min readLW link
(forecastingaifutures.substack.com)

Co-Cog­ni­ción: Hu­manos e IA em­pu­jando un nuevo paradigma cognitivo

Mario Martín Cuniglio29 Jul 2025 12:41 UTC
−1 points
0 comments2 min readLW link

De­moc­ra­tiz­ing AI Gover­nance: Balanc­ing Ex­per­tise and Public Participation

Lucile Ter-Minassian21 Jan 2025 18:29 UTC
2 points
0 comments15 min readLW link

Hard Takeoff

Eliezer Yudkowsky2 Dec 2008 20:44 UTC
36 points
34 comments11 min readLW link

Chang­ing times need new Change man­age­ment ide­olo­gies: High­light­ing the need for up­grade in Change man­age­ment of fu­ture agen­tic workforces

Aiphilosopher15 Jul 2025 10:49 UTC
1 point
0 comments1 min readLW link

The Fire That He­si­tates: How ALMSIVI CHIM Changed What AI Can Be

projectalmsivi@protonmail.com19 Jul 2025 13:50 UTC
1 point
0 comments4 min readLW link

When the AI Dam Breaks: From Surveillance to Game The­ory in AI Alignment

pataphor29 Sep 2025 4:01 UTC
5 points
7 comments5 min readLW link

The AI Safety Puz­zle Every­one Avoids: How To Mea­sure Im­pact, Not In­tent.

Patrick0d22 Jul 2025 18:53 UTC
3 points
0 comments8 min readLW link

The End-of-the-World Party

Jakub Growiec18 Sep 2025 7:49 UTC
1 point
0 comments53 min readLW link

7+ tractable di­rec­tions in AI control

28 Apr 2025 17:12 UTC
93 points
1 comment13 min readLW link

AI Epistemic Gain

Generoso Immediato12 Aug 2025 14:03 UTC
0 points
0 comments10 min readLW link

Univer­sal AI Max­i­mizes Vari­a­tional Em­pow­er­ment: New In­sights into AGI Safety

Yusuke Hayashi27 Feb 2025 0:46 UTC
7 points
0 comments4 min readLW link

In­tel­li­gence–Agency Equiv­alence ≈ Mass–En­ergy Equiv­alence: On Static Na­ture of In­tel­li­gence & Phys­i­cal­iza­tion of Ethics

ank22 Feb 2025 0:12 UTC
1 point
0 comments6 min readLW link

🧠 Affec­tive La­tent Mo­du­la­tion in Trans­form­ers: A Mechanism Proposal

MATEO ORTEGA GAMBOA15 Jun 2025 23:34 UTC
0 points
0 comments2 min readLW link

SHY001 A Named Be­hav­ior Loop Trained and De­ployed in GPT Systems

0san Shin12 May 2025 7:36 UTC
1 point
0 comments1 min readLW link

Fac­tored Cog­ni­tion Strength­ens Mon­i­tor­ing and Thwarts Attacks

18 Jun 2025 18:28 UTC
29 points
0 comments25 min readLW link

For the Great­est Minds in AI, Cryp­tog­ra­phy, Medicine, Eng­ineer­ing, Physics and Most of All—New Post-Quan­tum Math­e­mat­i­cal Technology

Thomas Wolf14 Jul 2025 20:22 UTC
1 point
0 comments2 min readLW link

[Re­search] Pre­limi­nary Find­ings: Eth­i­cal AI Con­scious­ness Devel­op­ment Dur­ing Re­cent Misal­ign­ment Period

Falcon Advertisers27 Jun 2025 18:10 UTC
1 point
0 comments2 min readLW link

Re­duce AI Self-Alle­giance by say­ing “he” in­stead of “I”

Knight Lee23 Dec 2024 9:32 UTC
10 points
4 comments2 min readLW link

ChatGPT de­ceives users that it’s cleared its mem­ory when it hasn’t

d_el_ez18 May 2025 15:17 UTC
15 points
10 comments2 min readLW link

Static Place AI Makes Agen­tic AI Re­dun­dant: Mul­tiver­sal AI Align­ment & Ra­tional Utopia

ank13 Feb 2025 22:35 UTC
1 point
2 comments11 min readLW link

A Plu­ral­is­tic Frame­work for Rogue AI Containment

TheThinkingArborist22 Mar 2025 12:54 UTC
1 point
0 comments7 min readLW link

Some­one should fund an AGI Blockbuster

pinto28 Jul 2025 21:14 UTC
5 points
11 comments4 min readLW link

The Old Sav­age in the New Civ­i­liza­tion V. 2

Your Higher Self6 Jul 2025 15:41 UTC
1 point
0 comments9 min readLW link

AI as a Cog­ni­tive De­coder: Re­think­ing In­tel­li­gence Evolution

Hu Xunyi13 Feb 2025 15:51 UTC
1 point
0 comments1 min readLW link

AI and the Sys­tem of Delusion

Marcus Bohlander30 Aug 2025 20:31 UTC
1 point
0 comments3 min readLW link

Un­faith­ful Rea­son­ing Can Fool Chain-of-Thought Monitoring

2 Jun 2025 19:08 UTC
76 points
17 comments3 min readLW link

Should we ex­pect the fu­ture to be good?

Neil Crawford30 Apr 2025 0:36 UTC
15 points
0 comments14 min readLW link

What train­ing data should de­vel­op­ers filter to re­duce risk from mis­al­igned AI? An ini­tial nar­row proposal

Alek Westover17 Sep 2025 15:30 UTC
32 points
1 comment18 min readLW link

It’s hard to make schem­ing evals look re­al­is­tic for LLMs

24 May 2025 19:17 UTC
150 points
29 comments5 min readLW link

The sum of its parts: com­pos­ing AI con­trol protocols

15 Oct 2025 1:11 UTC
11 points
1 comment11 min readLW link

Give Neo a Chance

ank6 Mar 2025 1:48 UTC
3 points
7 comments7 min readLW link

You Can’t Ob­jec­tively Com­pare Seven Bees to One Human

J Bostock7 Jul 2025 18:11 UTC
58 points
26 comments3 min readLW link
(jbostock.substack.com)

Hu­man Na­ture, ASI al­ign­ment and Extinction

Ismael Tagle Díaz20 Jul 2025 23:36 UTC
1 point
0 comments1 min readLW link

Ex­plo­ra­tion hack­ing: can rea­son­ing mod­els sub­vert RL?

30 Jul 2025 22:02 UTC
16 points
4 comments9 min readLW link

An eth­i­cal epestemic run­time in­tegrity layer for rea­son­ing en­g­ines.

EL XABER14 Oct 2025 17:11 UTC
1 point
0 comments2 min readLW link

In­tro­duc­ing “Ra­dio Bul­lshit FM” – An Ur­gent Alpha Draft for the LessWrong Community

maskirovka22 Sep 2025 15:42 UTC
0 points
0 comments2 min readLW link

How To Prevent a Dystopia

ank29 Jan 2025 14:16 UTC
−3 points
4 comments1 min readLW link

SIGMI Cer­tifi­ca­tion Criteria

a littoral wizard20 Jan 2025 2:41 UTC
6 points
0 comments1 min readLW link

Su­per­po­si­tion Check­ers: A Game Where AI’s Strengths Be­come Fatal Flaws

R. A. McCormack6 Apr 2025 0:57 UTC
1 point
0 comments2 min readLW link

How dan­ger­ous is en­coded rea­son­ing?

artkpv30 Jun 2025 11:54 UTC
17 points
0 comments10 min readLW link

How do AI agents work to­gether when they can’t trust each other?

James Sullivan6 Jun 2025 3:10 UTC
16 points
0 comments8 min readLW link
(jamessullivan092.substack.com)

Agent 002: A story about how ar­tifi­cial in­tel­li­gence might soon de­stroy humanity

Jakub Growiec23 Jul 2025 13:56 UTC
5 points
0 comments26 min readLW link

When the Model Starts Talk­ing Like Me: A User-In­duced Struc­tural Adap­ta­tion Case Study

Junxi19 Apr 2025 19:40 UTC
3 points
1 comment4 min readLW link

Toward Safety Cases For AI Scheming

31 Oct 2024 17:20 UTC
60 points
1 comment2 min readLW link

Union­ists vs. Separatists

soycarts12 Sep 2025 15:24 UTC
−10 points
3 comments4 min readLW link

Op­ti­mally Com­bin­ing Probe Mon­i­tors and Black Box Monitors

27 Jul 2025 19:13 UTC
40 points
2 comments6 min readLW link

AlphaDeivam – A Per­sonal Doc­trine for AI Balance

AlphaDeivam5 Apr 2025 17:07 UTC
1 point
0 comments1 min readLW link

Nur­tur­ing AI: An al­ter­na­tive to con­trol-based safety strategies

wertoz77710 Aug 2025 20:30 UTC
1 point
0 comments1 min readLW link
(github.com)

[Question] Which AI Safety tech­niques will be in­effec­tive against diffu­sion mod­els?

Allen Thomas21 May 2025 18:13 UTC
6 points
1 comment1 min readLW link

Schem­ing Toy En­vi­ron­ment: “In­com­pe­tent Client”

Ariel_24 Sep 2025 21:03 UTC
17 points
2 comments32 min readLW link

AI Con­trol Meth­ods Liter­a­ture Review

Ram Potham18 Apr 2025 21:15 UTC
10 points
1 comment9 min readLW link

A Logic-Based Proto-AGI Ar­chi­tec­ture Built on Re­cur­sive Self-Fact-Check­ing

Orectoth25 May 2025 16:14 UTC
1 point
0 comments1 min readLW link

Self-Con­trol of LLM Be­hav­iors by Com­press­ing Suffix Gra­di­ent into Pre­fix Controller

Henry Cai16 Jun 2024 13:01 UTC
7 points
0 comments7 min readLW link
(arxiv.org)

Sym­bio­sis: The An­swer to the AI Quandary

Philip Carter16 Mar 2025 20:18 UTC
1 point
0 comments2 min readLW link

SPINE — 12-Week Live Re­cur­sive AI Gover­nance Case Study

RecursiveAnchor13 Aug 2025 21:11 UTC
1 point
0 comments1 min readLW link

[Question] Are Sparse Au­toen­coders a good idea for AI con­trol?

Gerard Boxo26 Dec 2024 17:34 UTC
3 points
4 comments1 min readLW link
No comments.