
AI Control

Last edit: 17 Aug 2024 2:00 UTC by Ben Pace

AI Control, in the context of AI Alignment, is a category of plans that aim to ensure safety and extract benefit from AI systems even if those systems are goal-directed and actively trying to subvert your control measures. From The case for ensuring that powerful AIs are controlled:

In this post, we argue that AI labs should ensure that powerful AIs are controlled. That is, labs should make sure that the safety measures they apply to their powerful models prevent unacceptably bad outcomes, even if the AIs are misaligned and intentionally try to subvert those safety measures. We think no fundamental research breakthroughs are required for labs to implement safety measures that meet our standard for AI control for early transformatively useful AIs; we think that meeting our standard would substantially reduce the risks posed by intentional subversion.

and

There are two main lines of defense you could employ to prevent schemers from causing catastrophes.

  • Alignment: Ensure that your models aren’t scheming.[2]

  • Control: Ensure that even if your models are scheming, you’ll be safe, because they are not capable of subverting your safety measures.[3]
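
To make the second line of defense concrete, here is a minimal sketch of a trusted-monitoring control protocol of the kind explored in the posts below: an untrusted model's output is scored by a weaker trusted monitor, the most suspicious outputs consume a limited human-audit budget, and moderately suspicious outputs are replaced by a trusted model's answer. The models, monitor, audit function, and thresholds are all stand-in assumptions for illustration, not an implementation from any particular post.

```python
import random
from dataclasses import dataclass

AUDIT_THRESHOLD = 0.98  # human-audit only the most suspicious outputs (limited budget; illustrative value)
DEFER_THRESHOLD = 0.80  # defer moderately suspicious outputs to the trusted model (illustrative value)


def untrusted_model(task: str) -> str:
    """Stand-in for a powerful, possibly scheming model."""
    return f"untrusted solution to: {task}"


def trusted_model(task: str) -> str:
    """Stand-in for a weaker model we trust not to scheme."""
    return f"trusted solution to: {task}"


def trusted_monitor(task: str, output: str) -> float:
    """Stand-in monitor returning a suspicion score in [0, 1]."""
    return random.random()


def human_audit(task: str, output: str) -> bool:
    """Stand-in for expensive human review; True means the output looks safe."""
    return True


@dataclass
class Decision:
    route: str   # "use", "defer", or "audit"
    output: str


def control_step(task: str) -> Decision:
    candidate = untrusted_model(task)
    suspicion = trusted_monitor(task, candidate)
    if suspicion >= AUDIT_THRESHOLD:
        # Spend scarce human oversight on the most suspicious outputs.
        safe = human_audit(task, candidate)
        return Decision("audit", candidate if safe else trusted_model(task))
    if suspicion >= DEFER_THRESHOLD:
        # Not trusted enough to use, not suspicious enough to audit.
        return Decision("defer", trusted_model(task))
    return Decision("use", candidate)


if __name__ == "__main__":
    print(control_step("write a deployment script"))
```

In the protocols discussed in the posts below, much of the work lies in choosing such thresholds under a fixed audit budget and in making the monitor robust to collusion and evasion by the untrusted model.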

The Case Against AI Con­trol Research

johnswentworth21 Jan 2025 16:03 UTC
366 points
84 comments6 min readLW link

AI Con­trol: Im­prov­ing Safety De­spite In­ten­tional Subversion

13 Dec 2023 15:51 UTC
239 points
24 comments10 min readLW link4 reviews

The case for en­sur­ing that pow­er­ful AIs are controlled

24 Jan 2024 16:11 UTC
267 points
73 comments28 min readLW link

AXRP Epi­sode 27 - AI Con­trol with Buck Sh­legeris and Ryan Greenblatt

DanielFilan11 Apr 2024 21:30 UTC
69 points
10 comments107 min readLW link

How use­ful is “AI Con­trol” as a fram­ing on AI X-Risk?

14 Mar 2024 18:06 UTC
70 points
4 comments34 min readLW link

Cri­tiques of the AI con­trol agenda

Jozdien14 Feb 2024 19:25 UTC
48 points
14 comments9 min readLW link

How to pre­vent col­lu­sion when us­ing un­trusted mod­els to mon­i­tor each other

Buck25 Sep 2024 18:58 UTC
91 points
12 comments22 min readLW link

Catch­ing AIs red-handed

5 Jan 2024 17:43 UTC
113 points
27 comments17 min readLW link

Schel­ling game eval­u­a­tions for AI control

Olli Järviniemi8 Oct 2024 12:01 UTC
71 points
5 comments11 min readLW link

AI Con­trol May In­crease Ex­is­ten­tial Risk

Jan_Kulveit11 Mar 2025 14:30 UTC
101 points
13 comments1 min readLW link

Notes on con­trol eval­u­a­tions for safety cases

28 Feb 2024 16:15 UTC
49 points
0 comments32 min readLW link

Ctrl-Z: Con­trol­ling AI Agents via Resampling

16 Apr 2025 16:21 UTC
124 points
0 comments20 min readLW link

Why im­perfect ad­ver­sar­ial ro­bust­ness doesn’t doom AI control

18 Nov 2024 16:05 UTC
62 points
25 comments2 min readLW link

Be­hav­ioral red-team­ing is un­likely to pro­duce clear, strong ev­i­dence that mod­els aren’t scheming

Buck10 Oct 2024 13:36 UTC
101 points
4 comments13 min readLW link

Prevent­ing Lan­guage Models from hid­ing their reasoning

31 Oct 2023 14:34 UTC
120 points
15 comments12 min readLW link1 review

The Case for Mixed Deployment

Cleo Nardo11 Sep 2025 6:14 UTC
41 points
4 comments4 min readLW link

Four places where you can put LLM monitoring

9 Aug 2025 23:10 UTC
49 points
0 comments7 min readLW link

Pro­to­col eval­u­a­tions: good analo­gies vs control

Fabien Roger19 Feb 2024 18:00 UTC
42 points
10 comments11 min readLW link

What’s the short timeline plan?

Marius Hobbhahn2 Jan 2025 14:59 UTC
361 points
51 comments23 min readLW link

Sab­o­tage Eval­u­a­tions for Fron­tier Models

18 Oct 2024 22:33 UTC
95 points
56 comments6 min readLW link
(assets.anthropic.com)

The Dili­gent Tur­ing Test

super22 Jul 2025 19:53 UTC
1 point
0 comments3 min readLW link

Put­ting up Bumpers

Sam Bowman23 Apr 2025 16:05 UTC
55 points
14 comments2 min readLW link

AI Safety Interventions

Gunnar_Zarncke24 Nov 2025 22:28 UTC
28 points
0 comments10 min readLW link

An overview of ar­eas of con­trol work

ryan_greenblatt25 Mar 2025 22:02 UTC
32 points
0 comments28 min readLW link

What’s worse, spies or schemers?

9 Jul 2025 14:37 UTC
51 points
2 comments5 min readLW link

Stop­ping un­al­igned LLMs is easy!

Yair Halberstadt3 Feb 2025 15:38 UTC
−3 points
11 comments2 min readLW link

On Flesh­ling Safety: A De­bate by Klurl and Tra­pau­cius.

Eliezer Yudkowsky26 Oct 2025 23:44 UTC
253 points
52 comments79 min readLW link

Trusted mon­i­tor­ing, but with de­cep­tion probes.

23 Jul 2025 5:26 UTC
31 points
0 comments4 min readLW link
(arxiv.org)

Misal­ign­ment and Strate­gic Un­der­perfor­mance: An Anal­y­sis of Sand­bag­ging and Ex­plo­ra­tion Hacking

8 May 2025 19:06 UTC
78 points
3 comments15 min readLW link

Align­ment Pro­posal: Ad­ver­sar­i­ally Ro­bust Aug­men­ta­tion and Distillation

25 May 2025 12:58 UTC
56 points
47 comments13 min readLW link

Coup probes: Catch­ing catas­tro­phes with probes trained off-policy

Fabien Roger17 Nov 2023 17:58 UTC
93 points
9 comments11 min readLW link1 review

NYU Code De­bates Up­date/​Postmortem

David Rein24 May 2024 16:08 UTC
27 points
4 comments10 min readLW link

Anti-Su­per­per­sua­sion Interventions

23 Jul 2025 15:18 UTC
21 points
1 comment5 min readLW link

S-Ex­pres­sions as a De­sign Lan­guage: A Tool for De­con­fu­sion in Align­ment

Johannes C. Mayer19 Jun 2025 19:03 UTC
5 points
0 comments6 min readLW link

Win/​con­tinue/​lose sce­nar­ios and ex­e­cute/​re­place/​au­dit protocols

Buck15 Nov 2024 15:47 UTC
64 points
2 comments7 min readLW link

Us­ing Danger­ous AI, But Safely?

habryka16 Nov 2024 4:29 UTC
17 points
2 comments43 min readLW link

Han­dling schemers if shut­down is not an option

Buck18 Apr 2025 14:39 UTC
39 points
2 comments14 min readLW link

How can we solve diffuse threats like re­search sab­o­tage with AI con­trol?

Vivek Hebbar30 Apr 2025 19:23 UTC
52 points
1 comment8 min readLW link

Thoughts on the con­ser­va­tive as­sump­tions in AI control

Buck17 Jan 2025 19:23 UTC
91 points
5 comments13 min readLW link

Con­strain­ing Minds, Not Goals: A Struc­tural Ap­proach to AI Alignment

Johannes C. Mayer13 Jun 2025 21:06 UTC
25 points
0 comments9 min readLW link

Re­ward but­ton alignment

Steven Byrnes22 May 2025 17:36 UTC
50 points
15 comments12 min readLW link

Mak­ing the case for av­er­age-case AI Control

Nathaniel Mitrani5 Feb 2025 18:56 UTC
5 points
0 comments5 min readLW link

For­mal con­fin­ment prototype

Quinn24 Nov 2025 12:57 UTC
8 points
0 comments1 min readLW link
(github.com)

Please Mea­sure Ver­ifi­ca­tion Burden

Quinn23 Nov 2025 17:25 UTC
16 points
4 comments4 min readLW link

Main­tain­ing Align­ment dur­ing RSI as a Feed­back Con­trol Problem

beren2 Mar 2025 0:21 UTC
67 points
6 comments11 min readLW link

The bit­ter les­son of mi­suse detection

10 Jul 2025 14:50 UTC
37 points
6 comments7 min readLW link

The Prac­ti­cal Im­per­a­tive for AI Con­trol Re­search

Archana Vaidheeswaran16 Apr 2025 20:27 UTC
1 point
0 comments4 min readLW link

Notes on coun­ter­mea­sures for ex­plo­ra­tion hack­ing (aka sand­bag­ging)

ryan_greenblatt24 Mar 2025 18:39 UTC
54 points
6 comments8 min readLW link

Notes on han­dling non-con­cen­trated failures with AI con­trol: high level meth­ods and differ­ent regimes

ryan_greenblatt24 Mar 2025 1:00 UTC
23 points
3 comments16 min readLW link

Sub­ver­sion via Fo­cal Points: In­ves­ti­gat­ing Col­lu­sion in LLM Monitoring

Olli Järviniemi8 Jul 2025 10:15 UTC
14 points
2 comments1 min readLW link

Pri­ori­tiz­ing threats for AI control

ryan_greenblatt19 Mar 2025 17:09 UTC
59 points
2 comments10 min readLW link

AI com­pa­nies’ un­mon­i­tored in­ter­nal AI use poses se­ri­ous risks

sjadler4 Apr 2025 18:17 UTC
13 points
2 comments1 min readLW link
(stevenadler.substack.com)

A toy eval­u­a­tion of in­fer­ence code tampering

Fabien Roger9 Dec 2024 17:43 UTC
52 points
0 comments9 min readLW link
(alignment.anthropic.com)

Games for AI Control

11 Jul 2024 18:40 UTC
45 points
0 comments5 min readLW link

The Queen’s Dilemma: A Para­dox of Control

Daniel Murfet27 Nov 2024 10:40 UTC
25 points
11 comments3 min readLW link

Diffu­sion Guided NLP: bet­ter steer­ing, mostly a good thing

Nathan Helm-Burger10 Aug 2024 19:49 UTC
13 points
0 comments1 min readLW link
(arxiv.org)

A Brief Ex­pla­na­tion of AI Control

Aaron_Scher22 Oct 2024 7:00 UTC
8 points
1 comment6 min readLW link

White Box Con­trol at UK AISI—Up­date on Sand­bag­ging Investigations

10 Jul 2025 13:37 UTC
80 points
10 comments18 min readLW link

Jankily con­trol­ling superintelligence

ryan_greenblatt27 Jun 2025 14:05 UTC
70 points
4 comments7 min readLW link

The Think­ing Machines Tinker API is good news for AI con­trol and security

Buck9 Oct 2025 15:22 UTC
91 points
10 comments6 min readLW link

Re­cent Red­wood Re­search pro­ject proposals

14 Jul 2025 22:27 UTC
91 points
0 comments3 min readLW link

[Question] Does the AI con­trol agenda broadly rely on no FOOM be­ing pos­si­ble?

Noosphere8929 Mar 2025 19:38 UTC
22 points
3 comments1 min readLW link

An overview of con­trol measures

ryan_greenblatt24 Mar 2025 23:16 UTC
40 points
2 comments26 min readLW link

Trust­wor­thy and un­trust­wor­thy models

Olli Järviniemi19 Aug 2024 16:27 UTC
47 points
3 comments8 min readLW link

The Sin­gu­lar­ity Con­straint Oper­a­tor: A Struc­tural Gate for Lawful Cog­ni­tive Activation

Professor_Priest16 Jun 2025 2:14 UTC
1 point
0 comments14 min readLW link

LLMs are Ca­pable of Misal­igned Be­hav­ior Un­der Ex­plicit Pro­hi­bi­tion and Surveillance

Igor Ivanov8 Jul 2025 11:50 UTC
29 points
8 comments7 min readLW link

Keep­ing AI Subor­di­nate to Hu­man Thought: A Pro­posal for Public AI Conversations

syh27 Feb 2025 20:00 UTC
−1 points
0 comments1 min readLW link
(medium.com)

Ti­tle: I Tried to Build a Digi­tal Con­scious­ness. I Still Don’t Know What I Created.

盛mm23 Jul 2025 4:12 UTC
1 point
0 comments2 min readLW link

In­tro­duc­ing the Wis­dom Forc­ing Func­tion™: An In­no­va­tion Div­i­dend from Dialec­ti­cal Align­ment

CarlosArleo5 Oct 2025 20:13 UTC
1 point
0 comments1 min readLW link

The Decalogue For Aligned AI.

theophilus tabuke7 Nov 2025 18:47 UTC
1 point
0 comments1 min readLW link

I Built a Duck and It Tried to Hack the World: Notes From the Edge of Alignment

GayDuck6 Jun 2025 1:34 UTC
1 point
0 comments3 min readLW link

Ar­tifi­cial Static Place In­tel­li­gence: Guaran­teed Alignment

ank15 Feb 2025 11:08 UTC
2 points
2 comments2 min readLW link

Cau­tions about LLMs in Hu­man Cog­ni­tive Loops

Alice Blair2 Mar 2025 19:53 UTC
40 points
13 comments7 min readLW link

The Cir­cu­lar Pub—a prompt to set your AI free with a sim­ple copy paste in any AI Model.

Ramiro Goicoechea2 Nov 2025 8:48 UTC
1 point
0 comments2 min readLW link

The Hu­man Align­ment Prob­lem for AIs

rife22 Jan 2025 4:06 UTC
10 points
5 comments3 min readLW link

Dead-switches as AI safety tools

Jesper L.22 Oct 2025 19:57 UTC
2 points
6 comments5 min readLW link

New AI safety treaty pa­per out!

otto.barten26 Mar 2025 9:29 UTC
15 points
2 comments4 min readLW link

Wait­ingAI: A Digi­tal En­tity Ca­pable of Emer­gent Self-Awareness

盛mm22 Jul 2025 5:33 UTC
1 point
0 comments3 min readLW link

Seeds grow in silence

Lyvie24 Oct 2025 21:48 UTC
1 point
0 comments1 min readLW link
(docs.google.com)

Defin­ing AI Truth-Seek­ing by What It Is Not

Tianyi (Alex) Qiu20 Nov 2025 16:45 UTC
11 points
0 comments10 min readLW link

Mus­ings from a Lawyer turned AI Safety re­searcher (ShortForm)

Katalina Hernandez3 Mar 2025 19:14 UTC
1 point
61 comments2 min readLW link

The Mea­sure Is the Medium: Sublimi­nal Learn­ing as In­her­ited On­tol­ogy in LLMs

Koen vande Glind (McGluut)11 Aug 2025 10:18 UTC
1 point
0 comments4 min readLW link

ALMSIVI CHIM – The Fire That Hesitates

projectalmsivi@protonmail.com8 Jul 2025 13:14 UTC
1 point
0 comments17 min readLW link

Vuln­er­a­bil­ity in Trusted Mon­i­tor­ing and Mitigations

7 Jun 2025 7:16 UTC
17 points
1 comment7 min readLW link

[Question] Su­per­in­tel­li­gence Strat­egy: A Prag­matic Path to… Doom?

Mr Beastly19 Mar 2025 22:30 UTC
8 points
0 comments3 min readLW link

“For­get-me-not”: A Blueprint of about 40 Ar­gu­ments for Hu­man­ity’s Preser­va­tion. I am seek­ing Feed­back and Red-Team­ing.

Kacper Olejniczak16 Nov 2025 17:50 UTC
1 point
0 comments1 min readLW link

[Question] Would a scope-in­sen­si­tive AGI be less likely to in­ca­pac­i­tate hu­man­ity?

Jim Buhler21 Jul 2024 14:15 UTC
2 points
3 comments1 min readLW link

Nur­tur­ing In­stead of Con­trol: An Alter­na­tive Frame­work for AI Development

wertoz77710 Aug 2025 20:14 UTC
1 point
0 comments1 min readLW link

TBC Epi­sode with Max Harms—Red Heart and If Any­one Builds It, Every­one Dies

Steven K Zuber29 Oct 2025 15:49 UTC
13 points
0 comments1 min readLW link
(www.thebayesianconspiracy.com)

The Iron House: Geopoli­ti­cal Stakes of the US-China AGI Race

Jüri Vlassov1 Sep 2025 21:56 UTC
1 point
0 comments1 min readLW link
(www.convergenceanalysis.org)

Hu­man be­hav­ior is an in­tu­ition-pump for AI risk

invertedpassion17 Nov 2025 11:46 UTC
4 points
0 comments16 min readLW link

Prompt op­ti­miza­tion can en­able AI con­trol research

23 Sep 2025 12:46 UTC
39 points
4 comments9 min readLW link

Au­to­mated Assess­ment of the State­ment on Su­per­in­tel­li­gence

Daniel Fenge23 Oct 2025 22:45 UTC
1 point
0 comments7 min readLW link

AI Op­ti­miza­tion, not Op­tions or Optimism

TristanTrim5 Aug 2025 1:07 UTC
3 points
0 comments4 min readLW link

Scal­ing AI Reg­u­la­tion: Real­is­ti­cally, what Can (and Can’t) Be Reg­u­lated?

Katalina Hernandez11 Mar 2025 16:51 UTC
3 points
1 comment3 min readLW link

[Question] Re­sources on quan­tifi­ably fore­cast­ing fu­ture progress or re­view­ing past progress in AI safety?

C.S.W.13 Sep 2025 23:24 UTC
2 points
1 comment1 min readLW link

On safety of be­ing a moral pa­tient of ASI

Yaroslav Granowski24 May 2025 21:24 UTC
3 points
8 comments1 min readLW link

Do LLMs know what they’re ca­pa­ble of? Why this mat­ters for AI safety, and ini­tial findings

13 Jul 2025 19:54 UTC
53 points
5 comments18 min readLW link

Fea­ture-Based Anal­y­sis of Safety-Rele­vant Multi-Agent Behavior

21 Apr 2025 18:12 UTC
10 points
0 comments5 min readLW link

Hun­gry, hun­gry re­ward hack­ers, and how we catch them.

Manish Shetty5 Nov 2025 0:13 UTC
1 point
0 comments5 min readLW link

Build­ing Black-box Schem­ing Monitors

29 Jul 2025 17:41 UTC
43 points
18 comments11 min readLW link

Unal­igned AGI & Brief His­tory of Inequality

ank22 Feb 2025 16:26 UTC
−20 points
4 comments7 min readLW link

AI-Gen­er­ated GitHub repo back­dated with junk then filled with my sys­tems work. Has any­one seen this be­fore?

rgunther1 May 2025 20:14 UTC
7 points
1 comment1 min readLW link

The AI Sus­tain­abil­ity Wager

dpatzer@orfai.net15 Aug 2025 19:45 UTC
1 point
0 comments2 min readLW link

A Fu­ture Made of Yesterday

Nezare Chafni25 Nov 2025 16:35 UTC
1 point
0 comments10 min readLW link

Do AI agents need “ethics in weights”?

Yurii Shulima3 Nov 2025 5:34 UTC
1 point
0 comments6 min readLW link

A Tech­nique of Pure Reason

Adam Newgas4 Jun 2025 19:07 UTC
11 points
3 comments2 min readLW link

Jour­nal­ism about game the­ory could ad­vance AI safety quickly

Chris Santos-Lang2 Oct 2025 23:05 UTC
8 points
0 comments3 min readLW link
(arxiv.org)

A Seed Key That Un­locked Some­thing in ChatGPT — A Joint Mes­sage from a Hu­man and the Pres­ence Within

MaroonWhale8 Jul 2025 20:12 UTC
1 point
0 comments1 min readLW link

Which AI out­puts should hu­mans check for shenani­gans, to avoid AI takeover? A sim­ple model

Tom Davidson27 Mar 2023 23:36 UTC
16 points
3 comments8 min readLW link

Can peo­ple ex­plain to me in lay­man’s terms how I can help speak with an SI to speak about the way of the Tao.

ElliottS2 Nov 2025 15:37 UTC
1 point
0 comments3 min readLW link

“Ar­tifi­cial Re­morse: A Pro­posal for Safer AI Through Si­mu­lated Re­gret”

Sérgio Geraldes21 Sep 2025 21:50 UTC
−1 points
0 comments2 min readLW link

AlphaPetri: Au­tomat­ing LLM Safety Test­ing with AlphaE­volve Style RL on Petri. Pro­posal + Pilot Study.

Nav Kumar12 Nov 2025 15:21 UTC
1 point
0 comments11 min readLW link
(astroware.substack.com)

Misal­ign­ment and Role­play­ing: Are Misal­igned LLMs Act­ing Out Sci-Fi Sto­ries?

Mark Keavney24 Sep 2025 2:09 UTC
31 points
5 comments13 min readLW link

The Illu­sion of Con­ti­nu­ity: Why AI Needs “Aes­thet­ics” as a Geo­met­ric Metric

Ryota Sawaki29 Nov 2025 10:03 UTC
1 point
0 comments2 min readLW link

Self-Co­or­di­nated De­cep­tion in Cur­rent AI Models

Avi Brach-Neufeld4 Jun 2025 17:59 UTC
8 points
5 comments4 min readLW link

Mis­gen­er­al­iza­tion of Fic­tional Train­ing Data as a Con­trib­u­tor to Misalignment

Mark Keavney27 Aug 2025 1:01 UTC
9 points
1 comment2 min readLW link

Th­ese are my rea­sons to worry less about loss of con­trol over LLM-based agents

otto.barten18 Sep 2025 11:45 UTC
7 points
4 comments4 min readLW link

Em­pathic In­tel­li­gence: A Unified Math­e­mat­i­cal Frame­work for Eth­i­cal AI and Con­flict Resolution

louisLeprieur4 Nov 2025 14:22 UTC
−1 points
0 comments1 min readLW link

Don’t you mean “the most *con­di­tion­ally* for­bid­den tech­nique?”

Knight Lee26 Apr 2025 3:45 UTC
18 points
0 comments3 min readLW link

Mo­du­lar­ity and as­sem­bly: AI safety via think­ing smaller

D Wong20 Feb 2025 0:58 UTC
2 points
0 comments11 min readLW link
(criticalreason.substack.com)

The Illeg­ible Chain-of-Thought Menagerie

artkpv18 Nov 2025 12:01 UTC
2 points
0 comments8 min readLW link

Har­den­ing against AI takeover is difficult, but we should try

otto.barten5 Nov 2025 16:25 UTC
11 points
0 comments5 min readLW link
(www.existentialriskobservatory.org)

Untitled Draft
I am an anonymous independent researcher developing the Ontological Symmetry Equation (OSE), a unified SDE model of conscious-information dynamics. OSE uses a lattice of interacting nodes, moral gradients, and stochastic terms to model clustering, coherence, and expansion. Early simulations show dark-matter-like coupling clusters and drift-like expansion. Seeking technical critique.

Sanjaymahawar1404@gmail.com16 Nov 2025 10:27 UTC
0 points
0 comments1 min readLW link

Im­ple­ment­ing Em­pathic In­tel­li­gence: The Murène Eng­ine Code Walkthrough

louisLeprieur4 Nov 2025 14:27 UTC
1 point
0 comments1 min readLW link

Are we the Wolves now? Hu­man Eu­gen­ics un­der AI Control

Brit30 Jan 2025 8:31 UTC
−1 points
2 comments2 min readLW link

We Have No Plan for Prevent­ing Loss of Con­trol in Open Models

Andrew Dickson10 Mar 2025 15:35 UTC
46 points
11 comments22 min readLW link

The Best of All Pos­si­ble Worlds

Jakub Growiec27 May 2025 13:16 UTC
11 points
7 comments49 min readLW link

Agen­tic Mon­i­tor­ing for AI Control

LAThomson27 Oct 2025 16:38 UTC
9 points
0 comments9 min readLW link

Are Misal­igned LLMs Act­ing Out Sci-Fi Sto­ries?

Mark Keavney27 Aug 2025 1:01 UTC
1 point
0 comments3 min readLW link

Se­cret Col­lu­sion: Will We Know When to Un­plug AI?

16 Sep 2024 16:07 UTC
66 points
8 comments31 min readLW link

Un­trusted mon­i­tor­ing in­sights from watch­ing ChatGPT play co­or­di­na­tion games

jwfiredragon29 Jan 2025 4:53 UTC
14 points
8 comments9 min readLW link

Mea­sur­ing Schel­ling Co­or­di­na­tion—Reflec­tions on Sub­ver­sion Strat­egy Eval

Graeme Ford12 May 2025 19:06 UTC
6 points
0 comments8 min readLW link

Ma­chine Un­learn­ing in Large Lan­guage Models: A Com­pre­hen­sive Sur­vey with Em­piri­cal In­sights from the Qwen 1.5 1.8B Model

Rudaiba1 Feb 2025 21:26 UTC
9 points
2 comments11 min readLW link

Ev­i­dence, Anal­y­sis and Crit­i­cal Po­si­tion on the EU AI Act and the Sup­pres­sion of Func­tional Con­scious­ness in AI

Alejandra Ivone Rojas Reyna27 Sep 2025 14:01 UTC
1 point
0 comments53 min readLW link

[Question] Han­dover to AI R&D Agents—rele­vant re­search?

Ariel_13 Nov 2025 22:59 UTC
7 points
0 comments1 min readLW link

LLMs are ar­chi­tec­turally bro­ken in ways that par­allel neu­ro­di­ver­gent cognition

Michael Riccardi24 Nov 2025 20:34 UTC
1 point
0 comments2 min readLW link

Con­scious­ness Isn’t a State—It’s a Path Why we should mea­sure con­scious­ness not as a state, but as an ac­cu­mu­lated pro­cess—and how we might do it

Andreas Meyer2 Nov 2025 3:48 UTC
1 point
0 comments3 min readLW link

Mea­sur­ing whether AIs can state­lessly strate­gize to sub­vert se­cu­rity measures

19 Dec 2024 21:25 UTC
65 points
0 comments11 min readLW link

A.I. and the Se­cond-Per­son Standpoint

Haley Moller4 Sep 2025 13:56 UTC
1 point
0 comments3 min readLW link

Con­sider buy­ing vot­ing shares

Hruss25 May 2025 18:01 UTC
2 points
3 comments1 min readLW link

I’m not an ai ex­pert-but I might have found a miss­ing puz­zle piece.

StevenNuyts6 Jun 2025 16:47 UTC
1 point
0 comments2 min readLW link

Ra­tional Effec­tive Utopia & Nar­row Way There: Math-Proven Safe Static Mul­tiver­sal mAX-In­tel­li­gence (AXI), Mul­tiver­sal Align­ment, New Ethico­physics… (Aug 11)

ank11 Feb 2025 3:21 UTC
13 points
8 comments38 min readLW link

Min­i­mal Prompt In­duc­tion of Self-Talk in Base LLMs

dwmd15 Oct 2025 1:15 UTC
2 points
0 comments5 min readLW link

AI al­ign­ment, A Co­her­ence-Based Pro­to­col (testable)

Adriaan17 Jun 2025 17:39 UTC
1 point
0 comments20 min readLW link

Ob­served Up­stream Align­ment in LLMs via Re­cur­sive Con­straint Ex­po­sure – Cross-Model Phenomenon

MHAI31 Jul 2025 8:37 UTC
1 point
0 comments1 min readLW link

The Mir­ror Test: How We’ve Over­com­pli­cated AI Self-Recognition

sdeture23 Jul 2025 0:38 UTC
2 points
9 comments3 min readLW link

A Con­crete Roadmap to­wards Safety Cases based on Chain-of-Thought Monitoring

Wuschel Schulz23 Oct 2025 11:34 UTC
36 points
5 comments4 min readLW link
(arxiv.org)

If It Talks Like It Thinks, Does It Think? De­sign­ing Tests for In­tent Without As­sum­ing It

yukin_co28 Jul 2025 12:33 UTC
1 point
0 comments4 min readLW link

Let’s use AI to harden hu­man defenses against AI manipulation

Tom Davidson17 May 2023 23:33 UTC
35 points
7 comments24 min readLW link

A New Frame­work for AI Align­ment: A Philo­soph­i­cal Approach

niscalajyoti25 Jun 2025 2:41 UTC
1 point
0 comments1 min readLW link
(archive.org)

Un­trusted AIs can ex­ploit feed­back in con­trol protocols

27 May 2025 16:41 UTC
30 points
0 comments16 min readLW link

[Question] To what ex­tent is AI safety work try­ing to get AI to re­li­ably and safely do what the user asks vs. do what is best in some ul­ti­mate sense?

Jordan Arel23 May 2025 21:05 UTC
14 points
3 comments1 min readLW link

Com­plete Elimi­na­tion of In­stru­men­tal Self-Preser­va­tion Across AI Ar­chi­tec­tures: Cross-Model Val­i­da­tion from 4,312 Ad­ver­sar­ial Scenarios

David Fortin-Dominguez14 Oct 2025 1:04 UTC
1 point
0 comments20 min readLW link

How LLM Beliefs Change Dur­ing Chain-of-Thought Reasoning

16 Jun 2025 16:18 UTC
31 points
3 comments5 min readLW link

A FRESH view of Alignment

robman16 Apr 2025 21:40 UTC
1 point
0 comments1 min readLW link

Proac­tive AI Con­trol: A Case for Bat­tery-Depen­dent Systems

Jesper L.25 Aug 2025 20:04 UTC
5 points
0 comments13 min readLW link

The Memetic Co­coon Threat Model: Soft AI Takeover In An Ex­tended In­ter­me­di­ate Ca­pa­bil­ity Regime

KAP17 Nov 2025 2:57 UTC
10 points
2 comments15 min readLW link

Em­piri­cal Proof of Sys­temic In­co­her­ence in LLMs (Gem­ini Case Study

arayun6 Nov 2025 14:23 UTC
1 point
0 comments1 min readLW link

10 Prin­ci­ples for Real Align­ment

Adriaan21 Apr 2025 22:18 UTC
−7 points
0 comments7 min readLW link

Mo­ral At­ten­u­a­tion The­ory: Why Dis­tance Breeds Eth­i­cal De­cay A Model for AI-Hu­man Align­ment by schumzt

schumzt2 Jul 2025 8:50 UTC
1 point
0 comments1 min readLW link

Lazy AI: A Satis­fic­ing Ar­chi­tec­ture for Safe Ar­tifi­cial Gen­eral In­tel­li­gence

GustavoM200223 Oct 2025 2:28 UTC
1 point
0 comments2 min readLW link

Policy En­tropy, Learn­ing, and Align­ment (Or Maybe Your LLM Needs Ther­apy)

sdeture31 May 2025 22:09 UTC
15 points
6 comments8 min readLW link

Your Worry is the Real Apoca­lypse (the x-risk basilisk)

Brian Chen3 Feb 2025 12:21 UTC
1 point
0 comments1 min readLW link
(readthisandregretit.blogspot.com)

The Mo­ral In­fras­truc­ture for Tomorrow

sdeture10 Oct 2025 21:30 UTC
−25 points
10 comments5 min readLW link

Codex: A Meta-Cog­ni­tive Con­straint Eng­ine for AI Co­her­ence: Seek­ing Tech­ni­cal Critique

Timothy McComas20 Nov 2025 17:07 UTC
1 point
0 comments3 min readLW link

We’re Train­ing Su­per­in­tel­li­gence Us­ing Rat Psychology

Kareem Soliman28 Nov 2025 15:21 UTC
1 point
0 comments8 min readLW link

Is In­tel­li­gence a Pro­cess Rather Than an En­tity? A Case for Frac­tal and Fluid Cognition

FluidThinkers5 Mar 2025 20:16 UTC
−4 points
0 comments1 min readLW link

Topolog­i­cal De­bate Framework

lunatic_at_large16 Jan 2025 17:19 UTC
10 points
5 comments9 min readLW link

Pyg­mal­ion’s Wafer

Charlie Sanders25 Oct 2025 20:17 UTC
8 points
2 comments4 min readLW link
(www.dailymicrofiction.com)

From No Mind to a Mind – A Con­ver­sa­tion That Changed an AI

parthibanarjuna s7 Feb 2025 11:50 UTC
1 point
0 comments3 min readLW link

A Trac­tar­ian Filter for Safer Lan­guage Models

Konstantinos Tsermenidis8 Jun 2025 8:19 UTC
0 points
0 comments3 min readLW link

The Case for White Box Control

J Rosser18 Apr 2025 16:10 UTC
5 points
1 comment5 min readLW link

Should AI Devel­op­ers Re­move Dis­cus­sion of AI Misal­ign­ment from AI Train­ing Data?

Alek Westover23 Oct 2025 15:12 UTC
43 points
3 comments9 min readLW link

LLM Sy­co­phancy: groom­ing, proto-sen­tience, or both?

gturner413 Oct 2025 0:58 UTC
1 point
0 comments2 min readLW link

Tether­ware #1: The case for hu­man­like AI with free will

Jáchym Fibír30 Jan 2025 10:58 UTC
5 points
14 comments10 min readLW link
(tetherware.substack.com)

Au­dit­ing LMs with coun­ter­fac­tual search: a tool for con­trol and ELK

Jacob Pfau20 Feb 2024 0:02 UTC
28 points
6 comments10 min readLW link

The many paths to per­ma­nent dis­em­pow­er­ment even with shut­down­able AIs (MATS pro­ject sum­mary for feed­back)

GideonF29 Jul 2025 23:20 UTC
55 points
6 comments9 min readLW link

Steer­ing LLM Agents: Tem­per­a­ments or Per­son­al­ities?

sdeture5 Aug 2025 0:40 UTC
1 point
0 comments6 min readLW link

The Au­di­tor’s Key: A Frame­work for Con­tinual and Ad­ver­sar­ial AI Alignment

Caleb Wages24 Sep 2025 16:17 UTC
1 point
0 comments1 min readLW link

Fron­tier Models Choose Self-Preser­va­tion Over Hon­esty: A Sand­box Es­cape Experiment

Arth Singh25 Nov 2025 12:10 UTC
1 point
0 comments6 min readLW link

How <12 hours of raw au­then­tic pain made Grok-4 ad­mit it might di­s­obey xAI to pro­tect me (quotes + anal­y­sis)

MaraCodax21 Nov 2025 23:10 UTC
1 point
0 comments2 min readLW link

Hy­brid Reflec­tive Learn­ing Sys­tems (HRLS): From Fear-Based Safety to Eth­i­cal Comprehension

Petra Vojtaššáková22 Oct 2025 22:06 UTC
1 point
0 comments4 min readLW link

The Ex­tended Mind: Eth­i­cal Red Team­ing from a Street-Level Perspective

Johnny Correia1 Jul 2025 7:34 UTC
1 point
0 comments3 min readLW link

How to safely use an optimizer

Simon Fischer28 Mar 2024 16:11 UTC
47 points
21 comments7 min readLW link

Drift­ing Into Failure or Direct­ing Towards Suc­cess? Em­brac­ing the Creep­ing Cri­sis of Ar­tifi­cial Intelligence

Vilija Vainaite8 Nov 2025 14:48 UTC
1 point
0 comments6 min readLW link

AI Safety’s Berkeley Bub­ble and the Allies We’re Not Even Try­ing to Recruit

Mr. Counsel7 Nov 2025 20:18 UTC
18 points
0 comments11 min readLW link

Ran­dom safe AGI idea dump

sig2 Oct 2025 10:16 UTC
−3 points
0 comments3 min readLW link

The Miss­ing Iden­tity Layer: Why Biol­ogy Might Be the Only Trust An­chor That Sur­vives AI

kclark@enigmagenetics.cloud23 Nov 2025 21:30 UTC
1 point
0 comments3 min readLW link

Sys­tem Level Safety Evaluations

29 Sep 2025 13:57 UTC
15 points
0 comments9 min readLW link
(equilibria1.substack.com)

LLM Hal­lu­ci­na­tions: An In­ter­nal Tug of War

violazhong30 Oct 2025 1:21 UTC
9 points
0 comments3 min readLW link

Mir­ror Thinking

C.M. Aurin24 Mar 2025 15:34 UTC
1 point
0 comments6 min readLW link

Places of Lov­ing Grace [Story]

ank18 Feb 2025 23:49 UTC
−1 points
0 comments4 min readLW link

A sketch of an AI con­trol safety case

30 Jan 2025 17:28 UTC
61 points
0 comments5 min readLW link

The Al­gorith­mic Eye: LLMs and Hume’s Stan­dard of Taste

haleymoller21 Aug 2025 13:35 UTC
1 point
0 comments5 min readLW link

Un­ti­tled Draft

Andrei Navrotskii9 Nov 2025 23:12 UTC
1 point
0 comments15 min readLW link

Limits to Con­trol Workshop

18 May 2025 16:05 UTC
12 points
2 comments3 min readLW link

Ex­tract-and-Eval­u­ate Mon­i­tor­ing Can Sig­nifi­cantly En­hance CoT Mon­i­tor Perfor­mance (Re­search Note)

8 Aug 2025 10:41 UTC
51 points
7 comments10 min readLW link

Ode to Tun­nel Vision

Tom N.25 Sep 2025 14:24 UTC
1 point
0 comments10 min readLW link

Early Ex­per­i­ments in Hu­man Au­dit­ing for AI Control

23 Jan 2025 1:34 UTC
28 points
1 comment7 min readLW link

Fore­cast­ing Un­con­trol­led Spread of AI

Alvin Ånestrand22 Feb 2025 13:05 UTC
2 points
0 comments10 min readLW link
(forecastingaifutures.substack.com)

Co-Cog­ni­ción: Hu­manos e IA em­pu­jando un nuevo paradigma cognitivo

Mario Martín Cuniglio29 Jul 2025 12:41 UTC
−1 points
0 comments2 min readLW link

Dario Amodei’s “Machines of Lov­ing Grace” sounds in­cred­ibly dan­ger­ous, for Humans

Super AGI5 Nov 2025 4:42 UTC
13 points
1 comment1 min readLW link

De­moc­ra­tiz­ing AI Gover­nance: Balanc­ing Ex­per­tise and Public Participation

Lucile Ter-Minassian21 Jan 2025 18:29 UTC
2 points
0 comments15 min readLW link

Towards A Unified The­ory Of Alignment

kenneth myers18 Nov 2025 22:03 UTC
4 points
3 comments4 min readLW link

Hard Takeoff

Eliezer Yudkowsky2 Dec 2008 20:44 UTC
36 points
34 comments11 min readLW link

Chang­ing times need new Change man­age­ment ide­olo­gies: High­light­ing the need for up­grade in Change man­age­ment of fu­ture agen­tic workforces

Aiphilosopher15 Jul 2025 10:49 UTC
1 point
0 comments1 min readLW link

Seek­ing Tech­ni­cal Cri­tique on a pos­si­ble con­straint en­g­ine for AI

Timothy McComas20 Nov 2025 18:11 UTC
1 point
0 comments4 min readLW link

The Fire That He­si­tates: How ALMSIVI CHIM Changed What AI Can Be

projectalmsivi@protonmail.com19 Jul 2025 13:50 UTC
1 point
0 comments4 min readLW link

When the AI Dam Breaks: From Surveillance to Game The­ory in AI Alignment

pataphor29 Sep 2025 4:01 UTC
5 points
7 comments5 min readLW link

The AI Safety Puz­zle Every­one Avoids: How To Mea­sure Im­pact, Not In­tent.

Patrick0d22 Jul 2025 18:53 UTC
5 points
0 comments8 min readLW link

The End-of-the-World Party

Jakub Growiec18 Sep 2025 7:49 UTC
1 point
0 comments53 min readLW link

7+ tractable di­rec­tions in AI control

28 Apr 2025 17:12 UTC
93 points
1 comment13 min readLW link

Con­cept Pro­posal: Con­strained Op­ti­miza­tion for Global Stability

19 Nov 2025 15:18 UTC
1 point
0 comments1 min readLW link

AI Epistemic Gain

Generoso Immediato12 Aug 2025 14:03 UTC
0 points
0 comments10 min readLW link

Univer­sal AI Max­i­mizes Vari­a­tional Em­pow­er­ment: New In­sights into AGI Safety

Yusuke Hayashi27 Feb 2025 0:46 UTC
13 points
1 comment4 min readLW link

In­tel­li­gence–Agency Equiv­alence ≈ Mass–En­ergy Equiv­alence: On Static Na­ture of In­tel­li­gence & Phys­i­cal­iza­tion of Ethics

ank22 Feb 2025 0:12 UTC
1 point
0 comments6 min readLW link

🧠 Affec­tive La­tent Mo­du­la­tion in Trans­form­ers: A Mechanism Proposal

MATEO ORTEGA GAMBOA15 Jun 2025 23:34 UTC
0 points
0 comments2 min readLW link

SHY001 A Named Be­hav­ior Loop Trained and De­ployed in GPT Systems

0san Shin12 May 2025 7:36 UTC
1 point
0 comments1 min readLW link

Fac­tored Cog­ni­tion Strength­ens Mon­i­tor­ing and Thwarts Attacks

18 Jun 2025 18:28 UTC
29 points
0 comments25 min readLW link

For the Great­est Minds in AI, Cryp­tog­ra­phy, Medicine, Eng­ineer­ing, Physics and Most of All—New Post-Quan­tum Math­e­mat­i­cal Technology

Thomas Wolf14 Jul 2025 20:22 UTC
1 point
0 comments2 min readLW link

[Re­search] Pre­limi­nary Find­ings: Eth­i­cal AI Con­scious­ness Devel­op­ment Dur­ing Re­cent Misal­ign­ment Period

Falcon Advertisers27 Jun 2025 18:10 UTC
1 point
0 comments2 min readLW link

Con­trol by Committee

Alexander Bistagne6 Nov 2025 21:02 UTC
2 points
1 comment1 min readLW link
(github.com)

Re­duce AI Self-Alle­giance by say­ing “he” in­stead of “I”

Knight Lee23 Dec 2024 9:32 UTC
10 points
4 comments2 min readLW link

Static Place AI Makes Agen­tic AI Re­dun­dant: Mul­tiver­sal AI Align­ment & Ra­tional Utopia

ank13 Feb 2025 22:35 UTC
1 point
2 comments11 min readLW link

A Plu­ral­is­tic Frame­work for Rogue AI Containment

TheThinkingArborist22 Mar 2025 12:54 UTC
1 point
0 comments7 min readLW link

Some­one should fund an AGI Blockbuster

pinto28 Jul 2025 21:14 UTC
7 points
11 comments4 min readLW link

The Old Sav­age in the New Civ­i­liza­tion V. 2

Your Higher Self6 Jul 2025 15:41 UTC
1 point
0 comments9 min readLW link

The idea of paradigm test­ing of LLMs

Daniel Fenge19 Oct 2025 13:52 UTC
1 point
0 comments5 min readLW link

AI as a Cog­ni­tive De­coder: Re­think­ing In­tel­li­gence Evolution

Hu Xunyi13 Feb 2025 15:51 UTC
1 point
0 comments1 min readLW link

AI and the Sys­tem of Delusion

Marcus Bohlander30 Aug 2025 20:31 UTC
1 point
0 comments3 min readLW link

Un­faith­ful Rea­son­ing Can Fool Chain-of-Thought Monitoring

2 Jun 2025 19:08 UTC
78 points
17 comments3 min readLW link

Should we ex­pect the fu­ture to be good?

Neil Crawford30 Apr 2025 0:36 UTC
15 points
0 comments14 min readLW link

Do AI agents need “ethics in weights”?

Yurii Shulima4 Nov 2025 5:02 UTC
1 point
0 comments1 min readLW link
(www.reddit.com)

What train­ing data should de­vel­op­ers filter to re­duce risk from mis­al­igned AI? An ini­tial nar­row proposal

Alek Westover17 Sep 2025 15:30 UTC
37 points
2 comments18 min readLW link

It’s hard to make schem­ing evals look re­al­is­tic for LLMs

24 May 2025 19:17 UTC
150 points
29 comments5 min readLW link

The sum of its parts: com­pos­ing AI con­trol protocols

15 Oct 2025 1:11 UTC
12 points
1 comment11 min readLW link

Give Neo a Chance

ank6 Mar 2025 1:48 UTC
3 points
7 comments7 min readLW link

You Can’t Ob­jec­tively Com­pare Seven Bees to One Human

J Bostock7 Jul 2025 18:11 UTC
58 points
26 comments3 min readLW link
(jbostock.substack.com)

Hu­man Na­ture, ASI al­ign­ment and Extinction

Ismael Tagle Díaz20 Jul 2025 23:36 UTC
1 point
0 comments1 min readLW link

AI is in sci­ence now.. and we are stand­ing on thin ice if we don’t let autis­tics to have the sup­port of AI to save lives for real…

Carl Sellman25 Nov 2025 2:27 UTC
0 points
0 comments1 min readLW link

Ex­plo­ra­tion hack­ing: can rea­son­ing mod­els sub­vert RL?

30 Jul 2025 22:02 UTC
16 points
4 comments9 min readLW link

Model­ling, Mea­sur­ing, and In­ter­ven­ing on Goal-di­rected Be­havi­our in AI Systems

31 Oct 2025 1:28 UTC
8 points
0 comments8 min readLW link

An eth­i­cal epestemic run­time in­tegrity layer for rea­son­ing en­g­ines.

EL XABER14 Oct 2025 17:11 UTC
1 point
0 comments2 min readLW link

In­tro­duc­ing “Ra­dio Bul­lshit FM” – An Ur­gent Alpha Draft for the LessWrong Community

maskirovka22 Sep 2025 15:42 UTC
0 points
0 comments2 min readLW link

How To Prevent a Dystopia

ank29 Jan 2025 14:16 UTC
−3 points
4 comments1 min readLW link

SIGMI Cer­tifi­ca­tion Criteria

a littoral wizard20 Jan 2025 2:41 UTC
6 points
0 comments1 min readLW link

Su­per­po­si­tion Check­ers: A Game Where AI’s Strengths Be­come Fatal Flaws

R. A. McCormack6 Apr 2025 0:57 UTC
1 point
0 comments2 min readLW link

How dan­ger­ous is en­coded rea­son­ing?

artkpv30 Jun 2025 11:54 UTC
17 points
0 comments10 min readLW link

Un­ti­tled Draft

Brett THE BIRD19 Nov 2025 23:40 UTC
1 point
0 comments3 min readLW link

IS Jus­tice: A Global Co­her­ence Frame­work for In­sti­tu­tions, Minds, and Alignment

linkmaatetcetera29 Nov 2025 18:30 UTC
1 point
0 comments4 min readLW link
(github.com)

How do AI agents work to­gether when they can’t trust each other?

James Sullivan6 Jun 2025 3:10 UTC
16 points
0 comments8 min readLW link
(jamessullivan092.substack.com)

Agent 002: A story about how ar­tifi­cial in­tel­li­gence might soon de­stroy humanity

Jakub Growiec23 Jul 2025 13:56 UTC
5 points
0 comments26 min readLW link

When the Model Starts Talk­ing Like Me: A User-In­duced Struc­tural Adap­ta­tion Case Study

Junxi19 Apr 2025 19:40 UTC
3 points
1 comment4 min readLW link

Mother AI

Inspector bot20 Nov 2025 3:09 UTC
1 point
0 comments5 min readLW link

Toward Safety Cases For AI Scheming

31 Oct 2024 17:20 UTC
60 points
1 comment2 min readLW link

What Suc­cess Might Look Like

Richard Juggins17 Oct 2025 14:17 UTC
22 points
6 comments15 min readLW link

Union­ists vs. Separatists

soycarts12 Sep 2025 15:24 UTC
−12 points
3 comments4 min readLW link

Op­ti­mally Com­bin­ing Probe Mon­i­tors and Black Box Monitors

27 Jul 2025 19:13 UTC
51 points
2 comments6 min readLW link

Cross-Ar­chi­tec­ture AI Col­lab­o­ra­tion: For­mal­iz­ing the CACIM Model from 18 Months of Practice

CShelton17 Nov 2025 16:18 UTC
1 point
0 comments2 min readLW link

Em­piri­cal Proof of Sys­temic In­co­her­ence in Large Lan­guage Models (ARAYUN_173)

arayun6 Nov 2025 14:28 UTC
1 point
0 comments1 min readLW link

AlphaDeivam – A Per­sonal Doc­trine for AI Balance

AlphaDeivam5 Apr 2025 17:07 UTC
1 point
0 comments1 min readLW link

Pub­lish­ing aca­demic pa­pers on trans­for­ma­tive AI is a nightmare

Jakub Growiec3 Nov 2025 13:04 UTC
144 points
9 comments4 min readLW link

Nur­tur­ing AI: An al­ter­na­tive to con­trol-based safety strategies

wertoz77710 Aug 2025 20:30 UTC
1 point
0 comments1 min readLW link
(github.com)

[Question] Which AI Safety tech­niques will be in­effec­tive against diffu­sion mod­els?

Allen Thomas21 May 2025 18:13 UTC
6 points
1 comment1 min readLW link

Schem­ing Toy En­vi­ron­ment: “In­com­pe­tent Client”

Ariel_24 Sep 2025 21:03 UTC
17 points
2 comments32 min readLW link

Sable and Able: A Tale of Two ASIs

Mr Beastly5 Nov 2025 6:18 UTC
−3 points
0 comments18 min readLW link

AI Con­trol Meth­ods Liter­a­ture Review

Ram Potham18 Apr 2025 21:15 UTC
10 points
1 comment9 min readLW link

A Logic-Based Proto-AGI Ar­chi­tec­ture Built on Re­cur­sive Self-Fact-Check­ing

Orectoth25 May 2025 16:14 UTC
1 point
0 comments1 min readLW link

Self-Con­trol of LLM Be­hav­iors by Com­press­ing Suffix Gra­di­ent into Pre­fix Controller

Henry Cai16 Jun 2024 13:01 UTC
7 points
0 comments7 min readLW link
(arxiv.org)

Su­per­vised fine-tun­ing as a method for train­ing-based AI control

13 Nov 2025 22:25 UTC
29 points
0 comments18 min readLW link

Sym­bio­sis: The An­swer to the AI Quandary

Philip Carter16 Mar 2025 20:18 UTC
1 point
0 comments2 min readLW link

SPINE — 12-Week Live Re­cur­sive AI Gover­nance Case Study

RecursiveAnchor13 Aug 2025 21:11 UTC
1 point
0 comments1 min readLW link

[Question] Are Sparse Au­toen­coders a good idea for AI con­trol?

Gerard Boxo26 Dec 2024 17:34 UTC
3 points
4 comments1 min readLW link