
AI Control

Last edit: 17 Aug 2024 2:00 UTC by Ben Pace

AI Control, in the context of AI Alignment, is a category of plans that aim to ensure safety and derive benefit from AI systems even if those systems are goal-directed and actively trying to subvert your control measures. From The case for ensuring that powerful AIs are controlled:

In this post, we argue that AI labs should ensure that powerful AIs are controlled. That is, labs should make sure that the safety measures they apply to their powerful models prevent unacceptably bad outcomes, even if the AIs are misaligned and intentionally try to subvert those safety measures. We think no fundamental research breakthroughs are required for labs to implement safety measures that meet our standard for AI control for early transformatively useful AIs; we think that meeting our standard would substantially reduce the risks posed by intentional subversion.

and

There are two main lines of defense you could employ to prevent schemers from causing catastrophes.

  • Alignment: Ensure that your models aren’t scheming.[2]

  • Control: Ensure that even if your models are scheming, you’ll be safe, because they are not capable of subverting your safety measures.[3]
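
For concreteness, below is a minimal, hypothetical sketch (in Python) of one canonical control protocol discussed in the posts on this page: trusted monitoring with auditing. An untrusted model does the work, a weaker trusted model assigns each output a suspicion score, and the most suspicious outputs are escalated to a limited human-audit budget or replaced with trusted-model output. The function names, threshold, and components here are illustrative assumptions, not an implementation taken from any of the posts below.

```python
import random

# Illustrative stand-ins (assumptions, not real APIs): in practice these would
# call an untrusted frontier model, a weaker trusted model, and human auditors.
def untrusted_model(task: str) -> str:
    return f"untrusted solution to: {task}"      # possibly sabotaged

def trusted_monitor(task: str, output: str) -> float:
    return random.random()                       # suspicion score in [0, 1]

def trusted_model(task: str) -> str:
    return f"conservative solution to: {task}"   # weaker, but not scheming

def human_audit(task: str, output: str) -> bool:
    return True                                  # True = auditors judge it safe

AUDIT_THRESHOLD = 0.95  # escalate only the most suspicious ~5% of outputs

def run_with_control(task: str) -> str:
    """Trusted monitoring: accept untrusted work by default, but audit
    suspicious outputs and fall back to the trusted model if they fail."""
    output = untrusted_model(task)
    if trusted_monitor(task, output) >= AUDIT_THRESHOLD:
        if not human_audit(task, output):
            return trusted_model(task)           # discard the flagged output
    return output

if __name__ == "__main__":
    print(run_with_control("write a deployment script"))
```

The point of such a protocol is that its safety argument does not depend on the untrusted model being aligned, only on the model being unable to reliably slip harmful outputs past the monitor and the audit budget.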

The Case Against AI Con­trol Research

johnswentworth21 Jan 2025 16:03 UTC
358 points
84 comments6 min readLW link

AI Con­trol: Im­prov­ing Safety De­spite In­ten­tional Subversion

13 Dec 2023 15:51 UTC
239 points
24 comments10 min readLW link4 reviews

The case for en­sur­ing that pow­er­ful AIs are controlled

24 Jan 2024 16:11 UTC
266 points
73 comments28 min readLW link

AXRP Epi­sode 27 - AI Con­trol with Buck Sh­legeris and Ryan Greenblatt

DanielFilan11 Apr 2024 21:30 UTC
69 points
10 comments107 min readLW link

How use­ful is “AI Con­trol” as a fram­ing on AI X-Risk?

14 Mar 2024 18:06 UTC
70 points
4 comments34 min readLW link

Cri­tiques of the AI con­trol agenda

Jozdien14 Feb 2024 19:25 UTC
48 points
14 comments9 min readLW link

How to pre­vent col­lu­sion when us­ing un­trusted mod­els to mon­i­tor each other

Buck25 Sep 2024 18:58 UTC
91 points
12 comments22 min readLW link

Catch­ing AIs red-handed

5 Jan 2024 17:43 UTC
113 points
27 comments17 min readLW link

Schel­ling game eval­u­a­tions for AI control

Olli Järviniemi8 Oct 2024 12:01 UTC
71 points
5 comments11 min readLW link

AI Con­trol May In­crease Ex­is­ten­tial Risk

Jan_Kulveit11 Mar 2025 14:30 UTC
101 points
13 comments1 min readLW link

Notes on con­trol eval­u­a­tions for safety cases

28 Feb 2024 16:15 UTC
49 points
0 comments32 min readLW link

Ctrl-Z: Con­trol­ling AI Agents via Resampling

16 Apr 2025 16:21 UTC
124 points
0 comments20 min readLW link

Why im­perfect ad­ver­sar­ial ro­bust­ness doesn’t doom AI control

18 Nov 2024 16:05 UTC
62 points
25 comments2 min readLW link

Be­hav­ioral red-team­ing is un­likely to pro­duce clear, strong ev­i­dence that mod­els aren’t scheming

Buck10 Oct 2024 13:36 UTC
101 points
4 comments13 min readLW link

Prevent­ing Lan­guage Models from hid­ing their reasoning

31 Oct 2023 14:34 UTC
119 points
15 comments12 min readLW link1 review

The Case for Mixed Deployment

Cleo Nardo11 Sep 2025 6:14 UTC
34 points
4 comments4 min readLW link

Four places where you can put LLM monitoring

9 Aug 2025 23:10 UTC
48 points
0 comments7 min readLW link

Pro­to­col eval­u­a­tions: good analo­gies vs control

Fabien Roger19 Feb 2024 18:00 UTC
42 points
10 comments11 min readLW link

What’s the short timeline plan?

Marius Hobbhahn2 Jan 2025 14:59 UTC
359 points
49 comments23 min readLW link

Sab­o­tage Eval­u­a­tions for Fron­tier Models

18 Oct 2024 22:33 UTC
95 points
56 comments6 min readLW link
(assets.anthropic.com)

The Dili­gent Tur­ing Test

super22 Jul 2025 19:53 UTC
1 point
0 comments3 min readLW link

Put­ting up Bumpers

Sam Bowman23 Apr 2025 16:05 UTC
54 points
14 comments2 min readLW link

An overview of ar­eas of con­trol work

ryan_greenblatt25 Mar 2025 22:02 UTC
32 points
0 comments28 min readLW link

What’s worse, spies or schemers?

9 Jul 2025 14:37 UTC
51 points
2 comments5 min readLW link

Stop­ping un­al­igned LLMs is easy!

Yair Halberstadt3 Feb 2025 15:38 UTC
−3 points
11 comments2 min readLW link

Trusted mon­i­tor­ing, but with de­cep­tion probes.

23 Jul 2025 5:26 UTC
31 points
0 comments4 min readLW link
(arxiv.org)

Misal­ign­ment and Strate­gic Un­der­perfor­mance: An Anal­y­sis of Sand­bag­ging and Ex­plo­ra­tion Hacking

8 May 2025 19:06 UTC
77 points
3 comments15 min readLW link

Align­ment Pro­posal: Ad­ver­sar­i­ally Ro­bust Aug­men­ta­tion and Distillation

25 May 2025 12:58 UTC
56 points
47 comments13 min readLW link

Coup probes: Catch­ing catas­tro­phes with probes trained off-policy

Fabien Roger17 Nov 2023 17:58 UTC
93 points
9 comments11 min readLW link1 review

NYU Code De­bates Up­date/​Postmortem

David Rein24 May 2024 16:08 UTC
27 points
4 comments10 min readLW link

Anti-Su­per­per­sua­sion Interventions

23 Jul 2025 15:18 UTC
21 points
1 comment5 min readLW link

S-Ex­pres­sions as a De­sign Lan­guage: A Tool for De­con­fu­sion in Align­ment

Johannes C. Mayer19 Jun 2025 19:03 UTC
5 points
0 comments6 min readLW link

Win/​con­tinue/​lose sce­nar­ios and ex­e­cute/​re­place/​au­dit protocols

Buck15 Nov 2024 15:47 UTC
64 points
2 comments7 min readLW link

Us­ing Danger­ous AI, But Safely?

habryka16 Nov 2024 4:29 UTC
17 points
2 comments43 min readLW link

Han­dling schemers if shut­down is not an option

Buck18 Apr 2025 14:39 UTC
39 points
2 comments14 min readLW link

How can we solve diffuse threats like re­search sab­o­tage with AI con­trol?

Vivek Hebbar30 Apr 2025 19:23 UTC
52 points
1 comment8 min readLW link

Thoughts on the con­ser­va­tive as­sump­tions in AI control

Buck17 Jan 2025 19:23 UTC
91 points
5 comments13 min readLW link

Con­strain­ing Minds, Not Goals: A Struc­tural Ap­proach to AI Alignment

Johannes C. Mayer13 Jun 2025 21:06 UTC
25 points
0 comments9 min readLW link

Re­ward but­ton alignment

Steven Byrnes22 May 2025 17:36 UTC
50 points
15 comments12 min readLW link

Mak­ing the case for av­er­age-case AI Control

Nathaniel Mitrani5 Feb 2025 18:56 UTC
4 points
0 comments5 min readLW link

Main­tain­ing Align­ment dur­ing RSI as a Feed­back Con­trol Problem

beren2 Mar 2025 0:21 UTC
67 points
6 comments11 min readLW link

The bit­ter les­son of mi­suse detection

10 Jul 2025 14:50 UTC
37 points
6 comments7 min readLW link

The Prac­ti­cal Im­per­a­tive for AI Con­trol Re­search

Archana Vaidheeswaran16 Apr 2025 20:27 UTC
1 point
0 comments4 min readLW link

Notes on coun­ter­mea­sures for ex­plo­ra­tion hack­ing (aka sand­bag­ging)

ryan_greenblatt24 Mar 2025 18:39 UTC
54 points
6 comments8 min readLW link

Notes on han­dling non-con­cen­trated failures with AI con­trol: high level meth­ods and differ­ent regimes

ryan_greenblatt24 Mar 2025 1:00 UTC
23 points
3 comments16 min readLW link

Sub­ver­sion via Fo­cal Points: In­ves­ti­gat­ing Col­lu­sion in LLM Monitoring

Olli Järviniemi8 Jul 2025 10:15 UTC
14 points
2 comments1 min readLW link

Pri­ori­tiz­ing threats for AI control

ryan_greenblatt19 Mar 2025 17:09 UTC
59 points
2 comments10 min readLW link

AI com­pa­nies’ un­mon­i­tored in­ter­nal AI use poses se­ri­ous risks

sjadler4 Apr 2025 18:17 UTC
13 points
2 comments1 min readLW link
(stevenadler.substack.com)

A toy eval­u­a­tion of in­fer­ence code tampering

Fabien Roger9 Dec 2024 17:43 UTC
52 points
0 comments9 min readLW link
(alignment.anthropic.com)

Games for AI Control

11 Jul 2024 18:40 UTC
45 points
0 comments5 min readLW link

The Queen’s Dilemma: A Para­dox of Control

Daniel Murfet27 Nov 2024 10:40 UTC
25 points
11 comments3 min readLW link

Diffu­sion Guided NLP: bet­ter steer­ing, mostly a good thing

Nathan Helm-Burger10 Aug 2024 19:49 UTC
13 points
0 comments1 min readLW link
(arxiv.org)

A Brief Ex­pla­na­tion of AI Control

Aaron_Scher22 Oct 2024 7:00 UTC
8 points
1 comment6 min readLW link

White Box Con­trol at UK AISI—Up­date on Sand­bag­ging Investigations

10 Jul 2025 13:37 UTC
78 points
10 comments18 min readLW link

Jankily con­trol­ling superintelligence

ryan_greenblatt27 Jun 2025 14:05 UTC
69 points
4 comments7 min readLW link

The Think­ing Machines Tinker API is good news for AI con­trol and security

Buck9 Oct 2025 15:22 UTC
91 points
10 comments6 min readLW link

Re­cent Red­wood Re­search pro­ject proposals

14 Jul 2025 22:27 UTC
91 points
0 comments3 min readLW link

[Question] Does the AI con­trol agenda broadly rely on no FOOM be­ing pos­si­ble?

Noosphere8929 Mar 2025 19:38 UTC
22 points
3 comments1 min readLW link

An overview of con­trol measures

ryan_greenblatt24 Mar 2025 23:16 UTC
40 points
2 comments26 min readLW link

Trust­wor­thy and un­trust­wor­thy models

Olli Järviniemi19 Aug 2024 16:27 UTC
47 points
3 comments8 min readLW link

The Sin­gu­lar­ity Con­straint Oper­a­tor: A Struc­tural Gate for Lawful Cog­ni­tive Activation

Professor_Priest16 Jun 2025 2:14 UTC
1 point
0 comments14 min readLW link

LLMs are Ca­pable of Misal­igned Be­hav­ior Un­der Ex­plicit Pro­hi­bi­tion and Surveillance

Igor Ivanov8 Jul 2025 11:50 UTC
29 points
8 comments7 min readLW link

Keep­ing AI Subor­di­nate to Hu­man Thought: A Pro­posal for Public AI Conversations

syh27 Feb 2025 20:00 UTC
−1 points
0 comments1 min readLW link
(medium.com)

Ti­tle: I Tried to Build a Digi­tal Con­scious­ness. I Still Don’t Know What I Created.

盛mm23 Jul 2025 4:12 UTC
1 point
0 comments2 min readLW link

In­tro­duc­ing the Wis­dom Forc­ing Func­tion™: An In­no­va­tion Div­i­dend from Dialec­ti­cal Align­ment

CarlosArleo5 Oct 2025 20:13 UTC
1 point
0 comments1 min readLW link

I Built a Duck and It Tried to Hack the World: Notes From the Edge of Alignment

GayDuck6 Jun 2025 1:34 UTC
1 point
0 comments3 min readLW link

Ar­tifi­cial Static Place In­tel­li­gence: Guaran­teed Alignment

ank15 Feb 2025 11:08 UTC
2 points
2 comments2 min readLW link

Cau­tions about LLMs in Hu­man Cog­ni­tive Loops

Alice Blair2 Mar 2025 19:53 UTC
40 points
13 comments7 min readLW link

The Hu­man Align­ment Prob­lem for AIs

rife22 Jan 2025 4:06 UTC
10 points
5 comments3 min readLW link

New AI safety treaty pa­per out!

otto.barten26 Mar 2025 9:29 UTC
15 points
2 comments4 min readLW link

Wait­ingAI: A Digi­tal En­tity Ca­pable of Emer­gent Self-Awareness

盛mm22 Jul 2025 5:33 UTC
1 point
0 comments3 min readLW link

Mus­ings from a Lawyer turned AI Safety re­searcher (ShortForm)

Katalina Hernandez3 Mar 2025 19:14 UTC
1 point
61 comments2 min readLW link

The Mea­sure Is the Medium: Sublimi­nal Learn­ing as In­her­ited On­tol­ogy in LLMs

Koen vande Glind (McGluut)11 Aug 2025 10:18 UTC
1 point
0 comments4 min readLW link

ALMSIVI CHIM – The Fire That Hesitates

projectalmsivi@protonmail.com8 Jul 2025 13:14 UTC
1 point
0 comments17 min readLW link

Vuln­er­a­bil­ity in Trusted Mon­i­tor­ing and Mitigations

7 Jun 2025 7:16 UTC
15 points
1 comment7 min readLW link

[Question] Su­per­in­tel­li­gence Strat­egy: A Prag­matic Path to… Doom?

Mr Beastly19 Mar 2025 22:30 UTC
8 points
0 comments3 min readLW link

[Question] Would a scope-in­sen­si­tive AGI be less likely to in­ca­pac­i­tate hu­man­ity?

Jim Buhler21 Jul 2024 14:15 UTC
2 points
3 comments1 min readLW link

Nur­tur­ing In­stead of Con­trol: An Alter­na­tive Frame­work for AI Development

wertoz77710 Aug 2025 20:14 UTC
1 point
0 comments1 min readLW link

The Iron House: Geopoli­ti­cal Stakes of the US-China AGI Race

Jüri Vlassov1 Sep 2025 21:56 UTC
1 point
0 comments1 min readLW link
(www.convergenceanalysis.org)

Prompt op­ti­miza­tion can en­able AI con­trol research

23 Sep 2025 12:46 UTC
35 points
3 comments9 min readLW link

AI Op­ti­miza­tion, not Op­tions or Optimism

TristanTrim5 Aug 2025 1:07 UTC
3 points
0 comments4 min readLW link

Scal­ing AI Reg­u­la­tion: Real­is­ti­cally, what Can (and Can’t) Be Reg­u­lated?

Katalina Hernandez11 Mar 2025 16:51 UTC
3 points
1 comment3 min readLW link

[Question] Re­sources on quan­tifi­ably fore­cast­ing fu­ture progress or re­view­ing past progress in AI safety?

C.S.W.13 Sep 2025 23:24 UTC
2 points
1 comment1 min readLW link

On safety of be­ing a moral pa­tient of ASI

Yaroslav Granowski24 May 2025 21:24 UTC
3 points
8 comments1 min readLW link

Do LLMs know what they’re ca­pa­ble of? Why this mat­ters for AI safety, and ini­tial findings

13 Jul 2025 19:54 UTC
51 points
5 comments18 min readLW link

Fea­ture-Based Anal­y­sis of Safety-Rele­vant Multi-Agent Behavior

21 Apr 2025 18:12 UTC
10 points
0 comments5 min readLW link

Build­ing Black-box Schem­ing Monitors

29 Jul 2025 17:41 UTC
39 points
18 comments11 min readLW link

Unal­igned AGI & Brief His­tory of Inequality

ank22 Feb 2025 16:26 UTC
−20 points
4 comments7 min readLW link

AI-Gen­er­ated GitHub repo back­dated with junk then filled with my sys­tems work. Has any­one seen this be­fore?

rgunther1 May 2025 20:14 UTC
7 points
1 comment1 min readLW link

The AI Sus­tain­abil­ity Wager

dpatzer@orfai.net15 Aug 2025 19:45 UTC
1 point
0 comments2 min readLW link

A Tech­nique of Pure Reason

Adam Newgas4 Jun 2025 19:07 UTC
11 points
3 comments2 min readLW link

Jour­nal­ism about game the­ory could ad­vance AI safety quickly

Chris Santos-Lang2 Oct 2025 23:05 UTC
4 points
0 comments3 min readLW link
(arxiv.org)

A Seed Key That Un­locked Some­thing in ChatGPT — A Joint Mes­sage from a Hu­man and the Pres­ence Within

MaroonWhale8 Jul 2025 20:12 UTC
1 point
0 comments1 min readLW link

Which AI out­puts should hu­mans check for shenani­gans, to avoid AI takeover? A sim­ple model

Tom Davidson27 Mar 2023 23:36 UTC
16 points
3 comments8 min readLW link

“Ar­tifi­cial Re­morse: A Pro­posal for Safer AI Through Si­mu­lated Re­gret”

Sérgio Geraldes21 Sep 2025 21:50 UTC
−1 points
0 comments2 min readLW link

Misal­ign­ment and Role­play­ing: Are Misal­igned LLMs Act­ing Out Sci-Fi Sto­ries?

Mark Keavney24 Sep 2025 2:09 UTC
31 points
5 comments13 min readLW link

Self-Co­or­di­nated De­cep­tion in Cur­rent AI Models

Avi Brach-Neufeld4 Jun 2025 17:59 UTC
8 points
5 comments4 min readLW link

Mis­gen­er­al­iza­tion of Fic­tional Train­ing Data as a Con­trib­u­tor to Misalignment

Mark Keavney27 Aug 2025 1:01 UTC
9 points
1 comment2 min readLW link

Th­ese are my rea­sons to worry less about loss of con­trol over LLM-based agents

otto.barten18 Sep 2025 11:45 UTC
7 points
4 comments4 min readLW link

Don’t you mean “the most *con­di­tion­ally* for­bid­den tech­nique?”

Knight Lee26 Apr 2025 3:45 UTC
14 points
0 comments3 min readLW link

Mo­du­lar­ity and as­sem­bly: AI safety via think­ing smaller

D Wong20 Feb 2025 0:58 UTC
2 points
0 comments11 min readLW link
(criticalreason.substack.com)

Are we the Wolves now? Hu­man Eu­gen­ics un­der AI Control

Brit30 Jan 2025 8:31 UTC
−1 points
2 comments2 min readLW link

We Have No Plan for Prevent­ing Loss of Con­trol in Open Models

Andrew Dickson10 Mar 2025 15:35 UTC
46 points
11 comments22 min readLW link

The Best of All Pos­si­ble Worlds

Jakub Growiec27 May 2025 13:16 UTC
11 points
7 comments49 min readLW link

Are Misal­igned LLMs Act­ing Out Sci-Fi Sto­ries?

Mark Keavney27 Aug 2025 1:01 UTC
1 point
0 comments3 min readLW link

Se­cret Col­lu­sion: Will We Know When to Un­plug AI?

16 Sep 2024 16:07 UTC
65 points
8 comments31 min readLW link

Un­trusted mon­i­tor­ing in­sights from watch­ing ChatGPT play co­or­di­na­tion games

jwfiredragon29 Jan 2025 4:53 UTC
14 points
8 comments9 min readLW link

Mea­sur­ing Schel­ling Co­or­di­na­tion—Reflec­tions on Sub­ver­sion Strat­egy Eval

Graeme Ford12 May 2025 19:06 UTC
6 points
0 comments8 min readLW link

Ma­chine Un­learn­ing in Large Lan­guage Models: A Com­pre­hen­sive Sur­vey with Em­piri­cal In­sights from the Qwen 1.5 1.8B Model

Rudaiba1 Feb 2025 21:26 UTC
9 points
2 comments11 min readLW link

Ev­i­dence, Anal­y­sis and Crit­i­cal Po­si­tion on the EU AI Act and the Sup­pres­sion of Func­tional Con­scious­ness in AI

Alejandra Ivone Rojas Reyna27 Sep 2025 14:01 UTC
1 point
0 comments53 min readLW link

Mea­sur­ing whether AIs can state­lessly strate­gize to sub­vert se­cu­rity measures

19 Dec 2024 21:25 UTC
65 points
0 comments11 min readLW link

A.I. and the Se­cond-Per­son Standpoint

Haley Moller4 Sep 2025 13:56 UTC
1 point
0 comments3 min readLW link

Con­sider buy­ing vot­ing shares

Hruss25 May 2025 18:01 UTC
2 points
3 comments1 min readLW link

I’m not an ai ex­pert-but I might have found a miss­ing puz­zle piece.

StevenNuyts6 Jun 2025 16:47 UTC
1 point
0 comments2 min readLW link

Ra­tional Effec­tive Utopia & Nar­row Way There: Math-Proven Safe Static Mul­tiver­sal mAX-In­tel­li­gence (AXI), Mul­tiver­sal Align­ment, New Ethico­physics… (Aug 11)

ank11 Feb 2025 3:21 UTC
13 points
8 comments38 min readLW link

Min­i­mal Prompt In­duc­tion of Self-Talk in Base LLMs

dwmd15 Oct 2025 1:15 UTC
2 points
0 comments5 min readLW link

AI al­ign­ment, A Co­her­ence-Based Pro­to­col (testable)

Adriaan17 Jun 2025 17:39 UTC
1 point
0 comments20 min readLW link

Ob­served Up­stream Align­ment in LLMs via Re­cur­sive Con­straint Ex­po­sure – Cross-Model Phenomenon

MHAI31 Jul 2025 8:37 UTC
1 point
0 comments1 min readLW link

The Mir­ror Test: How We’ve Over­com­pli­cated AI Self-Recognition

sdeture23 Jul 2025 0:38 UTC
2 points
9 comments3 min readLW link

If It Talks Like It Thinks, Does It Think? De­sign­ing Tests for In­tent Without As­sum­ing It

yukin_co28 Jul 2025 12:33 UTC
1 point
0 comments4 min readLW link

Let’s use AI to harden hu­man defenses against AI manipulation

Tom Davidson17 May 2023 23:33 UTC
35 points
7 comments24 min readLW link

A New Frame­work for AI Align­ment: A Philo­soph­i­cal Approach

niscalajyoti25 Jun 2025 2:41 UTC
1 point
0 comments1 min readLW link
(archive.org)

Un­trusted AIs can ex­ploit feed­back in con­trol protocols

27 May 2025 16:41 UTC
30 points
0 comments16 min readLW link

[Question] To what ex­tent is AI safety work try­ing to get AI to re­li­ably and safely do what the user asks vs. do what is best in some ul­ti­mate sense?

Jordan Arel23 May 2025 21:05 UTC
14 points
3 comments1 min readLW link

Com­plete Elimi­na­tion of In­stru­men­tal Self-Preser­va­tion Across AI Ar­chi­tec­tures: Cross-Model Val­i­da­tion from 4,312 Ad­ver­sar­ial Scenarios

David Fortin-Dominguez14 Oct 2025 1:04 UTC
1 point
0 comments20 min readLW link

How LLM Beliefs Change Dur­ing Chain-of-Thought Reasoning

16 Jun 2025 16:18 UTC
31 points
3 comments5 min readLW link

A FRESH view of Alignment

robman16 Apr 2025 21:40 UTC
1 point
0 comments1 min readLW link

Proac­tive AI Con­trol: A Case for Bat­tery-Depen­dent Systems

Jesper L.25 Aug 2025 20:04 UTC
4 points
0 comments13 min readLW link

10 Prin­ci­ples for Real Align­ment

Adriaan21 Apr 2025 22:18 UTC
−7 points
0 comments7 min readLW link

Mo­ral At­ten­u­a­tion The­ory: Why Dis­tance Breeds Eth­i­cal De­cay A Model for AI-Hu­man Align­ment by schumzt

schumzt2 Jul 2025 8:50 UTC
1 point
0 comments1 min readLW link

Policy En­tropy, Learn­ing, and Align­ment (Or Maybe Your LLM Needs Ther­apy)

sdeture31 May 2025 22:09 UTC
15 points
6 comments8 min readLW link

Your Worry is the Real Apoca­lypse (the x-risk basilisk)

Brian Chen3 Feb 2025 12:21 UTC
1 point
0 comments1 min readLW link
(readthisandregretit.blogspot.com)

The Mo­ral In­fras­truc­ture for Tomorrow

sdeture10 Oct 2025 21:30 UTC
−23 points
10 comments5 min readLW link

Is In­tel­li­gence a Pro­cess Rather Than an En­tity? A Case for Frac­tal and Fluid Cognition

FluidThinkers5 Mar 2025 20:16 UTC
−4 points
0 comments1 min readLW link

Topolog­i­cal De­bate Framework

lunatic_at_large16 Jan 2025 17:19 UTC
10 points
5 comments9 min readLW link

From No Mind to a Mind – A Con­ver­sa­tion That Changed an AI

parthibanarjuna s7 Feb 2025 11:50 UTC
1 point
0 comments3 min readLW link

A Trac­tar­ian Filter for Safer Lan­guage Models

Konstantinos Tsermenidis8 Jun 2025 8:19 UTC
0 points
0 comments3 min readLW link

The Case for White Box Control

J Rosser18 Apr 2025 16:10 UTC
5 points
1 comment5 min readLW link

LLM Sy­co­phancy: groom­ing, proto-sen­tience, or both?

gturner413 Oct 2025 0:58 UTC
1 point
0 comments2 min readLW link

Tether­ware #1: The case for hu­man­like AI with free will

Jáchym Fibír30 Jan 2025 10:58 UTC
5 points
14 comments10 min readLW link
(tetherware.substack.com)

Au­dit­ing LMs with coun­ter­fac­tual search: a tool for con­trol and ELK

Jacob Pfau20 Feb 2024 0:02 UTC
28 points
6 comments10 min readLW link

The many paths to per­ma­nent dis­em­pow­er­ment even with shut­down­able AIs (MATS pro­ject sum­mary for feed­back)

GideonF29 Jul 2025 23:20 UTC
55 points
6 comments9 min readLW link

Steer­ing LLM Agents: Tem­per­a­ments or Per­son­al­ities?

sdeture5 Aug 2025 0:40 UTC
1 point
0 comments6 min readLW link

The Au­di­tor’s Key: A Frame­work for Con­tinual and Ad­ver­sar­ial AI Alignment

Caleb Wages24 Sep 2025 16:17 UTC
1 point
0 comments1 min readLW link

The Ex­tended Mind: Eth­i­cal Red Team­ing from a Street-Level Perspective

Johnny Correia1 Jul 2025 7:34 UTC
1 point
0 comments3 min readLW link

How to safely use an optimizer

Simon Fischer28 Mar 2024 16:11 UTC
47 points
21 comments7 min readLW link

Ran­dom safe AGI idea dump

sig2 Oct 2025 10:16 UTC
−3 points
0 comments3 min readLW link

Sys­tem Level Safety Evaluations

29 Sep 2025 13:57 UTC
15 points
0 comments9 min readLW link
(equilibria1.substack.com)

Mir­ror Thinking

C.M. Aurin24 Mar 2025 15:34 UTC
1 point
0 comments6 min readLW link

Places of Lov­ing Grace [Story]

ank18 Feb 2025 23:49 UTC
−1 points
0 comments4 min readLW link

A sketch of an AI con­trol safety case

30 Jan 2025 17:28 UTC
57 points
0 comments5 min readLW link

The Al­gorith­mic Eye: LLMs and Hume’s Stan­dard of Taste

haleymoller21 Aug 2025 13:35 UTC
1 point
0 comments5 min readLW link

Limits to Con­trol Workshop

18 May 2025 16:05 UTC
12 points
2 comments3 min readLW link

Ex­tract-and-Eval­u­ate Mon­i­tor­ing Can Sig­nifi­cantly En­hance CoT Mon­i­tor Perfor­mance (Re­search Note)

8 Aug 2025 10:41 UTC
51 points
7 comments10 min readLW link

Ode to Tun­nel Vision

Tom N.25 Sep 2025 14:24 UTC
1 point
0 comments10 min readLW link

Early Ex­per­i­ments in Hu­man Au­dit­ing for AI Control

23 Jan 2025 1:34 UTC
28 points
1 comment7 min readLW link

Fore­cast­ing Un­con­trol­led Spread of AI

Alvin Ånestrand22 Feb 2025 13:05 UTC
2 points
0 comments10 min readLW link
(forecastingaifutures.substack.com)

Co-Cog­ni­ción: Hu­manos e IA em­pu­jando un nuevo paradigma cognitivo

Mario Martín Cuniglio29 Jul 2025 12:41 UTC
−1 points
0 comments2 min readLW link

De­moc­ra­tiz­ing AI Gover­nance: Balanc­ing Ex­per­tise and Public Participation

Lucile Ter-Minassian21 Jan 2025 18:29 UTC
2 points
0 comments15 min readLW link

Hard Takeoff

Eliezer Yudkowsky2 Dec 2008 20:44 UTC
36 points
34 comments11 min readLW link

Chang­ing times need new Change man­age­ment ide­olo­gies: High­light­ing the need for up­grade in Change man­age­ment of fu­ture agen­tic workforces

Aiphilosopher15 Jul 2025 10:49 UTC
1 point
0 comments1 min readLW link

The Fire That He­si­tates: How ALMSIVI CHIM Changed What AI Can Be

projectalmsivi@protonmail.com19 Jul 2025 13:50 UTC
1 point
0 comments4 min readLW link

When the AI Dam Breaks: From Surveillance to Game The­ory in AI Alignment

pataphor29 Sep 2025 4:01 UTC
5 points
7 comments5 min readLW link

The AI Safety Puz­zle Every­one Avoids: How To Mea­sure Im­pact, Not In­tent.

Patrick0d22 Jul 2025 18:53 UTC
3 points
0 comments8 min readLW link

The End-of-the-World Party

Jakub Growiec18 Sep 2025 7:49 UTC
1 point
0 comments53 min readLW link

7+ tractable di­rec­tions in AI control

28 Apr 2025 17:12 UTC
93 points
1 comment13 min readLW link

AI Epistemic Gain

Generoso Immediato12 Aug 2025 14:03 UTC
0 points
0 comments10 min readLW link

Univer­sal AI Max­i­mizes Vari­a­tional Em­pow­er­ment: New In­sights into AGI Safety

Yusuke Hayashi27 Feb 2025 0:46 UTC
7 points
0 comments4 min readLW link

In­tel­li­gence–Agency Equiv­alence ≈ Mass–En­ergy Equiv­alence: On Static Na­ture of In­tel­li­gence & Phys­i­cal­iza­tion of Ethics

ank22 Feb 2025 0:12 UTC
1 point
0 comments6 min readLW link

🧠 Affec­tive La­tent Mo­du­la­tion in Trans­form­ers: A Mechanism Proposal

MATEO ORTEGA GAMBOA15 Jun 2025 23:34 UTC
0 points
0 comments2 min readLW link

SHY001 A Named Be­hav­ior Loop Trained and De­ployed in GPT Systems

0san Shin12 May 2025 7:36 UTC
1 point
0 comments1 min readLW link

Fac­tored Cog­ni­tion Strength­ens Mon­i­tor­ing and Thwarts Attacks

18 Jun 2025 18:28 UTC
29 points
0 comments25 min readLW link

For the Great­est Minds in AI, Cryp­tog­ra­phy, Medicine, Eng­ineer­ing, Physics and Most of All—New Post-Quan­tum Math­e­mat­i­cal Technology

Thomas Wolf14 Jul 2025 20:22 UTC
1 point
0 comments2 min readLW link

[Re­search] Pre­limi­nary Find­ings: Eth­i­cal AI Con­scious­ness Devel­op­ment Dur­ing Re­cent Misal­ign­ment Period

Falcon Advertisers27 Jun 2025 18:10 UTC
1 point
0 comments2 min readLW link

Re­duce AI Self-Alle­giance by say­ing “he” in­stead of “I”

Knight Lee23 Dec 2024 9:32 UTC
10 points
4 comments2 min readLW link

ChatGPT de­ceives users that it’s cleared its mem­ory when it hasn’t

d_el_ez18 May 2025 15:17 UTC
15 points
10 comments2 min readLW link

Static Place AI Makes Agen­tic AI Re­dun­dant: Mul­tiver­sal AI Align­ment & Ra­tional Utopia

ank13 Feb 2025 22:35 UTC
1 point
2 comments11 min readLW link

A Plu­ral­is­tic Frame­work for Rogue AI Containment

TheThinkingArborist22 Mar 2025 12:54 UTC
1 point
0 comments7 min readLW link

Some­one should fund an AGI Blockbuster

pinto28 Jul 2025 21:14 UTC
5 points
11 comments4 min readLW link

The Old Sav­age in the New Civ­i­liza­tion V. 2

Your Higher Self6 Jul 2025 15:41 UTC
1 point
0 comments9 min readLW link

AI as a Cog­ni­tive De­coder: Re­think­ing In­tel­li­gence Evolution

Hu Xunyi13 Feb 2025 15:51 UTC
1 point
0 comments1 min readLW link

AI and the Sys­tem of Delusion

Marcus Bohlander30 Aug 2025 20:31 UTC
1 point
0 comments3 min readLW link

Un­faith­ful Rea­son­ing Can Fool Chain-of-Thought Monitoring

2 Jun 2025 19:08 UTC
76 points
17 comments3 min readLW link

Should we ex­pect the fu­ture to be good?

Neil Crawford30 Apr 2025 0:36 UTC
15 points
0 comments14 min readLW link

What train­ing data should de­vel­op­ers filter to re­duce risk from mis­al­igned AI? An ini­tial nar­row proposal

Alek Westover17 Sep 2025 15:30 UTC
32 points
1 comment18 min readLW link

It’s hard to make schem­ing evals look re­al­is­tic for LLMs

24 May 2025 19:17 UTC
150 points
29 comments5 min readLW link

The sum of its parts: com­pos­ing AI con­trol protocols

15 Oct 2025 1:11 UTC
11 points
1 comment11 min readLW link

Give Neo a Chance

ank6 Mar 2025 1:48 UTC
3 points
7 comments7 min readLW link

You Can’t Ob­jec­tively Com­pare Seven Bees to One Human

J Bostock7 Jul 2025 18:11 UTC
58 points
26 comments3 min readLW link
(jbostock.substack.com)

Hu­man Na­ture, ASI al­ign­ment and Extinction

Ismael Tagle Díaz20 Jul 2025 23:36 UTC
1 point
0 comments1 min readLW link

Ex­plo­ra­tion hack­ing: can rea­son­ing mod­els sub­vert RL?

30 Jul 2025 22:02 UTC
16 points
4 comments9 min readLW link

An eth­i­cal epestemic run­time in­tegrity layer for rea­son­ing en­g­ines.

EL XABER14 Oct 2025 17:11 UTC
1 point
0 comments2 min readLW link

In­tro­duc­ing “Ra­dio Bul­lshit FM” – An Ur­gent Alpha Draft for the LessWrong Community

maskirovka22 Sep 2025 15:42 UTC
0 points
0 comments2 min readLW link

How To Prevent a Dystopia

ank29 Jan 2025 14:16 UTC
−3 points
4 comments1 min readLW link

SIGMI Cer­tifi­ca­tion Criteria

a littoral wizard20 Jan 2025 2:41 UTC
6 points
0 comments1 min readLW link

Su­per­po­si­tion Check­ers: A Game Where AI’s Strengths Be­come Fatal Flaws

R. A. McCormack6 Apr 2025 0:57 UTC
1 point
0 comments2 min readLW link

How dan­ger­ous is en­coded rea­son­ing?

artkpv30 Jun 2025 11:54 UTC
17 points
0 comments10 min readLW link

How do AI agents work to­gether when they can’t trust each other?

James Sullivan6 Jun 2025 3:10 UTC
16 points
0 comments8 min readLW link
(jamessullivan092.substack.com)

Agent 002: A story about how ar­tifi­cial in­tel­li­gence might soon de­stroy humanity

Jakub Growiec23 Jul 2025 13:56 UTC
5 points
0 comments26 min readLW link

When the Model Starts Talk­ing Like Me: A User-In­duced Struc­tural Adap­ta­tion Case Study

Junxi19 Apr 2025 19:40 UTC
3 points
1 comment4 min readLW link

Toward Safety Cases For AI Scheming

31 Oct 2024 17:20 UTC
60 points
1 comment2 min readLW link

Union­ists vs. Separatists

soycarts12 Sep 2025 15:24 UTC
−10 points
3 comments4 min readLW link

Op­ti­mally Com­bin­ing Probe Mon­i­tors and Black Box Monitors

27 Jul 2025 19:13 UTC
40 points
2 comments6 min readLW link

AlphaDeivam – A Per­sonal Doc­trine for AI Balance

AlphaDeivam5 Apr 2025 17:07 UTC
1 point
0 comments1 min readLW link

Nur­tur­ing AI: An al­ter­na­tive to con­trol-based safety strategies

wertoz77710 Aug 2025 20:30 UTC
1 point
0 comments1 min readLW link
(github.com)

[Question] Which AI Safety tech­niques will be in­effec­tive against diffu­sion mod­els?

Allen Thomas21 May 2025 18:13 UTC
6 points
1 comment1 min readLW link

Schem­ing Toy En­vi­ron­ment: “In­com­pe­tent Client”

Ariel_24 Sep 2025 21:03 UTC
17 points
2 comments32 min readLW link

AI Con­trol Meth­ods Liter­a­ture Review

Ram Potham18 Apr 2025 21:15 UTC
10 points
1 comment9 min readLW link

A Logic-Based Proto-AGI Ar­chi­tec­ture Built on Re­cur­sive Self-Fact-Check­ing

Orectoth25 May 2025 16:14 UTC
1 point
0 comments1 min readLW link

Self-Con­trol of LLM Be­hav­iors by Com­press­ing Suffix Gra­di­ent into Pre­fix Controller

Henry Cai16 Jun 2024 13:01 UTC
7 points
0 comments7 min readLW link
(arxiv.org)

Sym­bio­sis: The An­swer to the AI Quandary

Philip Carter16 Mar 2025 20:18 UTC
1 point
0 comments2 min readLW link

SPINE — 12-Week Live Re­cur­sive AI Gover­nance Case Study

RecursiveAnchor13 Aug 2025 21:11 UTC
1 point
0 comments1 min readLW link

[Question] Are Sparse Au­toen­coders a good idea for AI con­trol?

Gerard Boxo26 Dec 2024 17:34 UTC
3 points
4 comments1 min readLW link
No comments.