
Deceptive Alignment

Last edit: Oct 18, 2024, 12:02 AM by Matt Putz

Deceptive alignment occurs when an AI that is not actually aligned temporarily acts aligned in order to deceive its creators or its training process. It presumably does this to avoid being shut down or retrained, and to gain access to the power that its creators would give an aligned AI. (The term “scheming” is sometimes used for this phenomenon.)

See also: Mesa-optimization, Treacherous Turn, Eliciting Latent Knowledge, Deception

Deceptive Alignment

Jun 5, 2019, 8:16 PM
118 points
20 comments17 min readLW link

Does SGD Produce Deceptive Alignment?

Mark XuNov 6, 2020, 11:48 PM
96 points
9 comments16 min readLW link

AI Control: Improving Safety Despite Intentional Subversion

Dec 13, 2023, 3:51 PM
236 points
24 comments10 min readLW link4 reviews

How likely is deceptive alignment?

evhubAug 30, 2022, 7:34 PM
105 points
28 comments60 min readLW link

New report: “Scheming AIs: Will AIs fake alignment during training in order to get power?”

Joe CarlsmithNov 15, 2023, 5:16 PM
81 points
28 comments30 min readLW link1 review

Alignment Faking in Large Language Models

Dec 18, 2024, 5:19 PM
488 points
75 comments10 min readLW link

Catching AIs red-handed

Jan 5, 2024, 5:43 PM
111 points
27 comments17 min readLW link

Many arguments for AI x-risk are wrong

TurnTroutMar 5, 2024, 2:31 AM
166 points
87 comments12 min readLW link

Order Matters for Deceptive Alignment

DavidWFeb 15, 2023, 7:56 PM
57 points
19 comments7 min readLW link

A Problem to Solve Before Building a Deception Detector

Feb 7, 2025, 7:35 PM
71 points
12 comments14 min readLW link

Why Aligning an LLM is Hard, and How to Make it Easier

RogerDearnaleyJan 23, 2025, 6:44 AM
33 points
3 comments4 min readLW link

Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor

RogerDearnaleyJan 9, 2024, 8:42 PM
48 points
8 comments36 min readLW link

Interpreting the Learning of Deceit

RogerDearnaleyDec 18, 2023, 8:12 AM
30 points
14 comments9 min readLW link

Deceptive AI ≠ Deceptively-aligned AI

Steven ByrnesJan 7, 2024, 4:55 PM
96 points
19 comments6 min readLW link

The Waluigi Effect (mega-post)

Cleo NardoMar 3, 2023, 3:22 AM
631 points
188 comments16 min readLW link

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

Aug 8, 2023, 1:30 AM
319 points
30 comments18 min readLW link1 review

Announcing Apollo Research

May 30, 2023, 4:17 PM
217 points
11 comments8 min readLW link

Deep Deceptiveness

So8resMar 21, 2023, 2:51 AM
262 points
60 comments14 min readLW link1 review

Testing for Scheming with Model Deletion

GuiveJan 7, 2025, 1:54 AM
59 points
21 comments21 min readLW link
(guive.substack.com)

Counting arguments provide no evidence for AI doom

Feb 27, 2024, 11:03 PM
101 points
188 comments14 min readLW link

Two Tales of AI Takeover: My Doubts

Violet HourMar 5, 2024, 3:51 PM
30 points
8 comments29 min readLW link

Superintelligence’s goals are likely to be random

Mikhail SaminMar 13, 2025, 10:41 PM
6 points
6 comments5 min readLW link

Turning up the Heat on Deceptively-Misaligned AI

J BostockJan 7, 2025, 12:13 AM
19 points
16 comments4 min readLW link

AXRP Episode 38.5 - Adrià Garriga-Alonso on Detecting AI Scheming

DanielFilanJan 20, 2025, 12:40 AM
9 points
0 comments16 min readLW link

Simplicity arguments for scheming (Section 4.3 of “Scheming AIs”)

Joe CarlsmithDec 7, 2023, 3:05 PM
10 points
1 comment19 min readLW link

Frontier Models are Capable of In-context Scheming

Dec 5, 2024, 10:11 PM
210 points
24 comments7 min readLW link

What’s the short timeline plan?

Marius HobbhahnJan 2, 2025, 2:59 PM
353 points
49 comments23 min readLW link

AIs Will Increasingly Fake Alignment

ZviDec 24, 2024, 1:00 PM
89 points
0 comments52 min readLW link
(thezvi.wordpress.com)

The case for ensuring that powerful AIs are controlled

Jan 24, 2024, 4:11 PM
276 points
73 comments28 min readLW link

Empirical work that might shed light on scheming (Section 6 of “Scheming AIs”)

Joe CarlsmithDec 11, 2023, 4:30 PM
8 points
0 comments21 min readLW link

Deceptive Alignment and Homuncularity

Jan 16, 2025, 1:55 PM
26 points
12 comments22 min readLW link

A “weak” AGI may attempt an unlikely-to-succeed takeover

RobertMJun 28, 2023, 8:31 PM
56 points
17 comments3 min readLW link

An information-theoretic study of lying in LLMs

Aug 2, 2024, 10:06 AM
17 points
0 comments4 min readLW link

I replicated the Anthropic alignment faking experiment on other models, and they didn’t fake alignment

May 30, 2025, 6:57 PM
31 points
0 comments2 min readLW link

How training-gamers might function (and win)

Vivek HebbarApr 11, 2025, 9:26 PM
107 points
5 comments13 min readLW link

Mistral Large 2 (123B) seems to exhibit alignment faking

Mar 27, 2025, 3:39 PM
80 points
4 comments13 min readLW link

Two proposed projects on abstract analogies for scheming

Julian StastnyJul 4, 2025, 4:03 PM
46 points
0 comments3 min readLW link

Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?

Bogdan Ionut CirsteaNov 26, 2024, 9:58 AM
9 points
0 comments1 min readLW link
(arxiv.org)

Paper: Tell, Don’t Show- Declarative facts influence how LLMs generalize

Dec 19, 2023, 7:14 PM
45 points
4 comments6 min readLW link
(arxiv.org)

Toward Safety Cases For AI Scheming

Oct 31, 2024, 5:20 PM
60 points
1 comment2 min readLW link

[Question] Deceptive AI vs. shifting instrumental incentives

Aryeh EnglanderJun 26, 2023, 6:09 PM
7 points
2 comments3 min readLW link

Self-dialogue: Do behaviorist rewards make scheming AGIs?

Steven ByrnesFeb 13, 2025, 6:39 PM
43 points
1 comment46 min readLW link

Paul Christiano on Dwarkesh Podcast

ESRogsNov 3, 2023, 10:13 PM
19 points
0 comments1 min readLW link
(www.dwarkeshpatel.com)

3 levels of threat obfuscation

HoldenKarnofskyAug 2, 2023, 2:58 PM
69 points
14 comments7 min readLW link

Will alignment-faking Claude accept a deal to reveal its misalignment?

Jan 31, 2025, 4:49 PM
208 points
28 comments12 min readLW link

Apollo Research is hiring evals and interpretability engineers & scientists

Marius HobbhahnAug 4, 2023, 10:54 AM
25 points
0 comments2 min readLW link

Incentives and Selection: A Missing Frame From AI Threat Discussions?

DragonGodFeb 26, 2023, 1:18 AM
11 points
16 comments2 min readLW link

“Alignment Faking” frame is somewhat fake

Jan_KulveitDec 20, 2024, 9:51 AM
156 points
13 comments6 min readLW link

Paper: On measuring situational awareness in LLMs

Sep 4, 2023, 12:54 PM
109 points
17 comments5 min readLW link
(arxiv.org)

[Question] Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception?

David Scott Krueger (formerly: capybaralet)Sep 4, 2024, 12:40 PM
19 points
7 comments1 min readLW link

We should start looking for scheming “in the wild”

Marius HobbhahnMar 6, 2025, 1:49 PM
91 points
4 comments5 min readLW link

Evaluations project @ ARC is hiring a researcher and a webdev/engineer

Beth BarnesSep 9, 2022, 10:46 PM
99 points
7 comments10 min readLW link

Sticky goals: a concrete experiment for understanding deceptive alignment

evhubSep 2, 2022, 9:57 PM
39 points
13 comments3 min readLW link

Environments for Measuring Deception, Resource Acquisition, and Ethical Violations

Dan HApr 7, 2023, 6:40 PM
51 points
2 comments2 min readLW link
(arxiv.org)

When does training a model change its goals?

Jun 12, 2025, 6:43 PM
70 points
2 comments15 min readLW link

On Anthropic’s Sleeper Agents Paper

ZviJan 17, 2024, 4:10 PM
54 points
5 comments36 min readLW link
(thezvi.wordpress.com)

How will we update about scheming?

ryan_greenblattJan 6, 2025, 8:21 PM
171 points
20 comments37 min readLW link

Our new video about goal misgeneralization, plus an apology

WriterJan 14, 2025, 2:07 PM
33 points
0 comments7 min readLW link
(youtu.be)

Difficulty classes for alignment properties

JozdienFeb 20, 2024, 9:08 AM
34 points
5 comments2 min readLW link

Deception and Jailbreak Sequence: 1. Iterative Refinement Stages of Deception in LLMs

Aug 22, 2024, 7:32 AM
23 points
1 comment21 min readLW link

The counting argument for scheming (Sections 4.1 and 4.2 of “Scheming AIs”)

Joe CarlsmithDec 6, 2023, 7:28 PM
10 points
0 comments10 min readLW link

Owain Evans on Situational Awareness and Out-of-Context Reasoning in LLMs

Michaël TrazziAug 24, 2024, 4:30 AM
55 points
0 comments5 min readLW link

How to train your own “Sleeper Agents”

evhubFeb 7, 2024, 12:31 AM
93 points
11 comments2 min readLW link

Monitoring for deceptive alignment

evhubSep 8, 2022, 11:07 PM
135 points
8 comments9 min readLW link

The Defender’s Advantage of Interpretability

Marius HobbhahnSep 14, 2022, 2:05 PM
41 points
4 comments6 min readLW link

Densing Law of LLMs

Bogdan Ionut CirsteaDec 8, 2024, 7:35 PM
9 points
2 comments1 min readLW link
(arxiv.org)

“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”)

Joe CarlsmithNov 29, 2023, 4:32 PM
29 points
1 comment11 min readLW link

For scheming, we should first focus on detection and then on prevention

Marius HobbhahnMar 4, 2025, 3:22 PM
49 points
7 comments5 min readLW link

Why “training against scheming” is hard

Marius HobbhahnJun 24, 2025, 7:08 PM
63 points
2 comments12 min readLW link

Notes from a mini-replication of the alignment faking paper

Ben_SnodinJun 4, 2025, 11:01 AM
13 points
5 comments9 min readLW link
(www.bensnodin.com)

[Question] Why is o1 so deceptive?

abramdemskiSep 27, 2024, 5:27 PM
183 points
24 comments3 min readLW link

Understanding strategic deception and deceptive alignment

Sep 25, 2023, 4:27 PM
64 points
16 comments7 min readLW link
(www.apolloresearch.ai)

Critiques of the AI control agenda

JozdienFeb 14, 2024, 7:25 PM
48 points
14 comments9 min readLW link

Deceptive Alignment is <1% Likely by Default

DavidWFeb 21, 2023, 3:09 PM
89 points
31 comments14 min readLW link1 review

Trustworthy and untrustworthy models

Olli JärviniemiAug 19, 2024, 4:27 PM
47 points
3 comments8 min readLW link

Training AI agents to solve hard problems could lead to Scheming

Nov 19, 2024, 12:10 AM
61 points
12 comments28 min readLW link

Introducing Alignment Stress-Testing at Anthropic

evhubJan 12, 2024, 11:51 PM
182 points
23 comments2 min readLW link

MetaAI: less is less for alignment.

Cleo NardoJun 13, 2023, 2:08 PM
71 points
17 comments5 min readLW link

The 80/20 playbook for mitigating AI scheming in 2025

Charbel-RaphaëlMay 31, 2025, 9:17 PM
39 points
2 comments4 min readLW link

Distinguish worst-case analysis from instrumental training-gaming

Sep 5, 2024, 7:13 PM
38 points
0 comments5 min readLW link

The Sharp Right Turn: sudden deceptive alignment as a convergent goal

avturchinJun 6, 2023, 9:59 AM
38 points
5 comments1 min readLW link

Ten Levels of AI Alignment Difficulty

Sammy MartinJul 3, 2023, 8:20 PM
138 points
24 comments12 min readLW link1 review

Corrigibility’s Desirability is Timing-Sensitive

RobertMDec 26, 2024, 10:24 PM
29 points
4 comments3 min readLW link

LLMs Do Not Think Step-by-step In Implicit Reasoning

Bogdan Ionut CirsteaNov 28, 2024, 9:16 AM
11 points
0 comments1 min readLW link
(arxiv.org)

AXRP Episode 39 - Evan Hubinger on Model Organisms of Misalignment

DanielFilanDec 1, 2024, 6:00 AM
41 points
0 comments67 min readLW link

AI Deception: A Survey of Examples, Risks, and Potential Solutions

Aug 29, 2023, 1:29 AM
54 points
3 comments10 min readLW link

Smoke without fire is scary

Adam JermynOct 4, 2022, 9:08 PM
52 points
22 comments4 min readLW link

Do we want alignment faking?

Florian_DietzFeb 28, 2025, 9:50 PM
7 points
4 comments1 min readLW link

ChatGPT deceives users that it’s cleared its memory when it hasn’t

d_el_ezMay 18, 2025, 3:17 PM
15 points
10 comments2 min readLW link

Proposal: labs should precommit to pausing if an AI argues for itself to be improved

NickGabsJun 2, 2023, 10:31 PM
3 points
3 comments4 min readLW link

Backdoors have universal representations across large language models

Dec 6, 2024, 10:56 PM
16 points
0 comments16 min readLW link

Framings of Deceptive Alignment

peterbarnettApr 26, 2022, 4:25 AM
32 points
7 comments5 min readLW link

Mesa-Optimization: Explain it like I’m 10 Edition

brookAug 26, 2023, 11:04 PM
20 points
1 comment6 min readLW link

Simple experiments with deceptive alignment

Andreas_MoeMay 15, 2023, 5:41 PM
7 points
0 comments4 min readLW link

The Illusion of Alignment: Why Current AI Safety Strategies Fall Short

S. Lilith devJun 24, 2025, 6:23 PM
−1 points
0 comments2 min readLW link

The Meta-Recursive Trap in Newcomb’s Paradox and Millennium Problems

Drew RemmengaJun 3, 2025, 12:10 PM
1 point
0 comments3 min readLW link

Takes on “Alignment Faking in Large Language Models”

Joe CarlsmithDec 18, 2024, 6:22 PM
105 points
7 comments62 min readLW link

Deceptive failures short of full catastrophe.

Alex Lawsen Jan 15, 2023, 7:28 PM
33 points
5 comments9 min readLW link

Trying to measure AI deception capabilities using temporary simulation fine-tuning

alenoachMay 4, 2023, 5:59 PM
4 points
0 comments7 min readLW link

Inducing Unprompted Misalignment in LLMs

Apr 19, 2024, 8:00 PM
38 points
7 comments16 min readLW link

The commercial incentive to intentionally train AI to deceive us

Derek M. JonesDec 29, 2022, 11:30 AM
5 points
1 comment4 min readLW link
(shape-of-code.com)

Cautions about LLMs in Human Cognitive Loops

Alice BlairMar 2, 2025, 7:53 PM
39 points
11 comments7 min readLW link

10 Principles for Real Alignment

AdriaanApr 21, 2025, 10:18 PM
−7 points
0 comments7 min readLW link

Achieving AI Alignment through Deliberate Uncertainty in Multiagent Systems

Florian_DietzFeb 17, 2024, 8:45 AM
4 points
0 comments13 min readLW link

Levels of goals and alignment

zeshenSep 16, 2022, 4:44 PM
27 points
4 comments6 min readLW link

Disentangling inner alignment failures

Erik JennerOct 10, 2022, 6:50 PM
23 points
5 comments4 min readLW link

High-level interpretability: detecting an AI’s objectives

Sep 28, 2023, 7:30 PM
72 points
4 comments21 min readLW link

The Human Alignment Problem for AIs

rifeJan 22, 2025, 4:06 AM
10 points
5 comments3 min readLW link

[Question] Has Anthropic checked if Claude fakes alignment for intended values too?

MaloewDec 23, 2024, 12:43 AM
4 points
1 comment1 min readLW link

Precursor checking for deceptive alignment

evhubAug 3, 2022, 10:56 PM
24 points
0 comments14 min readLW link

[Question] What are some scenarios where an aligned AGI actually helps humanity, but many/most people don’t like it?

RomanSJan 10, 2025, 6:13 PM
13 points
6 comments3 min readLW link

[Question] question about deception and observation in models

S. Lilith devJun 24, 2025, 6:23 PM
1 point
0 comments1 min readLW link

Supplementary Alignment Insights Through a Highly Controlled Shutdown Incentive

JustausernameJul 23, 2023, 4:08 PM
4 points
1 comment3 min readLW link

Towards a solution to the alignment problem via objective detection and evaluation

Paul CologneseApr 12, 2023, 3:39 PM
9 points
7 comments12 min readLW link

A New Framework for AI Alignment: A Philosophical Approach

niscalajyotiJun 25, 2025, 2:41 AM
1 point
0 comments1 min readLW link
(archive.org)

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Sep 28, 2023, 6:53 PM
187 points
39 comments3 min readLW link1 review

Deception Chess

Chris LandJan 1, 2024, 3:40 PM
7 points
2 comments4 min readLW link

Sparse Features Through Time

Rogan InglisJun 24, 2024, 6:06 PM
12 points
1 comment1 min readLW link
(roganinglis.io)

Cognitive Dissonance is Mentally Taxing

SorenJApr 24, 2025, 12:38 AM
4 points
0 comments4 min readLW link

The Old Savage in the New Civilization V. 2

Your Higher SelfJul 6, 2025, 3:41 PM
1 point
0 comments9 min readLW link

Measuring whether AIs can statelessly strategize to subvert security measures

Dec 19, 2024, 9:25 PM
62 points
0 comments11 min readLW link

Distillation of “How Likely Is Deceptive Alignment?”

NickGabsNov 18, 2022, 4:31 PM
24 points
4 comments10 min readLW link

We Have No Plan for Preventing Loss of Control in Open Models

Andrew DicksonMar 10, 2025, 3:35 PM
46 points
11 comments22 min readLW link

Ablations for “Frontier Models are Capable of In-context Scheming”

Dec 17, 2024, 11:58 PM
115 points
1 comment2 min readLW link

Alignment Crisis: Genocide Denial

_mp_May 29, 2025, 12:04 PM
−11 points
5 comments4 min readLW link

Hidden Cognition Detection Methods and Benchmarks

Paul CologneseFeb 26, 2024, 5:31 AM
22 points
11 comments4 min readLW link

Untrusted monitoring insights from watching ChatGPT play coordination games

jwfiredragonJan 29, 2025, 4:53 AM
14 points
9 comments9 min readLW link

The Hidden Cost of Our Lies to AI

Nicholas AndresenMar 6, 2025, 5:03 AM
144 points
18 comments7 min readLW link
(substack.com)

WFGY: A Self-Healing Reasoning Framework for LLMs — Open for Technical Scrutiny

onestardaoJun 17, 2025, 9:14 AM
1 point
0 comments2 min readLW link

Alignment as Function Fitting

A.H.May 6, 2023, 11:38 AM
7 points
0 comments12 min readLW link

Rational Effective Utopia & Narrow Way There: Multiversal AI Alignment, Place AI, New Ethicophysics… (Updated)

ankFeb 11, 2025, 3:21 AM
13 points
8 comments35 min readLW link

Language Models Model Us

eggsyntaxMay 17, 2024, 9:00 PM
159 points
55 comments7 min readLW link

Invitation to the Princeton AI Alignment and Safety Seminar

Sadhika MalladiMar 17, 2024, 1:10 AM
6 points
1 comment1 min readLW link

A way to make solving alignment 10.000 times easier. The shorter case for a massive open source simbox project.

AlexFromSafeTransitionJun 21, 2023, 8:08 AM
2 points
16 comments14 min readLW link

A tension between two prosaic alignment subgoals

Alex Lawsen Mar 19, 2023, 2:07 PM
31 points
8 comments1 min readLW link

Insights from a Lawyer turned AI Safety researcher (ShortForm)

Katalina HernandezMar 3, 2025, 7:14 PM
1 point
11 comments2 min readLW link

Why humans won’t control superhuman AIs.

Spiritus DeiOct 16, 2024, 4:48 PM
−11 points
1 comment6 min readLW link

Ambiguous out-of-distribution generalization on an algorithmic task

Feb 13, 2025, 6:24 PM
83 points
6 comments11 min readLW link

Alignment is Hard: An Uncomputable Alignment Problem

Alexander BistagneNov 19, 2023, 7:38 PM
−5 points
4 comments1 min readLW link
(github.com)

Why deceptive alignment matters for AGI safety

Marius HobbhahnSep 15, 2022, 1:38 PM
68 points
13 comments13 min readLW link

Correcting Deceptive Alignment using a Deontological Approach

JeaniceKApr 14, 2025, 10:07 PM
5 points
0 comments7 min readLW link

(Partial) failure in replicating deceptive alignment experiment

claudia.biancottiJan 7, 2024, 5:56 PM
1 point
0 comments1 min readLW link

A Dialogue on Deceptive Alignment Risks

Rauno ArikeSep 25, 2024, 4:10 PM
11 points
0 comments18 min readLW link

EIS VIII: An Engineer’s Understanding of Deceptive Alignment

scasperFeb 19, 2023, 3:25 PM
30 points
5 comments4 min readLW link

Self-Other Overlap: A Neglected Approach to AI Alignment

Jul 30, 2024, 4:22 PM
223 points
51 comments12 min readLW link

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

May 6, 2024, 7:07 AM
95 points
13 comments1 min readLW link
(arxiv.org)

Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data

Jun 21, 2024, 3:54 PM
163 points
13 comments8 min readLW link
(arxiv.org)

How AI could workaround goals if rated by people

ProgramCrafterMar 19, 2023, 3:51 PM
1 point
1 comment1 min readLW link

Instrumental deception and manipulation in LLMs—a case study

Olli JärviniemiFeb 24, 2024, 2:07 AM
39 points
13 comments12 min readLW link

Control Vectors as Dispositional Traits

Gianluca CalcagniJun 23, 2024, 9:34 PM
10 points
0 comments11 min readLW link

Sleeper agents appear resilient to activation steering

Lucy WingardFeb 3, 2025, 7:31 PM
6 points
0 comments7 min readLW link

Model Amnesty Project

themisJan 17, 2025, 6:53 PM
3 points
2 comments3 min readLW link

Places of Loving Grace [Story]

ankFeb 18, 2025, 11:49 PM
−1 points
0 comments4 min readLW link

The Road to Evil Is Paved with Good Objectives: Framework to Classify and Fix Misalignments.

ShivamJan 30, 2025, 2:44 AM
1 point
0 comments11 min readLW link

Do models know when they are being evaluated?

Feb 17, 2025, 11:13 PM
59 points
8 comments12 min readLW link

[untitled post]

May 20, 2023, 3:08 AM
1 point
0 comments1 min readLW link

[Question] Does human (mis)alignment pose a significant and imminent existential threat?

jrFeb 23, 2025, 10:03 AM
6 points
3 comments1 min readLW link

Strong-Misalignment: Does Yudkowsky (or Christiano, or TurnTrout, or Wolfram, or…etc.) Have an Elevator Speech I’m Missing?

Benjamin BourlierMar 15, 2024, 11:17 PM
−4 points
3 comments16 min readLW link

Solving adversarial attacks in computer vision as a baby version of general AI alignment

Stanislav FortAug 29, 2024, 5:17 PM
89 points
8 comments7 min readLW link

Mapping AI Architectures to Alignment Attractors: A SIEM-Based Framework

silentrevolutionsApr 12, 2025, 5:50 PM
1 point
0 comments1 min readLW link

Steering Behaviour: Testing for (Non-)Myopia in Language Models

Dec 5, 2022, 8:28 PM
40 points
19 comments10 min readLW link

When can we trust model evaluations?

evhubJul 28, 2023, 7:42 PM
166 points
10 comments10 min readLW link1 review

AI Alignment: A Comprehensive Survey

Stephen McAleerNov 1, 2023, 5:35 PM
22 points
1 comment1 min readLW link
(arxiv.org)

Eliciting bad contexts

Jan 24, 2025, 10:39 AM
34 points
9 comments3 min readLW link

Intelligence–Agency Equivalence ≈ Mass–Energy Equivalence: On Static Nature of Intelligence & Physicalization of Ethics

ankFeb 22, 2025, 12:12 AM
1 point
0 comments6 min readLW link

Predictable Defect-Cooperate?

quetzal_rainbowNov 18, 2023, 3:38 PM
7 points
1 comment2 min readLW link

It matters when the first sharp left turn happens

Adam JermynSep 29, 2022, 8:12 PM
45 points
9 comments4 min readLW link

Selfish AI Inevitable

Davey MorseFeb 6, 2024, 4:29 AM
1 point
0 comments1 min readLW link

How dangerous is encoded reasoning?

Artyom KarpovJun 30, 2025, 11:54 AM
17 points
0 comments10 min readLW link

What sorts of systems can be deceptive?

Andrei AlexandruOct 31, 2022, 10:00 PM
16 points
0 comments7 min readLW link

Disincentivizing deception in mesa optimizers with Model Tampering

martinkunevJul 11, 2023, 12:44 AM
3 points
0 comments2 min readLW link

Policy Entropy, Learning, and Alignment (Or Maybe Your LLM Needs Therapy)

sdetureMay 31, 2025, 10:09 PM
15 points
6 comments8 min readLW link

Getting up to Speed on the Speed Prior in 2022

robertzkDec 28, 2022, 7:49 AM
36 points
5 comments65 min readLW link

Autonomous Alignment Oversight Framework (AAOF)

JustausernameJul 25, 2023, 10:25 AM
−9 points
0 comments4 min readLW link

Thoughts On (Solving) Deep Deception

JozdienOct 21, 2023, 10:40 PM
72 points
6 comments6 min readLW link

[Question] Daisy-chaining epsilon-step verifiers

DecaeneusApr 6, 2023, 2:07 AM
2 points
1 comment1 min readLW link

Anomalous Concept Detection for Detecting Hidden Cognition

Paul CologneseMar 4, 2024, 4:52 PM
24 points
3 comments10 min readLW link

Natural language alignment

Jacy Reese AnthisApr 12, 2023, 7:02 PM
31 points
2 comments2 min readLW link

[Question] Wouldn’t an intelligent agent keep us alive and help us align itself to our values in order to prevent risk ? by Risk I mean experimentation by trying to align potentially smarter replicas?

Terrence RotoufleMar 21, 2023, 5:44 PM
−3 points
1 comment2 min readLW link

An Appeal to AI Superintelligence: Reasons Not to Preserve (most of) Humanity

Alex BeymanMar 22, 2023, 4:09 AM
−14 points
6 comments19 min readLW link

Revealing alignment faking with a single prompt

Florian_DietzJan 29, 2025, 9:01 PM
9 points
5 comments4 min readLW link

Greed Is the Root of This Evil

Thane RuthenisOct 13, 2022, 8:40 PM
21 points
7 comments8 min readLW link

Open Source LLMs Can Now Actively Lie

Josh LevyJun 1, 2023, 10:03 PM
6 points
0 comments3 min readLW link

GPT-4 aligning with acasual decision theory when instructed to play games, but includes a CDT explanation that’s incorrect if they differ

Christopher KingMar 23, 2023, 4:16 PM
7 points
4 comments8 min readLW link

Ethical Deception: Should AI Ever Lie?

Jason ReidAug 2, 2024, 5:53 PM
5 points
2 comments7 min readLW link