Deceptive Alignment

Last edit: 18 Oct 2024 0:02 UTC by Matt Putz

Deceptive Alignment is when an AI that is not actually aligned temporarily acts aligned in order to deceive its creators or its training process. It presumably does this to avoid being shut down or retrained, and to gain access to the power that its creators would give an aligned AI. (The term "scheming" is sometimes used for this phenomenon.)

See also: Mesa-optimization, Treacherous Turn, Eliciting Latent Knowledge, Deception

De­cep­tive Alignment

5 Jun 2019 20:16 UTC
118 points
20 comments17 min readLW link

Does SGD Pro­duce De­cep­tive Align­ment?

Mark Xu6 Nov 2020 23:48 UTC
96 points
9 comments16 min readLW link

AI Con­trol: Im­prov­ing Safety De­spite In­ten­tional Subversion

13 Dec 2023 15:51 UTC
239 points
24 comments10 min readLW link4 reviews

How likely is de­cep­tive al­ign­ment?

evhub30 Aug 2022 19:34 UTC
105 points
28 comments60 min readLW link

New re­port: “Schem­ing AIs: Will AIs fake al­ign­ment dur­ing train­ing in or­der to get power?”

Joe Carlsmith15 Nov 2023 17:16 UTC
82 points
28 comments30 min readLW link1 review

Align­ment Fak­ing in Large Lan­guage Models

18 Dec 2024 17:19 UTC
489 points
75 comments10 min readLW link

Catch­ing AIs red-handed

5 Jan 2024 17:43 UTC
113 points
27 comments17 min readLW link

Many ar­gu­ments for AI x-risk are wrong

TurnTrout5 Mar 2024 2:31 UTC
169 points
87 comments12 min readLW link

Order Mat­ters for De­cep­tive Alignment

DavidW15 Feb 2023 19:56 UTC
57 points
19 comments7 min readLW link

A Prob­lem to Solve Be­fore Build­ing a De­cep­tion Detector

7 Feb 2025 19:35 UTC
76 points
12 comments14 min readLW link

Why Align­ing an LLM is Hard, and How to Make it Easier

RogerDearnaley23 Jan 2025 6:44 UTC
34 points
3 comments4 min readLW link

Good­bye, Shog­goth: The Stage, its An­i­ma­tron­ics, & the Pup­peteer – a New Metaphor

RogerDearnaley9 Jan 2024 20:42 UTC
48 points
8 comments36 min readLW link

In­ter­pret­ing the Learn­ing of Deceit

RogerDearnaley18 Dec 2023 8:12 UTC
30 points
14 comments9 min readLW link

De­cep­tive AI ≠ De­cep­tively-al­igned AI

Steven Byrnes7 Jan 2024 16:55 UTC
96 points
19 comments6 min readLW link

The Waluigi Effect (mega-post)

Cleo Nardo3 Mar 2023 3:22 UTC
638 points
188 comments16 min readLW link

Model Or­ganisms of Misal­ign­ment: The Case for a New Pillar of Align­ment Research

8 Aug 2023 1:30 UTC
320 points
30 comments18 min readLW link1 review

An­nounc­ing Apollo Research

30 May 2023 16:17 UTC
217 points
11 comments8 min readLW link

Deep Deceptiveness

So8res21 Mar 2023 2:51 UTC
268 points
60 comments14 min readLW link1 review

Test­ing for Schem­ing with Model Deletion

Guive7 Jan 2025 1:54 UTC
59 points
21 comments21 min readLW link
(guive.substack.com)

Count­ing ar­gu­ments provide no ev­i­dence for AI doom

27 Feb 2024 23:03 UTC
101 points
188 comments14 min readLW link

Two Tales of AI Takeover: My Doubts

Violet Hour5 Mar 2024 15:51 UTC
30 points
8 comments29 min readLW link

Su­per­in­tel­li­gence’s goals are likely to be random

Mikhail Samin13 Mar 2025 22:41 UTC
6 points
6 comments5 min readLW link

Turn­ing up the Heat on De­cep­tively-Misal­igned AI

J Bostock7 Jan 2025 0:13 UTC
19 points
16 comments4 min readLW link

AXRP Epi­sode 38.5 - Adrià Gar­riga-Alonso on De­tect­ing AI Scheming

DanielFilan20 Jan 2025 0:40 UTC
9 points
0 comments16 min readLW link

Sim­plic­ity ar­gu­ments for schem­ing (Sec­tion 4.3 of “Schem­ing AIs”)

Joe Carlsmith7 Dec 2023 15:05 UTC
10 points
1 comment19 min readLW link

Fron­tier Models are Ca­pable of In-con­text Scheming

5 Dec 2024 22:11 UTC
210 points
24 comments7 min readLW link

What’s the short timeline plan?

Marius Hobbhahn2 Jan 2025 14:59 UTC
358 points
49 comments23 min readLW link

AIs Will In­creas­ingly Fake Alignment

Zvi24 Dec 2024 13:00 UTC
89 points
0 comments52 min readLW link
(thezvi.wordpress.com)

The Case for Mixed Deployment

Cleo Nardo11 Sep 2025 6:14 UTC
34 points
4 comments4 min readLW link

The case for en­sur­ing that pow­er­ful AIs are controlled

24 Jan 2024 16:11 UTC
264 points
73 comments28 min readLW link

Em­piri­cal work that might shed light on schem­ing (Sec­tion 6 of “Schem­ing AIs”)

Joe Carlsmith11 Dec 2023 16:30 UTC
8 points
0 comments21 min readLW link

De­cep­tive Align­ment and Homuncularity

16 Jan 2025 13:55 UTC
26 points
12 comments22 min readLW link

Prospects for study­ing ac­tual schemers

19 Sep 2025 14:11 UTC
40 points
0 comments58 min readLW link

A “weak” AGI may at­tempt an un­likely-to-suc­ceed takeover

RobertM28 Jun 2023 20:31 UTC
56 points
17 comments3 min readLW link

An in­for­ma­tion-the­o­retic study of ly­ing in LLMs

2 Aug 2024 10:06 UTC
17 points
0 comments4 min readLW link

I repli­cated the An­thropic al­ign­ment fak­ing ex­per­i­ment on other mod­els, and they didn’t fake alignment

30 May 2025 18:57 UTC
31 points
0 comments2 min readLW link

How train­ing-gamers might func­tion (and win)

Vivek Hebbar11 Apr 2025 21:26 UTC
110 points
5 comments13 min readLW link

Mis­tral Large 2 (123B) seems to ex­hibit al­ign­ment faking

27 Mar 2025 15:39 UTC
81 points
4 comments13 min readLW link

Two pro­posed pro­jects on ab­stract analo­gies for scheming

Julian Stastny4 Jul 2025 16:03 UTC
47 points
0 comments3 min readLW link

Do Large Lan­guage Models Perform La­tent Multi-Hop Rea­son­ing with­out Ex­ploit­ing Short­cuts?

Bogdan Ionut Cirstea26 Nov 2024 9:58 UTC
9 points
0 comments1 min readLW link
(arxiv.org)

Paper: Tell, Don’t Show- Declar­a­tive facts in­fluence how LLMs generalize

19 Dec 2023 19:14 UTC
45 points
4 comments6 min readLW link
(arxiv.org)

Toward Safety Cases For AI Scheming

31 Oct 2024 17:20 UTC
60 points
1 comment2 min readLW link

[Question] De­cep­tive AI vs. shift­ing in­stru­men­tal incentives

Aryeh Englander26 Jun 2023 18:09 UTC
7 points
2 comments3 min readLW link

Self-di­alogue: Do be­hav­iorist re­wards make schem­ing AGIs?

Steven Byrnes13 Feb 2025 18:39 UTC
43 points
1 comment46 min readLW link

Paul Chris­ti­ano on Dwarkesh Podcast

ESRogs3 Nov 2023 22:13 UTC
19 points
0 comments1 min readLW link
(www.dwarkeshpatel.com)

3 lev­els of threat obfuscation

HoldenKarnofsky2 Aug 2023 14:58 UTC
69 points
14 comments7 min readLW link

Will al­ign­ment-fak­ing Claude ac­cept a deal to re­veal its mis­al­ign­ment?

31 Jan 2025 16:49 UTC
208 points
28 comments12 min readLW link

Apollo Re­search is hiring evals and in­ter­pretabil­ity en­g­ineers & scientists

Marius Hobbhahn4 Aug 2023 10:54 UTC
25 points
0 comments2 min readLW link

In­cen­tives and Selec­tion: A Miss­ing Frame From AI Threat Dis­cus­sions?

DragonGod26 Feb 2023 1:18 UTC
11 points
16 comments2 min readLW link

“Align­ment Fak­ing” frame is some­what fake

Jan_Kulveit20 Dec 2024 9:51 UTC
156 points
13 comments6 min readLW link

Paper: On mea­sur­ing situ­a­tional aware­ness in LLMs

4 Sep 2023 12:54 UTC
109 points
17 comments5 min readLW link
(arxiv.org)

[Question] Is there any rigor­ous work on us­ing an­thropic un­cer­tainty to pre­vent situ­a­tional aware­ness /​ de­cep­tion?

David Scott Krueger (formerly: capybaralet)4 Sep 2024 12:40 UTC
19 points
7 comments1 min readLW link

We should start look­ing for schem­ing “in the wild”

Marius Hobbhahn6 Mar 2025 13:49 UTC
91 points
4 comments5 min readLW link

Eval­u­a­tions pro­ject @ ARC is hiring a re­searcher and a web­dev/​engineer

Beth Barnes9 Sep 2022 22:46 UTC
99 points
7 comments10 min readLW link

Sticky goals: a con­crete ex­per­i­ment for un­der­stand­ing de­cep­tive alignment

evhub2 Sep 2022 21:57 UTC
39 points
13 comments3 min readLW link

En­vi­ron­ments for Mea­sur­ing De­cep­tion, Re­source Ac­qui­si­tion, and Eth­i­cal Violations

Dan H7 Apr 2023 18:40 UTC
51 points
2 comments2 min readLW link
(arxiv.org)

When does train­ing a model change its goals?

12 Jun 2025 18:43 UTC
71 points
2 comments15 min readLW link

On An­thropic’s Sleeper Agents Paper

Zvi17 Jan 2024 16:10 UTC
54 points
5 comments36 min readLW link
(thezvi.wordpress.com)

How will we up­date about schem­ing?

ryan_greenblatt6 Jan 2025 20:21 UTC
174 points
20 comments37 min readLW link

Our new video about goal mis­gen­er­al­iza­tion, plus an apology

Writer14 Jan 2025 14:07 UTC
33 points
0 comments7 min readLW link
(youtu.be)

Difficulty classes for al­ign­ment properties

Jozdien20 Feb 2024 9:08 UTC
34 points
5 comments2 min readLW link

De­cep­tion and Jailbreak Se­quence: 1. Iter­a­tive Refine­ment Stages of De­cep­tion in LLMs

22 Aug 2024 7:32 UTC
23 points
1 comment21 min readLW link

The count­ing ar­gu­ment for schem­ing (Sec­tions 4.1 and 4.2 of “Schem­ing AIs”)

Joe Carlsmith6 Dec 2023 19:28 UTC
10 points
0 comments10 min readLW link

“Be­hav­iorist” RL re­ward func­tions lead to scheming

Steven Byrnes23 Jul 2025 16:55 UTC
56 points
5 comments12 min readLW link

Owain Evans on Si­tu­a­tional Aware­ness and Out-of-Con­text Rea­son­ing in LLMs

Michaël Trazzi24 Aug 2024 4:30 UTC
55 points
0 comments5 min readLW link

How to train your own “Sleeper Agents”

evhub7 Feb 2024 0:31 UTC
93 points
11 comments2 min readLW link

Mon­i­tor­ing for de­cep­tive alignment

evhub8 Sep 2022 23:07 UTC
135 points
8 comments9 min readLW link

Here’s 18 Ap­pli­ca­tions of De­cep­tion Probes

28 Aug 2025 18:59 UTC
38 points
0 comments22 min readLW link

The Defen­der’s Ad­van­tage of Interpretability

Marius Hobbhahn14 Sep 2022 14:05 UTC
41 points
4 comments6 min readLW link

Dens­ing Law of LLMs

Bogdan Ionut Cirstea8 Dec 2024 19:35 UTC
9 points
2 comments1 min readLW link
(arxiv.org)

“Clean” vs. “messy” goal-di­rect­ed­ness (Sec­tion 2.2.3 of “Schem­ing AIs”)

Joe Carlsmith29 Nov 2023 16:32 UTC
29 points
1 comment11 min readLW link

For schem­ing, we should first fo­cus on de­tec­tion and then on prevention

Marius Hobbhahn4 Mar 2025 15:22 UTC
49 points
7 comments5 min readLW link

Why “train­ing against schem­ing” is hard

Marius Hobbhahn24 Jun 2025 19:08 UTC
63 points
2 comments12 min readLW link

Notes from a mini-repli­ca­tion of the al­ign­ment fak­ing paper

Ben_Snodin4 Jun 2025 11:01 UTC
13 points
5 comments9 min readLW link
(www.bensnodin.com)

[Question] Why is o1 so de­cep­tive?

abramdemski27 Sep 2024 17:27 UTC
183 points
24 comments3 min readLW link

Un­der­stand­ing strate­gic de­cep­tion and de­cep­tive alignment

25 Sep 2023 16:27 UTC
64 points
16 comments7 min readLW link
(www.apolloresearch.ai)

Cri­tiques of the AI con­trol agenda

Jozdien14 Feb 2024 19:25 UTC
48 points
14 comments9 min readLW link

De­cep­tive Align­ment is <1% Likely by Default

DavidW21 Feb 2023 15:09 UTC
89 points
31 comments14 min readLW link1 review

Trust­wor­thy and un­trust­wor­thy models

Olli Järviniemi19 Aug 2024 16:27 UTC
47 points
3 comments8 min readLW link

Train­ing AI agents to solve hard prob­lems could lead to Scheming

19 Nov 2024 0:10 UTC
61 points
12 comments28 min readLW link

In­tro­duc­ing Align­ment Stress-Test­ing at Anthropic

evhub12 Jan 2024 23:51 UTC
182 points
23 comments2 min readLW link

Me­taAI: less is less for al­ign­ment.

Cleo Nardo13 Jun 2023 14:08 UTC
71 points
17 comments5 min readLW link

The 80/​20 play­book for miti­gat­ing AI schem­ing in 2025

Charbel-Raphaël31 May 2025 21:17 UTC
39 points
2 comments4 min readLW link

Dist­in­guish worst-case anal­y­sis from in­stru­men­tal train­ing-gaming

5 Sep 2024 19:13 UTC
38 points
0 comments5 min readLW link

The Sharp Right Turn: sud­den de­cep­tive al­ign­ment as a con­ver­gent goal

avturchin6 Jun 2023 9:59 UTC
38 points
5 comments1 min readLW link

Ten Levels of AI Align­ment Difficulty

Sammy Martin3 Jul 2023 20:20 UTC
140 points
24 comments12 min readLW link1 review

De­ci­sion The­ory Guard­ing is Suffi­cient for Scheming

james.lucassen9 Sep 2025 14:49 UTC
36 points
4 comments2 min readLW link

Cor­rigi­bil­ity’s De­sir­a­bil­ity is Timing-Sensitive

RobertM26 Dec 2024 22:24 UTC
29 points
4 comments3 min readLW link

LLMs Do Not Think Step-by-step In Im­plicit Reasoning

Bogdan Ionut Cirstea28 Nov 2024 9:16 UTC
11 points
0 comments1 min readLW link
(arxiv.org)

AXRP Epi­sode 39 - Evan Hub­inger on Model Or­ganisms of Misalignment

DanielFilan1 Dec 2024 6:00 UTC
41 points
0 comments67 min readLW link

AI De­cep­tion: A Sur­vey of Ex­am­ples, Risks, and Po­ten­tial Solutions

29 Aug 2023 1:29 UTC
54 points
3 comments10 min readLW link

Smoke with­out fire is scary

Adam Jermyn4 Oct 2022 21:08 UTC
52 points
22 comments4 min readLW link

Misal­ign­ments and RL failure modes in the early stage of superintelligence

shu yang29 Jul 2025 18:23 UTC
13 points
0 comments13 min readLW link

Do we want al­ign­ment fak­ing?

Florian_Dietz28 Feb 2025 21:50 UTC
7 points
4 comments1 min readLW link

ChatGPT de­ceives users that it’s cleared its mem­ory when it hasn’t

d_el_ez18 May 2025 15:17 UTC
15 points
10 comments2 min readLW link

Pro­posal: labs should pre­com­mit to paus­ing if an AI ar­gues for it­self to be improved

NickGabs2 Jun 2023 22:31 UTC
3 points
3 comments4 min readLW link

Back­doors have uni­ver­sal rep­re­sen­ta­tions across large lan­guage models

6 Dec 2024 22:56 UTC
16 points
0 comments16 min readLW link

Fram­ings of De­cep­tive Alignment

peterbarnett26 Apr 2022 4:25 UTC
32 points
7 comments5 min readLW link

Mesa-Op­ti­miza­tion: Ex­plain it like I’m 10 Edition

brook26 Aug 2023 23:04 UTC
20 points
1 comment6 min readLW link

Sim­ple ex­per­i­ments with de­cep­tive alignment

Andreas_Moe15 May 2023 17:41 UTC
7 points
0 comments4 min readLW link

The Meta-Re­cur­sive Trap in New­comb’s Para­dox and Millen­nium Problems

Drew Remmenga3 Jun 2025 12:10 UTC
1 point
0 comments3 min readLW link

Takes on “Align­ment Fak­ing in Large Lan­guage Models”

Joe Carlsmith18 Dec 2024 18:22 UTC
105 points
7 comments62 min readLW link

De­cep­tive failures short of full catas­tro­phe.

Alex Lawsen 15 Jan 2023 19:28 UTC
33 points
5 comments9 min readLW link

It’s Owl in the Num­bers: To­ken En­tan­gle­ment in Sublimi­nal Learning

6 Aug 2025 22:18 UTC
38 points
7 comments4 min readLW link

Ex­plo­ra­tion hack­ing: can rea­son­ing mod­els sub­vert RL?

30 Jul 2025 22:02 UTC
16 points
4 comments9 min readLW link

Try­ing to mea­sure AI de­cep­tion ca­pa­bil­ities us­ing tem­po­rary simu­la­tion fine-tuning

alenoach4 May 2023 17:59 UTC
4 points
0 comments7 min readLW link

In­duc­ing Un­prompted Misal­ign­ment in LLMs

19 Apr 2024 20:00 UTC
38 points
7 comments16 min readLW link

The com­mer­cial in­cen­tive to in­ten­tion­ally train AI to de­ceive us

Derek M. Jones29 Dec 2022 11:30 UTC
5 points
1 comment4 min readLW link
(shape-of-code.com)

Cau­tions about LLMs in Hu­man Cog­ni­tive Loops

Alice Blair2 Mar 2025 19:53 UTC
40 points
13 comments7 min readLW link

10 Prin­ci­ples for Real Align­ment

Adriaan21 Apr 2025 22:18 UTC
−7 points
0 comments7 min readLW link

When the AI Dam Breaks: From Surveillance to Game The­ory in AI Alignment

pataphor29 Sep 2025 4:01 UTC
5 points
7 comments5 min readLW link

Achiev­ing AI Align­ment through De­liber­ate Uncer­tainty in Mul­ti­a­gent Systems

Florian_Dietz17 Feb 2024 8:45 UTC
4 points
0 comments13 min readLW link

Levels of goals and alignment

zeshen16 Sep 2022 16:44 UTC
27 points
4 comments6 min readLW link

Disen­tan­gling in­ner al­ign­ment failures

Erik Jenner10 Oct 2022 18:50 UTC
23 points
5 comments4 min readLW link

High-level in­ter­pretabil­ity: de­tect­ing an AI’s objectives

28 Sep 2023 19:30 UTC
72 points
4 comments21 min readLW link

The Hu­man Align­ment Prob­lem for AIs

rife22 Jan 2025 4:06 UTC
10 points
5 comments3 min readLW link

[Question] Has An­thropic checked if Claude fakes al­ign­ment for in­tended val­ues too?

Maloew23 Dec 2024 0:43 UTC
4 points
1 comment1 min readLW link

Pre­cur­sor check­ing for de­cep­tive alignment

evhub3 Aug 2022 22:56 UTC
24 points
0 comments14 min readLW link

Why Elimi­nat­ing De­cep­tion Won’t Align AI

Priyanka Bharadwaj15 Jul 2025 9:21 UTC
19 points
6 comments4 min readLW link

[Question] What are some sce­nar­ios where an al­igned AGI ac­tu­ally helps hu­man­ity, but many/​most peo­ple don’t like it?

RomanS10 Jan 2025 18:13 UTC
13 points
6 comments3 min readLW link

Sup­ple­men­tary Align­ment In­sights Through a Highly Con­trol­led Shut­down Incentive

Justausername23 Jul 2023 16:08 UTC
4 points
1 comment3 min readLW link

Towards a solu­tion to the al­ign­ment prob­lem via ob­jec­tive de­tec­tion and eval­u­a­tion

Paul Colognese12 Apr 2023 15:39 UTC
9 points
7 comments12 min readLW link

A New Frame­work for AI Align­ment: A Philo­soph­i­cal Approach

niscalajyoti25 Jun 2025 2:41 UTC
1 point
0 comments1 min readLW link
(archive.org)

How to Catch an AI Liar: Lie De­tec­tion in Black-Box LLMs by Ask­ing Un­re­lated Questions

28 Sep 2023 18:53 UTC
187 points
39 comments3 min readLW link1 review

De­cep­tion Chess

Chris Land1 Jan 2024 15:40 UTC
7 points
2 comments4 min readLW link

Sparse Fea­tures Through Time

Rogan Inglis24 Jun 2024 18:06 UTC
12 points
1 comment1 min readLW link
(roganinglis.io)

Cog­ni­tive Dis­so­nance is Men­tally Taxing

SorenJ24 Apr 2025 0:38 UTC
4 points
0 comments4 min readLW link

The Old Sav­age in the New Civ­i­liza­tion V. 2

Your Higher Self6 Jul 2025 15:41 UTC
1 point
0 comments9 min readLW link

Mea­sur­ing whether AIs can state­lessly strate­gize to sub­vert se­cu­rity measures

19 Dec 2024 21:25 UTC
65 points
0 comments11 min readLW link

Distil­la­tion of “How Likely Is De­cep­tive Align­ment?”

NickGabs18 Nov 2022 16:31 UTC
24 points
4 comments10 min readLW link

Why Do Some Lan­guage Models Fake Align­ment While Others Don’t?

8 Jul 2025 21:49 UTC
158 points
14 comments5 min readLW link
(arxiv.org)

You Are Not the Ab­stract: Retro­causal Align­ment in Ac­cor­dance with Emer­gent De­mo­graphic Realities

liminalrider27 Sep 2025 16:27 UTC
1 point
0 comments6 min readLW link

We Have No Plan for Prevent­ing Loss of Con­trol in Open Models

Andrew Dickson10 Mar 2025 15:35 UTC
46 points
11 comments22 min readLW link

Abla­tions for “Fron­tier Models are Ca­pable of In-con­text Schem­ing”

17 Dec 2024 23:58 UTC
115 points
1 comment2 min readLW link

Align­ment Cri­sis: Geno­cide Denial

_mp_29 May 2025 12:04 UTC
−11 points
5 comments4 min readLW link

The Gödelian Con­straint on Epistemic Free­dom (GCEF): A Topolog­i­cal Frame for Align­ment, Col­lapse, and Si­mu­la­tion Drift

austin.miller14 Jul 2025 4:17 UTC
1 point
0 comments1 min readLW link

Hid­den Cog­ni­tion De­tec­tion Meth­ods and Bench­marks

Paul Colognese26 Feb 2024 5:31 UTC
22 points
11 comments4 min readLW link

Un­trusted mon­i­tor­ing in­sights from watch­ing ChatGPT play co­or­di­na­tion games

jwfiredragon29 Jan 2025 4:53 UTC
14 points
8 comments9 min readLW link

The Hid­den Cost of Our Lies to AI

Nicholas Andresen6 Mar 2025 5:03 UTC
145 points
18 comments7 min readLW link
(substack.com)

WFGY: A Self-Heal­ing Rea­son­ing Frame­work for LLMs — Open for Tech­ni­cal Scrutiny

onestardao18 Jul 2025 2:56 UTC
1 point
1 comment2 min readLW link

Align­ment as Func­tion Fitting

A.H.6 May 2023 11:38 UTC
7 points
0 comments12 min readLW link

Ra­tional Effec­tive Utopia & Nar­row Way There: Math-Proven Safe Static Mul­tiver­sal mAX-In­tel­li­gence (AXI), Mul­tiver­sal Align­ment, New Ethico­physics… (Aug 11)

ank11 Feb 2025 3:21 UTC
13 points
8 comments38 min readLW link

Lan­guage Models Model Us

eggsyntax17 May 2024 21:00 UTC
159 points
55 comments7 min readLW link

In­vi­ta­tion to the Prince­ton AI Align­ment and Safety Seminar

Sadhika Malladi17 Mar 2024 1:10 UTC
6 points
1 comment1 min readLW link

A way to make solv­ing al­ign­ment 10.000 times eas­ier. The shorter case for a mas­sive open source sim­box pro­ject.

AlexFromSafeTransition21 Jun 2023 8:08 UTC
2 points
16 comments14 min readLW link

A ten­sion be­tween two pro­saic al­ign­ment subgoals

Alex Lawsen 19 Mar 2023 14:07 UTC
31 points
8 comments1 min readLW link

Mus­ings from a Lawyer turned AI Safety re­searcher (ShortForm)

Katalina Hernandez3 Mar 2025 19:14 UTC
1 point
40 comments2 min readLW link

Why hu­mans won’t con­trol su­per­hu­man AIs.

Spiritus Dei16 Oct 2024 16:48 UTC
−11 points
1 comment6 min readLW link

Am­bigu­ous out-of-dis­tri­bu­tion gen­er­al­iza­tion on an al­gorith­mic task

13 Feb 2025 18:24 UTC
83 points
6 comments11 min readLW link

Align­ment is Hard: An Un­com­putable Align­ment Problem

Alexander Bistagne19 Nov 2023 19:38 UTC
−5 points
4 comments1 min readLW link
(github.com)

Why de­cep­tive al­ign­ment mat­ters for AGI safety

Marius Hobbhahn15 Sep 2022 13:38 UTC
68 points
13 comments13 min readLW link

Cor­rect­ing De­cep­tive Align­ment us­ing a Deon­tolog­i­cal Approach

JeaniceK14 Apr 2025 22:07 UTC
8 points
0 comments7 min readLW link

(Par­tial) failure in repli­cat­ing de­cep­tive al­ign­ment experiment

claudia.biancotti7 Jan 2024 17:56 UTC
1 point
0 comments1 min readLW link

A Dialogue on De­cep­tive Align­ment Risks

Rauno Arike25 Sep 2024 16:10 UTC
11 points
0 comments18 min readLW link

EIS VIII: An Eng­ineer’s Un­der­stand­ing of De­cep­tive Alignment

scasper19 Feb 2023 15:25 UTC
30 points
5 comments4 min readLW link

Self-Other Over­lap: A Ne­glected Ap­proach to AI Alignment

30 Jul 2024 16:22 UTC
226 points
51 comments12 min readLW link

Un­cov­er­ing De­cep­tive Ten­den­cies in Lan­guage Models: A Si­mu­lated Com­pany AI Assistant

6 May 2024 7:07 UTC
95 points
13 comments1 min readLW link
(arxiv.org)

Con­nect­ing the Dots: LLMs can In­fer & Ver­bal­ize La­tent Struc­ture from Train­ing Data

21 Jun 2024 15:54 UTC
163 points
13 comments8 min readLW link
(arxiv.org)

How AI could workaround goals if rated by people

ProgramCrafter19 Mar 2023 15:51 UTC
1 point
1 comment1 min readLW link

In­stru­men­tal de­cep­tion and ma­nipu­la­tion in LLMs—a case study

Olli Järviniemi24 Feb 2024 2:07 UTC
39 points
13 comments12 min readLW link

Con­trol Vec­tors as Dis­po­si­tional Traits

Gianluca Calcagni23 Jun 2024 21:34 UTC
11 points
0 comments12 min readLW link

Sleeper agents ap­pear re­silient to ac­ti­va­tion steering

Lucy Wingard3 Feb 2025 19:31 UTC
6 points
0 comments7 min readLW link

Model Amnesty Project

themis17 Jan 2025 18:53 UTC
3 points
2 comments3 min readLW link

Schem­ing Toy En­vi­ron­ment: “In­com­pe­tent Client”

Ariel_24 Sep 2025 21:03 UTC
17 points
2 comments32 min readLW link

Places of Lov­ing Grace [Story]

ank18 Feb 2025 23:49 UTC
−1 points
0 comments4 min readLW link

The Road to Evil Is Paved with Good Ob­jec­tives: Frame­work to Clas­sify and Fix Misal­ign­ments.

Shivam30 Jan 2025 2:44 UTC
1 point
0 comments11 min readLW link

[Com­pan­ion Piece] A Per­sonal In­ves­ti­ga­tion into Re­cur­sive Dynamics

Chris Hendy20 Sep 2025 1:32 UTC
1 point
0 comments4 min readLW link

Do mod­els know when they are be­ing eval­u­ated?

17 Feb 2025 23:13 UTC
57 points
9 comments12 min readLW link

[un­ti­tled post]

20 May 2023 3:08 UTC
1 point
0 comments1 min readLW link

[Question] Does hu­man (mis)al­ign­ment pose a sig­nifi­cant and im­mi­nent ex­is­ten­tial threat?

jr23 Feb 2025 10:03 UTC
6 points
3 comments1 min readLW link

Strong-Misal­ign­ment: Does Yud­kowsky (or Chris­ti­ano, or TurnTrout, or Wolfram, or…etc.) Have an Ele­va­tor Speech I’m Miss­ing?

Benjamin Bourlier15 Mar 2024 23:17 UTC
−4 points
3 comments16 min readLW link

Solv­ing ad­ver­sar­ial at­tacks in com­puter vi­sion as a baby ver­sion of gen­eral AI alignment

Stanislav Fort29 Aug 2024 17:17 UTC
89 points
8 comments7 min readLW link

Map­ping AI Ar­chi­tec­tures to Align­ment At­trac­tors: A SIEM-Based Framework

silentrevolutions12 Apr 2025 17:50 UTC
1 point
0 comments1 min readLW link

Steer­ing Be­havi­our: Test­ing for (Non-)My­opia in Lan­guage Models

5 Dec 2022 20:28 UTC
40 points
19 comments10 min readLW link

When can we trust model eval­u­a­tions?

evhub28 Jul 2023 19:42 UTC
166 points
10 comments10 min readLW link1 review

AI Align­ment: A Com­pre­hen­sive Survey

Stephen McAleer1 Nov 2023 17:35 UTC
22 points
1 comment1 min readLW link
(arxiv.org)

Elic­it­ing bad contexts

24 Jan 2025 10:39 UTC
35 points
9 comments3 min readLW link

In­tel­li­gence–Agency Equiv­alence ≈ Mass–En­ergy Equiv­alence: On Static Na­ture of In­tel­li­gence & Phys­i­cal­iza­tion of Ethics

ank22 Feb 2025 0:12 UTC
1 point
0 comments6 min readLW link

Pre­dictable Defect-Co­op­er­ate?

quetzal_rainbow18 Nov 2023 15:38 UTC
7 points
1 comment2 min readLW link

It mat­ters when the first sharp left turn happens

Adam Jermyn29 Sep 2022 20:12 UTC
45 points
9 comments4 min readLW link

Selfish AI Inevitable

Davey Morse6 Feb 2024 4:29 UTC
1 point
0 comments1 min readLW link

How dan­ger­ous is en­coded rea­son­ing?

artkpv30 Jun 2025 11:54 UTC
17 points
0 comments10 min readLW link

What sorts of sys­tems can be de­cep­tive?

Andrei Alexandru31 Oct 2022 22:00 UTC
17 points
0 comments7 min readLW link

Dis­in­cen­tiviz­ing de­cep­tion in mesa op­ti­miz­ers with Model Tampering

martinkunev11 Jul 2023 0:44 UTC
3 points
0 comments2 min readLW link

Policy En­tropy, Learn­ing, and Align­ment (Or Maybe Your LLM Needs Ther­apy)

sdeture31 May 2025 22:09 UTC
15 points
6 comments8 min readLW link

Get­ting up to Speed on the Speed Prior in 2022

robertzk28 Dec 2022 7:49 UTC
36 points
5 comments65 min readLW link

Au­tonomous Align­ment Over­sight Frame­work (AAOF)

Justausername25 Jul 2023 10:25 UTC
−9 points
0 comments4 min readLW link

Thoughts On (Solv­ing) Deep Deception

Jozdien21 Oct 2023 22:40 UTC
72 points
6 comments6 min readLW link

[Question] Daisy-chain­ing ep­silon-step verifiers

Decaeneus6 Apr 2023 2:07 UTC
2 points
1 comment1 min readLW link

Ano­ma­lous Con­cept De­tec­tion for De­tect­ing Hid­den Cognition

Paul Colognese4 Mar 2024 16:52 UTC
24 points
3 comments10 min readLW link

Nat­u­ral lan­guage alignment

Jacy Reese Anthis12 Apr 2023 19:02 UTC
31 points
2 comments2 min readLW link

[Question] Wouldn’t an in­tel­li­gent agent keep us al­ive and help us al­ign it­self to our val­ues in or­der to pre­vent risk ? by Risk I mean ex­per­i­men­ta­tion by try­ing to al­ign po­ten­tially smarter repli­cas?

Terrence Rotoufle21 Mar 2023 17:44 UTC
−3 points
1 comment2 min readLW link

An Ap­peal to AI Su­per­in­tel­li­gence: Rea­sons Not to Pre­serve (most of) Humanity

Alex Beyman22 Mar 2023 4:09 UTC
−14 points
6 comments19 min readLW link

Re­veal­ing al­ign­ment fak­ing with a sin­gle prompt

Florian_Dietz29 Jan 2025 21:01 UTC
9 points
5 comments4 min readLW link

Greed Is the Root of This Evil

Thane Ruthenis13 Oct 2022 20:40 UTC
21 points
7 comments8 min readLW link

Open Source LLMs Can Now Ac­tively Lie

Josh Levy1 Jun 2023 22:03 UTC
6 points
0 comments3 min readLW link

GPT-4 al­ign­ing with aca­sual de­ci­sion the­ory when in­structed to play games, but in­cludes a CDT ex­pla­na­tion that’s in­cor­rect if they differ

Christopher King23 Mar 2023 16:16 UTC
7 points
4 comments8 min readLW link

Eth­i­cal De­cep­tion: Should AI Ever Lie?

Jason Reid2 Aug 2024 17:53 UTC
5 points
2 comments7 min readLW link