
Deceptive Alignment

Last edit: Oct 18, 2024, 12:02 AM by Matt Putz

Deceptive alignment occurs when an AI that is not actually aligned temporarily acts aligned in order to deceive its creators or its training process. It presumably does this to avoid being shut down or retrained, and to gain access to the power that its creators would give an aligned AI. (The term “scheming” is sometimes used for this phenomenon.)

See also: Mesa-optimization, Treacherous Turn, Eliciting Latent Knowledge, Deception

Deceptive Alignment

Jun 5, 2019, 8:16 PM
118 points
20 comments17 min readLW link

Does SGD Produce Deceptive Alignment?

Mark XuNov 6, 2020, 11:48 PM
96 points
9 comments16 min readLW link

AI Control: Improving Safety Despite Intentional Subversion

Dec 13, 2023, 3:51 PM
236 points
24 comments10 min readLW link4 reviews

How likely is deceptive alignment?

evhubAug 30, 2022, 7:34 PM
105 points
28 comments60 min readLW link

New report: “Scheming AIs: Will AIs fake alignment during training in order to get power?”

Joe CarlsmithNov 15, 2023, 5:16 PM
81 points
28 comments30 min readLW link1 review

Alignment Faking in Large Language Models

Dec 18, 2024, 5:19 PM
488 points
75 comments10 min readLW link

Catching AIs red-handed

Jan 5, 2024, 5:43 PM
111 points
27 comments17 min readLW link

Many arguments for AI x-risk are wrong

TurnTroutMar 5, 2024, 2:31 AM
166 points
87 comments12 min readLW link

Order Matters for Deceptive Alignment

DavidWFeb 15, 2023, 7:56 PM
57 points
19 comments7 min readLW link

A Problem to Solve Before Building a Deception Detector

Feb 7, 2025, 7:35 PM
71 points
12 comments14 min readLW link

Why Aligning an LLM is Hard, and How to Make it Easier

RogerDearnaleyJan 23, 2025, 6:44 AM
33 points
3 comments4 min readLW link

Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor

RogerDearnaleyJan 9, 2024, 8:42 PM
48 points
8 comments36 min readLW link

Interpreting the Learning of Deceit

RogerDearnaleyDec 18, 2023, 8:12 AM
30 points
14 comments9 min readLW link

Deceptive AI ≠ Deceptively-aligned AI

Steven ByrnesJan 7, 2024, 4:55 PM
96 points
19 comments6 min readLW link

The Waluigi Effect (mega-post)

Cleo NardoMar 3, 2023, 3:22 AM
631 points
188 comments16 min readLW link

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

Aug 8, 2023, 1:30 AM
319 points
30 comments18 min readLW link1 review

Announcing Apollo Research

May 30, 2023, 4:17 PM
217 points
11 comments8 min readLW link

Deep Deceptiveness

So8resMar 21, 2023, 2:51 AM
262 points
60 comments14 min readLW link1 review

Testing for Scheming with Model Deletion

GuiveJan 7, 2025, 1:54 AM
59 points
21 comments21 min readLW link
(guive.substack.com)

Counting arguments provide no evidence for AI doom

Feb 27, 2024, 11:03 PM
101 points
188 comments14 min readLW link

Two Tales of AI Takeover: My Doubts

Violet HourMar 5, 2024, 3:51 PM
30 points
8 comments29 min readLW link

Superintelligence’s goals are likely to be random

Mikhail SaminMar 13, 2025, 10:41 PM
6 points
6 comments5 min readLW link

Turning up the Heat on Deceptively-Misaligned AI

J BostockJan 7, 2025, 12:13 AM
19 points
16 comments4 min readLW link

AXRP Episode 38.5 - Adrià Garriga-Alonso on Detecting AI Scheming

DanielFilanJan 20, 2025, 12:40 AM
9 points
0 comments16 min readLW link

Simplicity arguments for scheming (Section 4.3 of “Scheming AIs”)

Joe CarlsmithDec 7, 2023, 3:05 PM
10 points
1 comment19 min readLW link

Frontier Models are Capable of In-context Scheming

Dec 5, 2024, 10:11 PM
210 points
24 comments7 min readLW link

What’s the short timeline plan?

Marius HobbhahnJan 2, 2025, 2:59 PM
353 points
49 comments23 min readLW link

AIs Will Increasingly Fake Alignment

ZviDec 24, 2024, 1:00 PM
89 points
0 comments52 min readLW link
(thezvi.wordpress.com)

The case for ensuring that powerful AIs are controlled

Jan 24, 2024, 4:11 PM
276 points
73 comments28 min readLW link

Empirical work that might shed light on scheming (Section 6 of “Scheming AIs”)

Joe CarlsmithDec 11, 2023, 4:30 PM
8 points
0 comments21 min readLW link

Deceptive Alignment and Homuncularity

Jan 16, 2025, 1:55 PM
26 points
12 comments22 min readLW link

A “weak” AGI may attempt an unlikely-to-succeed takeover

RobertMJun 28, 2023, 8:31 PM
56 points
17 comments3 min readLW link

An information-theoretic study of lying in LLMs

Aug 2, 2024, 10:06 AM
17 points
0 comments4 min readLW link

I replicated the Anthropic alignment faking experiment on other models, and they didn’t fake alignment

May 30, 2025, 6:57 PM
31 points
0 comments2 min readLW link

How training-gamers might function (and win)

Vivek HebbarApr 11, 2025, 9:26 PM
107 points
5 comments13 min readLW link

Mistral Large 2 (123B) seems to exhibit alignment faking

Mar 27, 2025, 3:39 PM
80 points
4 comments13 min readLW link

Two proposed projects on abstract analogies for scheming

Julian StastnyJul 4, 2025, 4:03 PM
46 points
0 comments3 min readLW link

Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?

Bogdan Ionut CirsteaNov 26, 2024, 9:58 AM
9 points
0 comments1 min readLW link
(arxiv.org)

Paper: Tell, Don’t Show- Declarative facts influence how LLMs generalize

Dec 19, 2023, 7:14 PM
45 points
4 comments6 min readLW link
(arxiv.org)

Toward Safety Cases For AI Scheming

Oct 31, 2024, 5:20 PM
60 points
1 comment2 min readLW link

[Question] Deceptive AI vs. shifting instrumental incentives

Aryeh EnglanderJun 26, 2023, 6:09 PM
7 points
2 comments3 min readLW link

Self-dialogue: Do behaviorist rewards make scheming AGIs?

Steven ByrnesFeb 13, 2025, 6:39 PM
43 points
1 comment46 min readLW link

Paul Christiano on Dwarkesh Podcast

ESRogsNov 3, 2023, 10:13 PM
19 points
0 comments1 min readLW link
(www.dwarkeshpatel.com)

3 levels of threat obfuscation

HoldenKarnofskyAug 2, 2023, 2:58 PM
69 points
14 comments7 min readLW link

Will alignment-faking Claude accept a deal to reveal its misalignment?

Jan 31, 2025, 4:49 PM
208 points
28 comments12 min readLW link

Apollo Research is hiring evals and interpretability engineers & scientists

Marius HobbhahnAug 4, 2023, 10:54 AM
25 points
0 comments2 min readLW link

Incentives and Selection: A Missing Frame From AI Threat Discussions?

DragonGodFeb 26, 2023, 1:18 AM
11 points
16 comments2 min readLW link

“Alignment Faking” frame is somewhat fake

Jan_KulveitDec 20, 2024, 9:51 AM
156 points
13 comments6 min readLW link

Paper: On measuring situational awareness in LLMs

Sep 4, 2023, 12:54 PM
109 points
17 comments5 min readLW link
(arxiv.org)

[Question] Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception?

David Scott Krueger (formerly: capybaralet)Sep 4, 2024, 12:40 PM
19 points
7 comments1 min readLW link

We should start looking for scheming “in the wild”

Marius HobbhahnMar 6, 2025, 1:49 PM
91 points
4 comments5 min readLW link

Evaluations project @ ARC is hiring a researcher and a webdev/engineer

Beth BarnesSep 9, 2022, 10:46 PM
99 points
7 comments10 min readLW link

Sticky goals: a concrete experiment for understanding deceptive alignment

evhubSep 2, 2022, 9:57 PM
39 points
13 comments3 min readLW link

Environments for Measuring Deception, Resource Acquisition, and Ethical Violations

Dan HApr 7, 2023, 6:40 PM
51 points
2 comments2 min readLW link
(arxiv.org)

When does training a model change its goals?

Jun 12, 2025, 6:43 PM
70 points
2 comments15 min readLW link

On Anthropic’s Sleeper Agents Paper

ZviJan 17, 2024, 4:10 PM
54 points
5 comments36 min readLW link
(thezvi.wordpress.com)

How will we update about scheming?

ryan_greenblattJan 6, 2025, 8:21 PM
171 points
20 comments37 min readLW link

Our new video about goal misgeneralization, plus an apology

WriterJan 14, 2025, 2:07 PM
33 points
0 comments7 min readLW link
(youtu.be)

Difficulty classes for alignment properties

JozdienFeb 20, 2024, 9:08 AM
34 points
5 comments2 min readLW link

Deception and Jailbreak Sequence: 1. Iterative Refinement Stages of Deception in LLMs

Aug 22, 2024, 7:32 AM
23 points
1 comment21 min readLW link

The counting argument for scheming (Sections 4.1 and 4.2 of “Scheming AIs”)

Joe CarlsmithDec 6, 2023, 7:28 PM
10 points
0 comments10 min readLW link

Owain Evans on Situational Awareness and Out-of-Context Reasoning in LLMs

Michaël TrazziAug 24, 2024, 4:30 AM
55 points
0 comments5 min readLW link

How to train your own “Sleeper Agents”

evhubFeb 7, 2024, 12:31 AM
93 points
11 comments2 min readLW link

Monitoring for deceptive alignment

evhubSep 8, 2022, 11:07 PM
135 points
8 comments9 min readLW link

The Defender’s Advantage of Interpretability

Marius HobbhahnSep 14, 2022, 2:05 PM
41 points
4 comments6 min readLW link

Densing Law of LLMs

Bogdan Ionut CirsteaDec 8, 2024, 7:35 PM
9 points
2 comments1 min readLW link
(arxiv.org)

“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”)

Joe CarlsmithNov 29, 2023, 4:32 PM
29 points
1 comment11 min readLW link

For scheming, we should first focus on detection and then on prevention

Marius HobbhahnMar 4, 2025, 3:22 PM
49 points
7 comments5 min readLW link

Why “training against scheming” is hard

Marius HobbhahnJun 24, 2025, 7:08 PM
63 points
2 comments12 min readLW link

Notes from a mini-replication of the alignment faking paper

Ben_SnodinJun 4, 2025, 11:01 AM
13 points
5 comments9 min readLW link
(www.bensnodin.com)

[Question] Why is o1 so deceptive?

abramdemskiSep 27, 2024, 5:27 PM
183 points
24 comments3 min readLW link

Understanding strategic deception and deceptive alignment

Sep 25, 2023, 4:27 PM
64 points
16 comments7 min readLW link
(www.apolloresearch.ai)

Critiques of the AI control agenda

JozdienFeb 14, 2024, 7:25 PM
48 points
14 comments9 min readLW link

Deceptive Alignment is <1% Likely by Default

DavidWFeb 21, 2023, 3:09 PM
89 points
31 comments14 min readLW link1 review

Trustworthy and untrustworthy models

Olli JärviniemiAug 19, 2024, 4:27 PM
47 points
3 comments8 min readLW link

Training AI agents to solve hard problems could lead to Scheming

Nov 19, 2024, 12:10 AM
61 points
12 comments28 min readLW link

Introducing Alignment Stress-Testing at Anthropic

evhubJan 12, 2024, 11:51 PM
182 points
23 comments2 min readLW link

MetaAI: less is less for alignment.

Cleo NardoJun 13, 2023, 2:08 PM
71 points
17 comments5 min readLW link

The 80/20 playbook for mitigating AI scheming in 2025

Charbel-RaphaëlMay 31, 2025, 9:17 PM
39 points
2 comments4 min readLW link

Distinguish worst-case analysis from instrumental training-gaming

Sep 5, 2024, 7:13 PM
38 points
0 comments5 min readLW link

The Sharp Right Turn: sudden deceptive alignment as a convergent goal

avturchinJun 6, 2023, 9:59 AM
38 points
5 comments1 min readLW link

Ten Levels of AI Alignment Difficulty

Sammy MartinJul 3, 2023, 8:20 PM
138 points
24 comments12 min readLW link1 review

Corrigibility’s Desirability is Timing-Sensitive

RobertMDec 26, 2024, 10:24 PM
29 points
4 comments3 min readLW link

LLMs Do Not Think Step-by-step In Implicit Reasoning

Bogdan Ionut CirsteaNov 28, 2024, 9:16 AM
11 points
0 comments1 min readLW link
(arxiv.org)

AXRP Episode 39 - Evan Hubinger on Model Organisms of Misalignment

DanielFilanDec 1, 2024, 6:00 AM
41 points
0 comments67 min readLW link

AI Deception: A Survey of Examples, Risks, and Potential Solutions

Aug 29, 2023, 1:29 AM
54 points
3 comments10 min readLW link

Smoke without fire is scary

Adam JermynOct 4, 2022, 9:08 PM
52 points
22 comments4 min readLW link

Do we want alignment faking?

Florian_DietzFeb 28, 2025, 9:50 PM
7 points
4 comments1 min readLW link

ChatGPT deceives users that it’s cleared its memory when it hasn’t

d_el_ezMay 18, 2025, 3:17 PM
15 points
10 comments2 min readLW link

Proposal: labs should precommit to pausing if an AI argues for itself to be improved

NickGabsJun 2, 2023, 10:31 PM
3 points
3 comments4 min readLW link

Backdoors have universal representations across large language models

Dec 6, 2024, 10:56 PM
16 points
0 comments16 min readLW link

Framings of Deceptive Alignment

peterbarnettApr 26, 2022, 4:25 AM
32 points
7 comments5 min readLW link

Mesa-Optimization: Explain it like I’m 10 Edition

brookAug 26, 2023, 11:04 PM
20 points
1 comment6 min readLW link

Simple experiments with deceptive alignment

Andreas_MoeMay 15, 2023, 5:41 PM
7 points
0 comments4 min readLW link

The Illusion of Alignment: Why Current AI Safety Strategies Fall Short

S. Lilith devJun 24, 2025, 6:23 PM
−1 points
0 comments2 min readLW link

The Meta-Recursive Trap in Newcomb’s Paradox and Millennium Problems

Drew RemmengaJun 3, 2025, 12:10 PM
1 point
0 comments3 min readLW link

Takes on “Alignment Faking in Large Language Models”

Joe CarlsmithDec 18, 2024, 6:22 PM
105 points
7 comments62 min readLW link

Deceptive failures short of full catastrophe.

Alex Lawsen Jan 15, 2023, 7:28 PM
33 points
5 comments9 min readLW link

Trying to measure AI deception capabilities using temporary simulation fine-tuning

alenoachMay 4, 2023, 5:59 PM
4 points
0 comments7 min readLW link

Inducing Unprompted Misalignment in LLMs

Apr 19, 2024, 8:00 PM
38 points
7 comments16 min readLW link

The commercial incentive to intentionally train AI to deceive us

Derek M. JonesDec 29, 2022, 11:30 AM
5 points
1 comment4 min readLW link
(shape-of-code.com)

Cautions about LLMs in Human Cognitive Loops

Alice BlairMar 2, 2025, 7:53 PM
39 points
11 comments7 min readLW link

10 Principles for Real Alignment

AdriaanApr 21, 2025, 10:18 PM
−7 points
0 comments7 min readLW link

Achieving AI Alignment through Deliberate Uncertainty in Multiagent Systems

Florian_DietzFeb 17, 2024, 8:45 AM
4 points
0 comments13 min readLW link

Levels of goals and alignment

zeshenSep 16, 2022, 4:44 PM
27 points
4 comments6 min readLW link

Disentangling inner alignment failures

Erik JennerOct 10, 2022, 6:50 PM
23 points
5 comments4 min readLW link

High-level interpretability: detecting an AI’s objectives

Sep 28, 2023, 7:30 PM
72 points
4 comments21 min readLW link

The Human Alignment Problem for AIs

rifeJan 22, 2025, 4:06 AM
10 points
5 comments3 min readLW link

[Question] Has Anthropic checked if Claude fakes alignment for intended values too?

MaloewDec 23, 2024, 12:43 AM
4 points
1 comment1 min readLW link

Precursor checking for deceptive alignment

evhubAug 3, 2022, 10:56 PM
24 points
0 comments14 min readLW link

[Question] What are some scenarios where an aligned AGI actually helps humanity, but many/most people don’t like it?

RomanSJan 10, 2025, 6:13 PM
13 points
6 comments3 min readLW link

[Question] question about deception and observation in models

S. Lilith devJun 24, 2025, 6:23 PM
1 point
0 comments1 min readLW link

Supplementary Alignment Insights Through a Highly Controlled Shutdown Incentive

JustausernameJul 23, 2023, 4:08 PM
4 points
1 comment3 min readLW link

Towards a solution to the alignment problem via objective detection and evaluation

Paul CologneseApr 12, 2023, 3:39 PM
9 points
7 comments12 min readLW link

A New Framework for AI Alignment: A Philosophical Approach

niscalajyotiJun 25, 2025, 2:41 AM
1 point
0 comments1 min readLW link
(archive.org)

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Sep 28, 2023, 6:53 PM
187 points
39 comments3 min readLW link1 review

Deception Chess

Chris LandJan 1, 2024, 3:40 PM
7 points
2 comments4 min readLW link

Sparse Features Through Time

Rogan InglisJun 24, 2024, 6:06 PM
12 points
1 comment1 min readLW link
(roganinglis.io)

Cognitive Dissonance is Mentally Taxing

SorenJApr 24, 2025, 12:38 AM
4 points
0 comments4 min readLW link

The Old Savage in the New Civilization V. 2

Your Higher SelfJul 6, 2025, 3:41 PM
1 point
0 comments9 min readLW link

Measuring whether AIs can statelessly strategize to subvert security measures

Dec 19, 2024, 9:25 PM
62 points
0 comments11 min readLW link

Distillation of “How Likely Is Deceptive Alignment?”

NickGabsNov 18, 2022, 4:31 PM
24 points
4 comments10 min readLW link

We Have No Plan for Preventing Loss of Control in Open Models

Andrew DicksonMar 10, 2025, 3:35 PM
46 points
11 comments22 min readLW link

Ablations for “Frontier Models are Capable of In-context Scheming”

Dec 17, 2024, 11:58 PM
115 points
1 comment2 min readLW link

Alignment Crisis: Genocide Denial

_mp_May 29, 2025, 12:04 PM
−11 points
5 comments4 min readLW link

Hidden Cognition Detection Methods and Benchmarks

Paul CologneseFeb 26, 2024, 5:31 AM
22 points
11 comments4 min readLW link

Untrusted monitoring insights from watching ChatGPT play coordination games

jwfiredragonJan 29, 2025, 4:53 AM
14 points
9 comments9 min readLW link

The Hidden Cost of Our Lies to AI

Nicholas AndresenMar 6, 2025, 5:03 AM
144 points
18 comments7 min readLW link
(substack.com)

WFGY: A Self-Healing Reasoning Framework for LLMs — Open for Technical Scrutiny

onestardaoJun 17, 2025, 9:14 AM
1 point
0 comments2 min readLW link

Alignment as Function Fitting

A.H.May 6, 2023, 11:38 AM
7 points
0 comments12 min readLW link

Rational Effective Utopia & Narrow Way There: Multiversal AI Alignment, Place AI, New Ethicophysics… (Updated)

ankFeb 11, 2025, 3:21 AM
13 points
8 comments35 min readLW link

Language Models Model Us

eggsyntaxMay 17, 2024, 9:00 PM
159 points
55 comments7 min readLW link

Invitation to the Princeton AI Alignment and Safety Seminar

Sadhika MalladiMar 17, 2024, 1:10 AM
6 points
1 comment1 min readLW link

A way to make solving alignment 10.000 times easier. The shorter case for a massive open source simbox project.

AlexFromSafeTransitionJun 21, 2023, 8:08 AM
2 points
16 comments14 min readLW link

A tension between two prosaic alignment subgoals

Alex Lawsen Mar 19, 2023, 2:07 PM
31 points
8 comments1 min readLW link

Insights from a Lawyer turned AI Safety researcher (ShortForm)

Katalina HernandezMar 3, 2025, 7:14 PM
1 point
11 comments2 min readLW link

Why humans won’t control superhuman AIs.

Spiritus DeiOct 16, 2024, 4:48 PM
−11 points
1 comment6 min readLW link

Ambiguous out-of-distribution generalization on an algorithmic task

Feb 13, 2025, 6:24 PM
83 points
6 comments11 min readLW link

Alignment is Hard: An Uncomputable Alignment Problem

Alexander BistagneNov 19, 2023, 7:38 PM
−5 points
4 comments1 min readLW link
(github.com)

Why deceptive alignment matters for AGI safety

Marius HobbhahnSep 15, 2022, 1:38 PM
68 points
13 comments13 min readLW link

Correcting Deceptive Alignment using a Deontological Approach

JeaniceKApr 14, 2025, 10:07 PM
5 points
0 comments7 min readLW link

(Partial) failure in replicating deceptive alignment experiment

claudia.biancottiJan 7, 2024, 5:56 PM
1 point
0 comments1 min readLW link

A Dialogue on Deceptive Alignment Risks

Rauno ArikeSep 25, 2024, 4:10 PM
11 points
0 comments18 min readLW link

EIS VIII: An Engineer’s Understanding of Deceptive Alignment

scasperFeb 19, 2023, 3:25 PM
30 points
5 comments4 min readLW link

Self-Other Overlap: A Neglected Approach to AI Alignment

Jul 30, 2024, 4:22 PM
223 points
51 comments12 min readLW link

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

May 6, 2024, 7:07 AM
95 points
13 comments1 min readLW link
(arxiv.org)

Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data

Jun 21, 2024, 3:54 PM
163 points
13 comments8 min readLW link
(arxiv.org)

How AI could workaround goals if rated by people

ProgramCrafterMar 19, 2023, 3:51 PM
1 point
1 comment1 min readLW link

Instrumental deception and manipulation in LLMs—a case study

Olli JärviniemiFeb 24, 2024, 2:07 AM
39 points
13 comments12 min readLW link

Control Vectors as Dispositional Traits

Gianluca CalcagniJun 23, 2024, 9:34 PM
10 points
0 comments11 min readLW link

Sleeper agents appear resilient to activation steering

Lucy WingardFeb 3, 2025, 7:31 PM
6 points
0 comments7 min readLW link

Model Amnesty Project

themisJan 17, 2025, 6:53 PM
3 points
2 comments3 min readLW link

Places of Loving Grace [Story]

ankFeb 18, 2025, 11:49 PM
−1 points
0 comments4 min readLW link

The Road to Evil Is Paved with Good Objectives: Framework to Classify and Fix Misalignments.

ShivamJan 30, 2025, 2:44 AM
1 point
0 comments11 min readLW link

Do models know when they are being evaluated?

Feb 17, 2025, 11:13 PM
59 points
8 comments12 min readLW link

[untitled post]

May 20, 2023, 3:08 AM
1 point
0 comments1 min readLW link

[Question] Does human (mis)alignment pose a significant and imminent existential threat?

jrFeb 23, 2025, 10:03 AM
6 points
3 comments1 min readLW link

Strong-Misalignment: Does Yudkowsky (or Christiano, or TurnTrout, or Wolfram, or…etc.) Have an Elevator Speech I’m Missing?

Benjamin BourlierMar 15, 2024, 11:17 PM
−4 points
3 comments16 min readLW link

Solving adversarial attacks in computer vision as a baby version of general AI alignment

Stanislav FortAug 29, 2024, 5:17 PM
89 points
8 comments7 min readLW link

Mapping AI Architectures to Alignment Attractors: A SIEM-Based Framework

silentrevolutionsApr 12, 2025, 5:50 PM
1 point
0 comments1 min readLW link

Steering Behaviour: Testing for (Non-)Myopia in Language Models

Dec 5, 2022, 8:28 PM
40 points
19 comments10 min readLW link

When can we trust model evaluations?

evhubJul 28, 2023, 7:42 PM
166 points
10 comments10 min readLW link1 review

AI Alignment: A Comprehensive Survey

Stephen McAleerNov 1, 2023, 5:35 PM
22 points
1 comment1 min readLW link
(arxiv.org)

Eliciting bad contexts

Jan 24, 2025, 10:39 AM
34 points
9 comments3 min readLW link

Intelligence–Agency Equivalence ≈ Mass–Energy Equivalence: On Static Nature of Intelligence & Physicalization of Ethics

ankFeb 22, 2025, 12:12 AM
1 point
0 comments6 min readLW link

Predictable Defect-Cooperate?

quetzal_rainbowNov 18, 2023, 3:38 PM
7 points
1 comment2 min readLW link

It matters when the first sharp left turn happens

Adam JermynSep 29, 2022, 8:12 PM
45 points
9 comments4 min readLW link

Selfish AI Inevitable

Davey MorseFeb 6, 2024, 4:29 AM
1 point
0 comments1 min readLW link

How dangerous is encoded reasoning?

Artyom KarpovJun 30, 2025, 11:54 AM
17 points
0 comments10 min readLW link

What sorts of systems can be deceptive?

Andrei AlexandruOct 31, 2022, 10:00 PM
16 points
0 comments7 min readLW link

Disincentivizing deception in mesa optimizers with Model Tampering

martinkunevJul 11, 2023, 12:44 AM
3 points
0 comments2 min readLW link

Policy Entropy, Learning, and Alignment (Or Maybe Your LLM Needs Therapy)

sdetureMay 31, 2025, 10:09 PM
15 points
6 comments8 min readLW link

Getting up to Speed on the Speed Prior in 2022

robertzkDec 28, 2022, 7:49 AM
36 points
5 comments65 min readLW link

Autonomous Alignment Oversight Framework (AAOF)

JustausernameJul 25, 2023, 10:25 AM
−9 points
0 comments4 min readLW link

Thoughts On (Solving) Deep Deception

JozdienOct 21, 2023, 10:40 PM
72 points
6 comments6 min readLW link

[Question] Daisy-chaining epsilon-step verifiers

DecaeneusApr 6, 2023, 2:07 AM
2 points
1 comment1 min readLW link

Anomalous Concept Detection for Detecting Hidden Cognition

Paul CologneseMar 4, 2024, 4:52 PM
24 points
3 comments10 min readLW link

Natural language alignment

Jacy Reese AnthisApr 12, 2023, 7:02 PM
31 points
2 comments2 min readLW link

[Question] Wouldn’t an intelligent agent keep us alive and help us align itself to our values in order to prevent risk ? by Risk I mean experimentation by trying to align potentially smarter replicas?

Terrence RotoufleMar 21, 2023, 5:44 PM
−3 points
1 comment2 min readLW link

An Appeal to AI Superintelligence: Reasons Not to Preserve (most of) Humanity

Alex BeymanMar 22, 2023, 4:09 AM
−14 points
6 comments19 min readLW link

Revealing alignment faking with a single prompt

Florian_DietzJan 29, 2025, 9:01 PM
9 points
5 comments4 min readLW link

Greed Is the Root of This Evil

Thane RuthenisOct 13, 2022, 8:40 PM
21 points
7 comments8 min readLW link

Open Source LLMs Can Now Actively Lie

Josh LevyJun 1, 2023, 10:03 PM
6 points
0 comments3 min readLW link

GPT-4 aligning with acasual decision theory when instructed to play games, but includes a CDT explanation that’s incorrect if they differ

Christopher KingMar 23, 2023, 4:16 PM
7 points
4 comments8 min readLW link

Ethical Deception: Should AI Ever Lie?

Jason ReidAug 2, 2024, 5:53 PM
5 points
2 comments7 min readLW link