Deceptive Alignment

Last edit: 8 Feb 2023 14:24 UTC by Roman Leventov

Deceptive Alignment is when an AI that is not actually aligned temporarily acts aligned in order to deceive its creators or its training process. It presumably does this to avoid being shut down or retrained, and to gain access to the power that its creators would give an aligned AI.

See also: Mesa-optimization, Treacherous Turn, Eliciting Latent Knowledge, Deception

Deceptive Alignment

5 Jun 2019 20:16 UTC
100 points
19 comments · 17 min read · LW link

The Waluigi Effect (mega-post)

Cleo Nardo · 3 Mar 2023 3:22 UTC
586 points
175 comments · 16 min read · LW link

How likely is deceptive alignment?

evhub · 30 Aug 2022 19:34 UTC
89 points
22 comments · 60 min read · LW link

Trying to Make a Treacherous Mesa-Optimizer

MadHatter · 9 Nov 2022 18:07 UTC
91 points
14 comments · 4 min read · LW link

Sticky goals: a concrete experiment for understanding deceptive alignment

evhub · 2 Sep 2022 21:57 UTC
35 points
13 comments · 3 min read · LW link

Does SGD Produce Deceptive Alignment?

Mark Xu · 6 Nov 2020 23:48 UTC
85 points
9 comments · 16 min read · LW link

Monitoring for deceptive alignment

evhub · 8 Sep 2022 23:07 UTC
122 points
8 comments · 9 min read · LW link

Evaluations project @ ARC is hiring a researcher and a webdev/engineer

Beth Barnes · 9 Sep 2022 22:46 UTC
98 points
7 comments · 10 min read · LW link

The Defender’s Advantage of Interpretability

Marius Hobbhahn · 14 Sep 2022 14:05 UTC
41 points
4 comments · 6 min read · LW link

Order Matters for Deceptive Alignment

DavidW · 15 Feb 2023 19:56 UTC
47 points
16 comments · 7 min read · LW link

Incentives and Selection: A Missing Frame From AI Threat Discussions?

DragonGod · 26 Feb 2023 1:18 UTC
11 points
16 comments · 2 min read · LW link

Framings of Deceptive Alignment

peterbarnett · 26 Apr 2022 4:25 UTC
26 points
6 comments · 5 min read · LW link

Precursor checking for deceptive alignment

evhub · 3 Aug 2022 22:56 UTC
22 points
0 comments · 14 min read · LW link

Why deceptive alignment matters for AGI safety

Marius Hobbhahn · 15 Sep 2022 13:38 UTC
48 points
13 comments · 13 min read · LW link

Levels of goals and alignment

zeshen · 16 Sep 2022 16:44 UTC
27 points
4 comments · 6 min read · LW link

It matters when the first sharp left turn happens

Adam Jermyn · 29 Sep 2022 20:12 UTC
35 points
9 comments · 4 min read · LW link

Smoke without fire is scary

Adam Jermyn · 4 Oct 2022 21:08 UTC
49 points
22 comments · 4 min read · LW link

Disentangling inner alignment failures

Erik Jenner · 10 Oct 2022 18:50 UTC
14 points
5 comments · 4 min read · LW link

Greed Is the Root of This Evil

Thane Ruthenis · 13 Oct 2022 20:40 UTC
21 points
4 comments · 8 min read · LW link

What sorts of systems can be deceptive?

Andrei Alexandru · 31 Oct 2022 22:00 UTC
15 points
0 comments · 7 min read · LW link

Distillation of “How Likely Is Deceptive Alignment?”

NickGabs · 18 Nov 2022 16:31 UTC
20 points
4 comments · 10 min read · LW link

Steering Behaviour: Testing for (Non-)Myopia in Language Models

5 Dec 2022 20:28 UTC
38 points
17 comments · 10 min read · LW link

Getting up to Speed on the Speed Prior in 2022

robertzk · 28 Dec 2022 7:49 UTC
33 points
5 comments · 65 min read · LW link

The commercial incentive to intentionally train AI to deceive us

Derek M. Jones · 29 Dec 2022 11:30 UTC
5 points
1 comment · 4 min read · LW link

Deceptive failures short of full catastrophe

Alex Lawsen · 15 Jan 2023 19:28 UTC
33 points
5 comments · 9 min read · LW link

Deceptive Alignment is <1% Likely by Default

DavidW · 21 Feb 2023 15:09 UTC
51 points
14 comments · 10 min read · LW link

EIS VIII: An Engineer’s Understanding of Deceptive Alignment

scasper · 19 Feb 2023 15:25 UTC
14 points
4 comments · 4 min read · LW link

How AI could work around goals if rated by people

ProgramCrafter · 19 Mar 2023 15:51 UTC
1 point
1 comment · 1 min read · LW link

An Appeal to AI Superintelligence: Reasons Not to Preserve (most of) Humanity

Alex Beyman · 22 Mar 2023 4:09 UTC
−15 points
6 comments · 19 min read · LW link

A ten­sion be­tween two pro­saic al­ign­ment subgoals

Alex Lawsen 19 Mar 2023 14:07 UTC
30 points
8 comments1 min readLW link

[Question] Wouldn’t an intelligent agent keep us alive and help us align itself to our values in order to prevent risk? By risk I mean experimentation by trying to align potentially smarter replicas.

Terrence Rotoufle · 21 Mar 2023 17:44 UTC
−3 points
1 comment · 2 min read · LW link

GPT-4 aligning with acausal decision theory when instructed to play games, but includes a CDT explanation that’s incorrect if they differ

Christopher King · 23 Mar 2023 16:16 UTC
7 points
4 comments · 8 min read · LW link