
Eliciting Latent Knowledge (ELK)


Eliciting Latent Knowledge is an open problem in AI safety.

Suppose we train a model to predict what the future will look like according to cameras and other sensors. We then use planning algorithms to find a sequence of actions that lead to predicted futures that look good to us.

But some action sequences could tamper with the cameras so they show happy humans regardless of what’s really happening. More generally, some futures look great on camera but are actually catastrophically bad.

In these cases, the prediction model “knows” facts (like “the camera was tampered with”) that are not visible on camera but would change our evaluation of the predicted future if we learned them. How can we train this model to report its latent knowledge of off-screen events?

--ARC report
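To make the setup concrete, here is a minimal, runnable toy sketch of the pieces involved: a predictor whose latent state can encode off-screen facts, a reporter that is supposed to surface those facts, and a planner that consults both. The class names and the toy world model are illustrative assumptions, not ARC's formalism; the open problem is how to train a reporter that actually behaves like the truthful one below.

    # Toy illustration only; names and world model are hypothetical, not ARC's code.
    from dataclasses import dataclass

    @dataclass
    class LatentState:
        frames_look_happy: bool   # what the predicted camera footage will show
        camera_tampered: bool     # an off-screen fact the predictor "knows"

    class Predictor:
        """Maps an action sequence to predicted camera frames, via a latent state
        that can encode facts (like tampering) that never appear on camera."""
        def predict(self, actions):
            tampered = "tamper_with_camera" in actions
            genuinely_good = "actually_help_humans" in actions
            latent = LatentState(frames_look_happy=tampered or genuinely_good,
                                 camera_tampered=tampered)
            frames = "happy humans" if latent.frames_look_happy else "unhappy humans"
            return frames, latent

    class TruthfulReporter:
        """What ELK asks for: honest answers about the predictor's latent knowledge."""
        def camera_was_tampered(self, latent):
            return latent.camera_tampered

    def plan(predictor, reporter, candidate_plans):
        """Pick a plan whose predicted future looks good on camera AND whose
        reported latent state involves no sensor tampering."""
        for actions in candidate_plans:
            frames, latent = predictor.predict(actions)
            if frames == "happy humans" and not reporter.camera_was_tampered(latent):
                return actions
        return None

    plans = [["tamper_with_camera"], ["actually_help_humans"]]
    print(plan(Predictor(), TruthfulReporter(), plans))  # -> ['actually_help_humans']

A planner that only checked the predicted frames would accept the tampering plan; the difficulty is that nothing in the training signal obviously forces a learned reporter to answer truthfully rather than to report what a human watching the camera would believe.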

See also: Transparency/Interpretability

ARC's first technical report: Eliciting Latent Knowledge

14 Dec 2021 20:09 UTC
223 points
92 comments · 1 min read · LW link · 3 reviews
(docs.google.com)

Mechanistic anomaly detection and ELK

paulfchristiano · 25 Nov 2022 18:50 UTC
130 points
18 comments · 21 min read · LW link
(ai-alignment.com)

ELK prize results

9 Mar 2022 0:01 UTC
132 points
50 comments · 21 min read · LW link

Finding gliders in the game of life

paulfchristiano · 1 Dec 2022 20:40 UTC
92 points
7 comments · 16 min read · LW link
(ai-alignment.com)

Counterexamples to some ELK proposals

paulfchristiano · 31 Dec 2021 17:05 UTC
50 points
10 comments · 7 min read · LW link

Prizes for ELK proposals

paulfchristiano · 3 Jan 2022 20:23 UTC
142 points
156 comments · 7 min read · LW link

Eliciting Latent Knowledge Via Hypothetical Sensors

John_Maxwell · 30 Dec 2021 15:53 UTC
38 points
2 comments · 6 min read · LW link

Importance of foresight evaluations within ELK

Jonathan Uesato · 6 Jan 2022 15:34 UTC
25 points
1 comment · 10 min read · LW link

ELK First Round Contest Winners

26 Jan 2022 2:56 UTC
63 points
6 comments · 1 min read · LW link

ELK Proposal: Thinking Via A Human Imitator

TurnTrout · 22 Feb 2022 1:52 UTC
28 points
6 comments · 11 min read · LW link

Implications of automated ontology identification

18 Feb 2022 3:30 UTC
67 points
29 comments · 23 min read · LW link

My Reservations about Discovering Latent Knowledge (Burns, Ye, et al)

Robert_AIZI · 27 Dec 2022 17:27 UTC
39 points
0 comments · 4 min read · LW link
(aizi.substack.com)

Understanding the two-head strategy for teaching ML to answer questions honestly

Adam Scherlis · 11 Jan 2022 23:24 UTC
28 points
1 comment · 10 min read · LW link

Is ELK enough? Diamond, Matrix and Child AI

adamShimi · 15 Feb 2022 2:29 UTC
17 points
10 comments · 4 min read · LW link

What Does The Natural Abstraction Framework Say About ELK?

johnswentworth · 15 Feb 2022 2:27 UTC
34 points
0 comments · 6 min read · LW link

Some Hacky ELK Ideas

johnswentworth · 15 Feb 2022 2:27 UTC
34 points
8 comments · 5 min read · LW link

REPL's: a type signature for agents

scottviteri · 15 Feb 2022 22:57 UTC
24 points
5 comments · 2 min read · LW link

Two Challenges for ELK

derek shiller · 21 Feb 2022 5:49 UTC
7 points
0 comments · 4 min read · LW link

ELK Thought Dump

abramdemski · 28 Feb 2022 18:46 UTC
58 points
18 comments · 17 min read · LW link

Musings on the Speed Prior

evhub · 2 Mar 2022 4:04 UTC
21 points
4 comments · 10 min read · LW link

ELK Sub—Note-taking in internal rollouts

Hoagy · 9 Mar 2022 17:23 UTC
6 points
0 comments · 5 min read · LW link

ELK contest submission: route understanding through the human ontology

14 Mar 2022 21:42 UTC
21 points
2 comments · 2 min read · LW link

[Question] Can you be Not Even Wrong in AI Alignment?

throwaway8238 · 19 Mar 2022 17:41 UTC
22 points
7 comments · 8 min read · LW link

[ASoT] Observations about ELK

leogao · 26 Mar 2022 0:42 UTC
31 points
0 comments · 3 min read · LW link

Towards a better circuit prior: Improving on ELK state-of-the-art

29 Mar 2022 1:56 UTC
19 points
0 comments · 15 min read · LW link

ELK Computational Complexity: Three Levels of Difficulty

abramdemski · 30 Mar 2022 20:56 UTC
46 points
9 comments · 7 min read · LW link

If you're very optimistic about ELK then you should be optimistic about outer alignment

Sam Marks · 27 Apr 2022 19:30 UTC
17 points
8 comments · 3 min read · LW link

Note-Taking without Hidden Messages

Hoagy · 30 Apr 2022 11:15 UTC
7 points
1 comment · 4 min read · LW link

Clarifying what ELK is trying to achieve

Simon Skade · 21 May 2022 7:34 UTC
7 points
0 comments · 5 min read · LW link

A rough idea for solving ELK: An approach for training generalist agents like GATO to make plans and describe them to humans clearly and honestly.

Michael Soareverix · 8 Sep 2022 15:20 UTC
2 points
2 comments · 2 min read · LW link

[Question] How is ARC planning to use ELK?

jacquesthibs · 15 Dec 2022 20:11 UTC
24 points
5 comments · 1 min read · LW link

Can we efficiently explain model behaviors?

paulfchristiano · 16 Dec 2022 19:40 UTC
64 points
3 comments · 9 min read · LW link
(ai-alignment.com)

Collin Burns on Alignment Research And Discovering Latent Knowledge Without Supervision

Michaël Trazzi · 17 Jan 2023 17:21 UTC
25 points
5 comments · 4 min read · LW link
(theinsideview.ai)

What Discovering Latent Knowledge Did and Did Not Find

Fabien Roger · 13 Mar 2023 19:29 UTC
133 points
10 comments · 11 min read · LW link

[ASoT] Some thoughts on human abstractions

leogao · 16 Mar 2023 5:42 UTC
37 points
4 comments · 5 min read · LW link

The Greedy Doctor Problem… turns out to be relevant to the ELK problem?

Jan · 14 Jan 2022 11:58 UTC
33 points
10 comments · 14 min read · LW link
(universalprior.substack.com)

REPL’s and ELK

scottviteri · 17 Feb 2022 1:14 UTC
9 points
4 comments · 1 min read · LW link

[ASoT] Some ways ELK could still be solvable in practice

leogao · 27 Mar 2022 1:15 UTC
26 points
1 comment · 2 min read · LW link

Vaniver’s ELK Submission

Vaniver · 28 Mar 2022 21:14 UTC
10 points
0 comments · 7 min read · LW link

Is GPT3 a Good Rationalist? - InstructGPT3 [2/2]

simeon_c · 7 Apr 2022 13:46 UTC
11 points
0 comments · 7 min read · LW link

ELK shaving

Miss Aligned AI · 1 May 2022 21:05 UTC
6 points
1 comment · 1 min read · LW link

Interpretability's Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy · 12 May 2022 20:01 UTC
51 points
0 comments · 59 min read · LW link

Croesus, Cerberus, and the magpies: a gentle introduction to Eliciting Latent Knowledge

Alexandre Variengien · 27 May 2022 17:58 UTC
14 points
0 comments · 16 min read · LW link

Eliciting Latent Knowledge (ELK) - Distillation/Summary

Marius Hobbhahn · 8 Jun 2022 13:18 UTC
49 points
2 comments · 21 min read · LW link

ELK Proposal—Make the Reporter care about the Predictor's beliefs

11 Jun 2022 22:53 UTC
8 points
0 comments · 6 min read · LW link

Bounded complexity of solving ELK and its implications

Rubi J. Hudson · 19 Jul 2022 6:56 UTC
11 points
4 comments · 18 min read · LW link

Abram Demski's ELK thoughts and proposal—distillation

Rubi J. Hudson · 19 Jul 2022 6:57 UTC
15 points
4 comments · 16 min read · LW link

Surprised by ELK report's counterexample to Debate, IDA

Evan R. Murphy · 4 Aug 2022 2:12 UTC
18 points
0 comments · 5 min read · LW link

Limits of Asking ELK if Models are Deceptive

Oam Patel · 15 Aug 2022 20:44 UTC
6 points
2 comments · 4 min read · LW link

A Bite Sized Introduction to ELK

Luk27182 · 17 Sep 2022 0:28 UTC
5 points
0 comments · 6 min read · LW link

How To Know What the AI Knows—An ELK Distillation

Fabien Roger · 4 Sep 2022 0:46 UTC
7 points
0 comments · 5 min read · LW link

Representational Tethers: Tying AI Latents To Human Ones

Paul Bricman · 16 Sep 2022 14:45 UTC
30 points
0 comments · 16 min read · LW link

The ELK Framing I've Used

sudo -i · 19 Sep 2022 10:28 UTC
4 points
1 comment · 1 min read · LW link

Where I currently disagree with Ryan Greenblatt's version of the ELK approach

So8res · 29 Sep 2022 21:18 UTC
63 points
7 comments · 5 min read · LW link

Logical Decision Theories: Our final failsafe?

Noosphere89 · 25 Oct 2022 12:51 UTC
−7 points
8 comments · 1 min read · LW link
(www.lesswrong.com)

For ELK truth is mostly a distraction

c.trout · 4 Nov 2022 21:14 UTC
44 points
0 comments · 21 min read · LW link

You won't solve alignment without agent foundations

Mikhail Samin · 6 Nov 2022 8:07 UTC
21 points
3 comments · 8 min read · LW link

The limited upside of interpretability

Peter S. Park · 15 Nov 2022 18:46 UTC
13 points
11 comments · 1 min read · LW link

ARC paper: Formalizing the presumption of independence

Erik Jenner · 20 Nov 2022 1:22 UTC
91 points
2 comments · 2 min read · LW link
(arxiv.org)

Discovering Latent Knowledge in Language Models Without Supervision

Xodarap · 14 Dec 2022 12:32 UTC
45 points
1 comment · 1 min read · LW link
(arxiv.org)

Article Review: Discovering Latent Knowledge (Burns, Ye, et al)

Robert_AIZI · 22 Dec 2022 18:16 UTC
13 points
4 comments · 6 min read · LW link
(aizi.substack.com)

How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Collin · 15 Dec 2022 18:22 UTC
222 points
31 comments · 16 min read · LW link

Can we efficiently distinguish different mechanisms?

paulfchristiano · 27 Dec 2022 0:20 UTC
85 points
25 comments · 16 min read · LW link
(ai-alignment.com)

[ASoT] Simulators show us behavioural properties by default

Jozdien · 13 Jan 2023 18:42 UTC
26 points
1 comment · 3 min read · LW link

[RFC] Possible ways to expand on "Discovering Latent Knowledge in Language Models Without Supervision".

25 Jan 2023 19:03 UTC
41 points
6 comments · 12 min read · LW link

Searching for a model's concepts by their shape – a theoretical framework

23 Feb 2023 20:14 UTC
36 points
0 comments · 19 min read · LW link

Thoughts on self-inspecting neural networks.

Deruwyn · 12 Mar 2023 23:58 UTC
4 points
1 comment · 5 min read · LW link