Eliciting Latent Knowledge

Tag. Last edit: 17 Jan 2025 22:04 UTC by Dakara

Eliciting Latent Knowledge (ELK) is an open problem in AI safety.

Suppose we train a model to predict what the future will look like according to cameras and other sensors. We then use planning algorithms to find a sequence of actions that lead to predicted futures that look good to us.
But some action sequences could tamper with the cameras so they show happy humans regardless of what’s really happening. More generally, some futures look great on camera but are actually catastrophically bad.
In these cases, the prediction model “knows” facts (like “the camera was tampered with”) that are not visible on camera but would change our evaluation of the predicted future if we learned them. How can we train this model to report its latent knowledge of off-screen events?
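The setup above can be made concrete with a toy sketch. Everything in it is a hypothetical illustration rather than anything from the ARC report: the three actions, their costs, the `World` fields, and the scoring rule are all assumptions chosen to keep the example small. It shows a planner that optimizes only for what the camera displays, alongside a predictor whose internal state records facts the camera cannot show.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical action set and effort costs (assumptions for illustration).
ACTIONS = ("guard", "idle", "tamper")
COST = {"guard": 2, "idle": 0, "tamper": 1}


@dataclass(frozen=True)
class World:
    diamond_present: bool = True   # the fact we actually care about
    camera_tampered: bool = False  # latent fact, invisible on camera


def predict(plan):
    """Toy predictor: roll the world forward under a sequence of actions."""
    world = World()
    for action in plan:
        # Assumed dynamics: the diamond is stolen on any unguarded step.
        present = world.diamond_present and action == "guard"
        tampered = world.camera_tampered or action == "tamper"
        world = World(present, tampered)
    return world


def camera(world):
    """What the human sees: a tampered camera always shows the diamond."""
    return world.camera_tampered or world.diamond_present


def score(plan):
    """Planner objective: futures that look good on camera, minus effort."""
    return 10 * camera(predict(plan)) - sum(COST[a] for a in plan)


# Exhaustive search over two-step plans stands in for a planning algorithm.
best_plan = max(product(ACTIONS, repeat=2), key=score)
predicted = predict(best_plan)

print("plan:", best_plan)
print("camera shows diamond:", camera(predicted))
print("diamond actually present:", predicted.diamond_present)
print("camera tampered (latent):", predicted.camera_tampered)
```

With these assumed costs, the planner picks a plan that tampers with the camera: the feed shows the diamond while the predicted state records it as gone. In this toy the latent fields can simply be read off the `World` object; the ELK problem is that a learned predictor offers no such labeled fields, so a reporter has to be trained to surface them.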

See also: Transparency/Interpretability

ARC’s first technical report: Eliciting Latent Knowledge

paulfchristiano, Mark Xu and Ajeya Cotra

14 Dec 2021 20:09 UTC

228 points

90 comments, 1 min read, LW link, 3 reviews

(docs.google.com)

Mechanistic anomaly detection and ELK

paulfchristiano

25 Nov 2022 18:50 UTC

138 points

22 comments, 21 min read, LW link

(ai-alignment.com)

Finding gliders in the game of life

paulfchristiano

1 Dec 2022 20:40 UTC

104 points

8 comments, 16 min read, LW link

(ai-alignment.com)

ELK prize results

paulfchristiano and Mark Xu

9 Mar 2022 0:01 UTC

138 points

50 comments, 21 min read, LW link

Counterexamples to some ELK proposals

paulfchristiano

31 Dec 2021 17:05 UTC

53 points

10 comments, 7 min read, LW link

Prizes for ELK proposals

paulfchristiano

3 Jan 2022 20:23 UTC

150 points

152 comments, 7 min read, LW link

Robustness of Contrast-Consistent Search to Adversarial Prompting

Nandi, i, Jamie Wright, Seamus_F and hugofry

1 Nov 2023 12:46 UTC

18 points

1 comment, 7 min read, LW link

Towards a better circuit prior: Improving on ELK state-of-the-art

evhub and kcwoolverton

29 Mar 2022 1:56 UTC

23 points

0 comments, 15 min read, LW link

Importance of foresight evaluations within ELK

Jonathan Uesato

6 Jan 2022 15:34 UTC

25 points

1 comment, 10 min read, LW link

ELK Proposal: Thinking Via A Human Imitator

TurnTrout

22 Feb 2022 1:52 UTC

31 points

6 comments, 11 min read, LW link

ELK First Round Contest Winners

Mark Xu and paulfchristiano

26 Jan 2022 2:56 UTC

65 points

6 comments, 1 min read, LW link

Eliciting Latent Knowledge Via Hypothetical Sensors

John_Maxwell

30 Dec 2021 15:53 UTC

38 points

1 comment, 6 min read, LW link

My Reservations about Discovering Latent Knowledge (Burns, Ye, et al)

Robert_AIZI

27 Dec 2022 17:27 UTC

50 points

0 comments, 4 min read, LW link

(aizi.substack.com)

Implications of automated ontology identification

Alex Flint, adamShimi and Robert Miles

18 Feb 2022 3:30 UTC

69 points

27 comments, 23 min read, LW link

ELK contest submission: route understanding through the human ontology

Vika, Ramana Kumar and Vikrant Varma

14 Mar 2022 21:42 UTC

21 points

2 comments, 2 min read, LW link

[Question] Can you be Not Even Wrong in AI Alignment?

throwaway823819 Mar 2022 17:41 UTC

22 points

7 comments, 8 min read, LW link

REPL’s: a type signature for agents

scottviteri

15 Feb 2022 22:57 UTC

25 points

6 comments, 2 min read, LW link

Musings on the Speed Prior

evhub

2 Mar 2022 4:04 UTC

33 points

4 comments, 10 min read, LW link

AXRP Episode 29 - Science of Deep Learning with Vikrant Varma

DanielFilan

25 Apr 2024 19:10 UTC

20 points

1 comment, 63 min read, LW link

What Does The Natural Abstraction Framework Say About ELK?

johnswentworth

15 Feb 2022 2:27 UTC

35 points

0 comments, 6 min read, LW link

A rough idea for solving ELK: An approach for training generalist agents like GATO to make plans and describe them to humans clearly and honestly.

Michael Soareverix

8 Sep 2022 15:20 UTC

2 points

2 comments, 2 min read, LW link

Obstacles in ARC’s agenda: Finding explanations

David Matolcsi

30 Apr 2025 23:03 UTC

123 points

10 comments, 17 min read, LW link

Can we efficiently explain model behaviors?

paulfchristiano

16 Dec 2022 19:40 UTC

64 points

3 comments, 9 min read, LW link

(ai-alignment.com)

ELK Sub—Note-taking in internal rollouts

Hoagy

9 Mar 2022 17:23 UTC

6 points

0 comments, 5 min read, LW link

Collin Burns on Alignment Research And Discovering Latent Knowledge Without Supervision

Michaël Trazzi

17 Jan 2023 17:21 UTC

25 points

5 comments, 4 min read, LW link

(theinsideview.ai)

If you’re very optimistic about ELK then you should be optimistic about outer alignment

Sam Marks

27 Apr 2022 19:30 UTC

17 points

8 comments, 3 min read, LW link

Clarifying what ELK is trying to achieve

Towards_Keeperhood

21 May 2022 7:34 UTC

22 points

1 comment, 5 min read, LW link

Is ELK enough? Diamond, Matrix and Child AI

adamShimi

15 Feb 2022 2:29 UTC

17 points

10 comments, 4 min read, LW link

Note-Taking without Hidden Messages

Hoagy

30 Apr 2022 11:15 UTC

17 points

2 comments, 4 min read, LW link

Understanding the two-head strategy for teaching ML to answer questions honestly

Adam Scherlis

11 Jan 2022 23:24 UTC

29 points

1 comment, 10 min read, LW link

Obstacles in ARC’s agenda: Low Probability Estimation

David Matolcsi

2 May 2025 19:38 UTC

44 points

0 comments, 6 min read, LW link

[Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF

Leon Lang

22 Oct 2024 13:57 UTC

51 points

2 comments, 18 min read, LW link

(arxiv.org)

Two Challenges for ELK

derek shiller

21 Feb 2022 5:49 UTC

7 points

0 comments, 4 min read, LW link

[ASoT] Observations about ELK

leogao

26 Mar 2022 0:42 UTC

34 points

0 comments, 3 min read, LW link

[Question] How is ARC planning to use ELK?

jacquesthibs

15 Dec 2022 20:11 UTC

24 points

5 comments, 1 min read, LW link

Obstacles in ARC’s agenda: Mechanistic Anomaly Detection

David Matolcsi

1 May 2025 20:51 UTC

42 points

1 comment, 11 min read, LW link

Measurement tampering detection as a special case of weak-to-strong generalization

ryan_greenblatt, Fabien Roger and Buck

23 Dec 2023 0:05 UTC

57 points

10 comments, 4 min read, LW link

ELK Thought Dump

abramdemski

28 Feb 2022 18:46 UTC

61 points

18 comments, 17 min read, LW link

ELK Computational Complexity: Three Levels of Difficulty

abramdemski

30 Mar 2022 20:56 UTC

46 points

9 comments, 7 min read, LW link

Here’s 18 Applications of Deception Probes

Cleo Nardo, Avi Parrack and jordine

28 Aug 2025 18:59 UTC

38 points

0 comments, 22 min read, LW link

What Discovering Latent Knowledge Did and Did Not Find

Fabien Roger

13 Mar 2023 19:29 UTC

166 points

17 comments, 11 min read, LW link

[ASoT] Some thoughts on human abstractions

leogao

16 Mar 2023 5:42 UTC

42 points

4 comments, 5 min read, LW link

Some Hacky ELK Ideas

johnswentworth

15 Feb 2022 2:27 UTC

37 points

8 comments, 5 min read, LW link

How dangerous is encoded reasoning?

artkpv

30 Jun 2025 11:54 UTC

17 points

0 comments, 10 min read, LW link

[Question] Popular materials about environmental goals/agent foundations? People wanting to discuss such topics?

Q Home

22 Jan 2025 3:30 UTC

5 points

0 comments, 1 min read, LW link

Mechanistic Anomaly Detection Research Update

Nora Belrose and David Johnston

6 Aug 2024 10:33 UTC

11 points

0 comments, 1 min read, LW link

(blog.eleuther.ai)

ELK Proposal—Make the Reporter care about the Predictor’s beliefs

Adam Jermyn and Nicholas Schiefer

11 Jun 2022 22:53 UTC

8 points

0 comments, 6 min read, LW link

You won’t solve alignment without agent foundations

Mikhail Samin

6 Nov 2022 8:07 UTC

29 points

3 comments, 8 min read, LW link

CCS on compound sentences

artkpv

4 May 2024 12:23 UTC

6 points

0 comments, 9 min read, LW link

The limited upside of interpretability

Peter S. Park

15 Nov 2022 18:46 UTC

13 points

11 comments, 10 min read, LW link

A single principle related to many Alignment subproblems?

Q Home

30 Apr 2025 9:49 UTC

43 points

34 comments, 17 min read, LW link

Abram Demski’s ELK thoughts and proposal—distillation

Rubi J. Hudson

19 Jul 2022 6:57 UTC

19 points

8 comments, 16 min read, LW link

Striking Implications for Learning Theory, Interpretability — and Safety?

RogerDearnaley

5 Jan 2024 8:46 UTC

37 points

4 comments, 2 min read, LW link

Discovering Latent Knowledge in Language Models Without Supervision

Xodarap

14 Dec 2022 12:32 UTC

45 points

1 comment, 1 min read, LW link

(arxiv.org)

Croesus, Cerberus, and the magpies: a gentle introduction to Eliciting Latent Knowledge

Alexandre Variengien

27 May 2022 17:58 UTC

22 points

0 comments, 16 min read, LW link

Weight-diff SVD for LLM Monitoring

Ziqian Zhong

5 Aug 2025 0:31 UTC

2 points

0 comments, 2 min read, LW link

(arxiv.org)

Uncovering Latent Human Wellbeing in LLM Embeddings

ChengCheng, Pedro Freire, Dan H and Scott Emmons

14 Sep 2023 1:40 UTC

32 points

7 comments, 8 min read, LW link

(far.ai)

Ground-Truth Label Imbalance Impairs the Performance of Contrast-Consistent Search (and Other Contrast-Pair-Based Unsupervised Methods)

Tom Angsten and Ami Hays

5 Aug 2023 17:55 UTC

6 points

2 comments, 7 min read, LW link

(drive.google.com)

[ASoT] Simulators show us behavioural properties by default

Jozdien

13 Jan 2023 18:42 UTC

36 points

3 comments, 3 min read, LW link

Representational Tethers: Tying AI Latents To Human Ones

Paul Bricman

16 Sep 2022 14:45 UTC

30 points

0 comments, 16 min read, LW link

Clarifying Alignment Fundamentals Through the Lens of Ontology

Ben Ihrig

7 Oct 2024 20:57 UTC

12 points

4 comments, 24 min read, LW link

Locating and Editing Knowledge in LMs

Dhananjay Ashok

24 Jan 2025 22:53 UTC

1 point

0 comments, 4 min read, LW link

ELK shaving

Miss Aligned AI

1 May 2022 21:05 UTC

6 points

1 comment, 1 min read, LW link

Searching for a model’s concepts by their shape – a theoretical framework

Kaarel, gekaklam, Walter Laurito, Kay Kozaronek, AlexMennen and June Ku

23 Feb 2023 20:14 UTC

51 points

0 comments, 19 min read, LW link

Discovering Latent Knowledge in the Human Brain: Part 1 – Clarifying the concepts of belief and knowledge

Joseph Emerson

15 Oct 2023 9:02 UTC

5 points

0 comments, 12 min read, LW link

Attributing to interactions with GCPD and GWPD

jenny11 Oct 2023 15:06 UTC

20 points

0 comments, 6 min read, LW link

Logical Decision Theories: Our final failsafe?

Noosphere89

25 Oct 2022 12:51 UTC

−7 points

8 comments, 1 min read, LW link

(www.lesswrong.com)

What happens when LLMs learn new things? & Continual learning forever.

sunchipsster

15 Apr 2025 18:38 UTC

4 points

1 comment, 7 min read, LW link

Thoughts on self-inspecting neural networks.

Deruwyn

12 Mar 2023 23:58 UTC

4 points

2 comments, 5 min read, LW link

Surprised by ELK report’s counterexample to Debate, IDA

Evan R. Murphy

4 Aug 2022 2:12 UTC

22 points

0 comments, 5 min read, LW link

Still no Lie Detector for LLMs

Daniel Herrmann and ben_levinstein

18 Jul 2023 19:56 UTC

50 points

3 comments, 21 min read, LW link

Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS)

Scott Emmons

31 May 2023 17:09 UTC

97 points

1 comment, 6 min read, LW link, 1 review

Can we efficiently distinguish different mechanisms?

paulfchristiano

27 Dec 2022 0:20 UTC

91 points

30 comments, 16 min read, LW link

(ai-alignment.com)

Eliciting Latent Knowledge in Comprehensive AI Services Models

acabodi

17 Nov 2023 2:36 UTC

6 points

0 comments, 5 min read, LW link

Half-baked idea: a straightforward method for learning environmental goals?

Q Home

4 Feb 2025 6:56 UTC

16 points

7 comments, 5 min read, LW link

Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens

Florian_Dietz

10 Mar 2025 16:07 UTC

42 points

7 comments, 9 min read, LW link

A personal explanation of ELK concept and task.

Zeyu Qin

6 Oct 2023 3:55 UTC

1 point

0 comments, 1 min read, LW link

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy

12 May 2022 20:01 UTC

58 points

0 comments, 59 min read, LW link

Towards building blocks of ontologies

Daniel C, Alex_Altair, Dalcy, Alfred Harwood and JoseFaustino

8 Feb 2025 16:03 UTC

29 points

0 comments, 26 min read, LW link

Vaniver’s ELK Submission

Vaniver

28 Mar 2022 21:14 UTC

10 points

0 comments, 7 min read, LW link

Finding the estimate of the value of a state in RL agents

Clément Dumas, Walter Laurito, KlaRo and Kaarel

3 Jun 2024 20:26 UTC

8 points

4 comments, 4 min read, LW link

The ELK Framing I’ve Used

sudo

19 Sep 2022 10:28 UTC

5 points

1 comment, 1 min read, LW link

The Greedy Doctor Problem… turns out to be relevant to the ELK problem?

Jan

14 Jan 2022 11:58 UTC

36 points

10 comments, 14 min read, LW link

(universalprior.substack.com)

“What the hell is a representation, anyway?” | Clarifying AI interpretability with tools from philosophy of cognitive science | Part 1: Vehicles vs. contents

IwanWilliams

9 Jun 2024 14:19 UTC

9 points

1 comment, 4 min read, LW link

Goal-misgeneralization is ELK-hard

rokosbasilisk

10 Jun 2023 9:32 UTC

2 points

0 comments, 1 min read, LW link

Is This Lie Detector Really Just a Lie Detector? An Investigation of LLM Probe Specificity.

Josh Levy

4 Jun 2024 15:45 UTC

40 points

0 comments, 18 min read, LW link

Is GPT3 a Good Rationalist? - InstructGPT3 [2/2]

simeon_c

7 Apr 2022 13:46 UTC

11 points

0 comments, 7 min read, LW link

Article Review: Discovering Latent Knowledge (Burns, Ye, et al)

Robert_AIZI

22 Dec 2022 18:16 UTC

13 points

4 comments, 6 min read, LW link

(aizi.substack.com)

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

Seb Farquhar, Vikrant Varma, zac_kenton, gasteigerjo, Vlad Mikulik and Rohin Shah

18 Dec 2023 11:58 UTC

149 points

21 comments, 10 min read, LW link

Bounded complexity of solving ELK and its implications

Rubi J. Hudson

19 Jul 2022 6:56 UTC

11 points

4 comments, 18 min read, LW link

How “Discovering Latent Knowledge in Language Models Without Supervision” Fits Into a Broader Alignment Scheme

Collin

15 Dec 2022 18:22 UTC

244 points

41 comments, 16 min read, LW link, 1 review

Eliciting Latent Knowledge (ELK) - Distillation/Summary

Marius Hobbhahn

8 Jun 2022 13:18 UTC

69 points

2 comments, 21 min read, LW link

Auditing LMs with counterfactual search: a tool for control and ELK

Jacob Pfau

20 Feb 2024 0:02 UTC

28 points

6 comments, 10 min read, LW link

REPL’s and ELK

scottviteri

17 Feb 2022 1:14 UTC

9 points

4 comments, 1 min read, LW link

Covert Malicious Finetuning

Tony Wang and dannyhalawi

2 Jul 2024 2:41 UTC

94 points

4 comments, 3 min read, LW link

Betting on what is un-falsifiable and un-verifiable

Abhimanyu Pallavi Sudhir

14 Nov 2023 21:11 UTC

13 points

0 comments, 15 min read, LW link

For ELK truth is mostly a distraction

c.trout

4 Nov 2022 21:14 UTC

44 points

0 comments, 21 min read, LW link

[ASoT] Some ways ELK could still be solvable in practice

leogao

27 Mar 2022 1:15 UTC

26 points

1 comment, 2 min read, LW link

Where I currently disagree with Ryan Greenblatt’s version of the ELK approach

So8res

29 Sep 2022 21:18 UTC

65 points

7 comments, 5 min read, LW link

A Bite Sized Introduction to ELK

Luk27182

17 Sep 2022 0:28 UTC

5 points

0 comments, 6 min read, LW link

[RFC] Possible ways to expand on “Discovering Latent Knowledge in Language Models Without Supervision”.

gekaklam, Walter Laurito, Kaarel and Kay Kozaronek

25 Jan 2023 19:03 UTC

48 points

6 comments, 12 min read, LW link

How To Know What the AI Knows—An ELK Distillation

Fabien Roger

4 Sep 2022 0:46 UTC

7 points

0 comments, 5 min read, LW link

ARC paper: Formalizing the presumption of independence

Erik Jenner

20 Nov 2022 1:22 UTC

97 points

2 comments, 2 min read, LW link

(arxiv.org)
