Owain_Evans

Karma: 4,150

https://owainevans.github.io/

Weird Generalization & Inductive Backdoors

11 Dec 2025 18:18 UTC
146 points
7 comments · 8 min read · LW link

Lessons from Studying Two-Hop Latent Reasoning

11 Sep 2025 17:53 UTC
68 points
16 comments · 2 min read · LW link
(arxiv.org)

Harmless reward hacks can generalize to misalignment in LLMs

26 Aug 2025 17:32 UTC
52 points
7 comments · 7 min read · LW link

Concept Poisoning: Probing LLMs without probes

5 Aug 2025 17:00 UTC
60 points
5 comments · 13 min read · LW link

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

22 Jul 2025 16:37 UTC
343 points
39 comments · 4 min read · LW link

Backdoor awareness and misaligned personas in reasoning models

20 Jun 2025 23:38 UTC
34 points
8 comments · 6 min read · LW link

Thought Crime: Backdoors & Emergent Misalignment in Reasoning Models

16 Jun 2025 16:43 UTC
68 points
2 comments · 8 min read · LW link

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

25 Feb 2025 17:39 UTC
334 points
92 comments · 4 min read · LW link

Tell me about yourself: LLMs are aware of their learned behaviors

22 Jan 2025 0:47 UTC
132 points
5 comments · 6 min read · LW link

New, improved multiple-choice TruthfulQA

15 Jan 2025 23:32 UTC
72 points
1 comment · 3 min read · LW link

Inference-Time-Compute: More Faithful? A Research Note

15 Jan 2025 4:43 UTC
69 points
10 comments · 11 min read · LW link

Tips On Empirical Research Slides

8 Jan 2025 5:06 UTC
97 points
4 comments · 6 min read · LW link

LLMs can learn about themselves by introspection

18 Oct 2024 16:12 UTC
109 points
38 comments · 9 min read · LW link

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

8 Jul 2024 22:24 UTC
109 points
39 comments · 5 min read · LW link

Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data

21 Jun 2024 15:54 UTC
163 points
13 comments · 8 min read · LW link
(arxiv.org)

How do LLMs give truthful answers? A discussion of LLM vs. human reasoning, ensembles & parrots

28 Mar 2024 2:34 UTC
27 points
0 comments · 9 min read · LW link

Paper: Tell, Don’t Show: Declarative facts influence how LLMs generalize

19 Dec 2023 19:14 UTC
45 points
4 comments · 6 min read · LW link
(arxiv.org)

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

28 Sep 2023 18:53 UTC
187 points
39 comments · 3 min read · LW link · 1 review

Paper: LLMs trained on “A is B” fail to learn “B is A”

23 Sep 2023 19:55 UTC
121 points
74 comments · 4 min read · LW link
(arxiv.org)

Paper: On measuring situational awareness in LLMs

4 Sep 2023 12:54 UTC
109 points
17 comments · 5 min read · LW link
(arxiv.org)