Owain_Evans

Karma: 4,150

https://owainevans.github.io/

Weird Generalization & Inductive Backdoors

11 Dec 2025 18:18 UTC
146 points
7 comments · 8 min read · LW link

Lessons from Studying Two-Hop Latent Reasoning

11 Sep 2025 17:53 UTC
68 points
16 comments · 2 min read · LW link
(arxiv.org)

Harmless reward hacks can generalize to misalignment in LLMs

26 Aug 2025 17:32 UTC
52 points
7 comments · 7 min read · LW link

Concept Poisoning: Probing LLMs without probes

5 Aug 2025 17:00 UTC
60 points
5 comments · 13 min read · LW link

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

22 Jul 2025 16:37 UTC
343 points
39 comments · 4 min read · LW link

Backdoor awareness and misaligned personas in reasoning models

20 Jun 2025 23:38 UTC
34 points
8 comments · 6 min read · LW link

Thought Crime: Backdoors & Emergent Misalignment in Reasoning Models

16 Jun 2025 16:43 UTC
68 points
2 comments · 8 min read · LW link

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

25 Feb 2025 17:39 UTC
334 points
92 comments · 4 min read · LW link

Tell me about yourself: LLMs are aware of their learned behaviors

22 Jan 2025 0:47 UTC
132 points
5 comments · 6 min read · LW link

New, improved multiple-choice TruthfulQA

15 Jan 2025 23:32 UTC
72 points
1 comment · 3 min read · LW link

Inference-Time-Compute: More Faithful? A Research Note

15 Jan 2025 4:43 UTC
69 points
10 comments · 11 min read · LW link

Tips On Empirical Research Slides

8 Jan 2025 5:06 UTC
97 points
4 comments · 6 min read · LW link

LLMs can learn about themselves by introspection

18 Oct 2024 16:12 UTC
109 points
38 comments · 9 min read · LW link

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

8 Jul 2024 22:24 UTC
109 points
39 comments · 5 min read · LW link

Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data

21 Jun 2024 15:54 UTC
163 points
13 comments · 8 min read · LW link
(arxiv.org)

How do LLMs give truthful answers? A discussion of LLM vs. human reasoning, ensembles & parrots

28 Mar 2024 2:34 UTC
27 points
0 comments · 9 min read · LW link

Paper: Tell, Don’t Show: Declarative facts influence how LLMs generalize

19 Dec 2023 19:14 UTC
45 points
4 comments · 6 min read · LW link
(arxiv.org)

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

28 Sep 2023 18:53 UTC
187 points
39 comments · 3 min read · LW link · 1 review

Paper: LLMs trained on “A is B” fail to learn “B is A”

23 Sep 2023 19:55 UTC
121 points
74 comments · 4 min read · LW link
(arxiv.org)

Paper: On measuring situational awareness in LLMs

4 Sep 2023 12:54 UTC
109 points
17 comments · 5 min read · LW link
(arxiv.org)