RSS

Arthur Conmy

Karma: 1,843

Interpretability

Views my own

Elic­it­ing se­cret knowl­edge from lan­guage models

2 Oct 2025 20:57 UTC
67 points
3 comments2 min readLW link
(arxiv.org)

Dis­cov­er­ing Back­door Triggers

19 Aug 2025 6:24 UTC
57 points
4 comments13 min readLW link

Un­faith­ful chain-of-thought as nudged reasoning

22 Jul 2025 22:35 UTC
54 points
3 comments10 min readLW link

Thought An­chors: Which LLM Rea­son­ing Steps Mat­ter?

2 Jul 2025 20:16 UTC
35 points
6 comments6 min readLW link
(www.thought-anchors.com)

Nega­tive Re­sults for SAEs On Down­stream Tasks and Depri­ori­tis­ing SAE Re­search (GDM Mech In­terp Team Progress Up­date #2)

26 Mar 2025 19:07 UTC
113 points
15 comments29 min readLW link
(deepmindsafetyresearch.medium.com)

The GDM AGI Safety+Align­ment Team is Hiring for Ap­plied In­ter­pretabil­ity Research

24 Feb 2025 2:17 UTC
48 points
1 comment7 min readLW link

SAEBench: A Com­pre­hen­sive Bench­mark for Sparse Autoencoders

11 Dec 2024 6:30 UTC
82 points
6 comments2 min readLW link
(www.neuronpedia.org)

Evolu­tion­ary prompt op­ti­miza­tion for SAE fea­ture visualization

14 Nov 2024 13:06 UTC
22 points
0 comments9 min readLW link

SAEs are highly dataset de­pen­dent: a case study on the re­fusal direction

7 Nov 2024 5:22 UTC
67 points
4 comments14 min readLW link

Open Source Repli­ca­tion of An­thropic’s Cross­coder pa­per for model-diffing

27 Oct 2024 18:46 UTC
48 points
4 comments5 min readLW link

SAE fea­tures for re­fusal and syco­phancy steer­ing vectors

12 Oct 2024 14:54 UTC
29 points
4 comments7 min readLW link

Base LLMs re­fuse too

29 Sep 2024 16:04 UTC
61 points
20 comments10 min readLW link

Ex­tract­ing SAE task fea­tures for in-con­text learning

12 Aug 2024 20:34 UTC
31 points
1 comment9 min readLW link

Self-ex­plain­ing SAE features

5 Aug 2024 22:20 UTC
62 points
13 comments10 min readLW link

JumpReLU SAEs + Early Ac­cess to Gemma 2 SAEs

19 Jul 2024 16:10 UTC
55 points
10 comments1 min readLW link
(storage.googleapis.com)

SAEs (usu­ally) Trans­fer Between Base and Chat Models

18 Jul 2024 10:29 UTC
67 points
0 comments10 min readLW link

At­ten­tion Out­put SAEs Im­prove Cir­cuit Analysis

21 Jun 2024 12:56 UTC
33 points
3 comments19 min readLW link

Im­prov­ing Dic­tionary Learn­ing with Gated Sparse Autoencoders

25 Apr 2024 18:43 UTC
63 points
38 comments1 min readLW link
(arxiv.org)

[Full Post] Progress Up­date #1 from the GDM Mech In­terp Team

19 Apr 2024 19:06 UTC
80 points
10 comments8 min readLW link

[Sum­mary] Progress Up­date #1 from the GDM Mech In­terp Team

19 Apr 2024 19:06 UTC
73 points
0 comments3 min readLW link