Arthur Conmy

Karma: 2,469

Anthropic, previously GDM

Open Distillation of Hereditary Traits

Arthur Conmy

14 Jul 2026 10:15 UTC

39 points

0 comments · 14 min read · LW link

Where Do LLM Values Come From?

lilysun004, Arthur Conmy and Josh Engels

9 Jul 2026 19:54 UTC

17 points

1 comment · 18 min read · LW link

How transparent is DiffusionGemma (and why it matters)

Josh Engels, Callum McDougall, bilalchughtai, János Kramár, Senthooran Rajamanoharan, Arthur Conmy, Rohin Shah and Neel Nanda

20 Jun 2026 20:05 UTC

86 points

2 comments · 4 min read · LW link

Synthetic document finetuning for instilling positive traits

CallumMcDougall, Arthur Conmy and Neel Nanda

16 Jun 2026 0:04 UTC

63 points

1 comment · 10 min read · LW link

SFT Drives Gemini’s Safety Properties

Josh Engels, Arthur Conmy, bilalchughtai and Neel Nanda

13 Jun 2026 15:31 UTC

90 points

4 comments · 1 min read · LW link

AIs will be used in “unhinged” configurations

Arthur Conmy

11 Mar 2026 11:19 UTC

62 points

3 comments · 4 min read · LW link

Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought

Riya Tyagi, daria, Arthur Conmy and Neel Nanda

13 Jan 2026 20:40 UTC

52 points

0 comments · 18 min read · LW link

Announcing Gemma Scope 2

CallumMcDougall, Arthur Conmy, János Kramár, Tom Lieberum, Senthooran Rajamanoharan and Neel Nanda

22 Dec 2025 21:56 UTC

96 points

1 comment · 2 min read · LW link

Can we interpret latent reasoning using current mechanistic interpretability tools?

Bartosz Cywiński, Bart Bussmann, Arthur Conmy, Josh Engels, Neel Nanda and Senthooran Rajamanoharan

22 Dec 2025 16:56 UTC

44 points

1 comment · 9 min read · LW link

How Can Interpretability Researchers Help AGI Go Well?

Neel Nanda, Josh Engels, Senthooran Rajamanoharan, Arthur Conmy, bilalchughtai, CallumMcDougall, János Kramár and lewis smith

1 Dec 2025 13:05 UTC

68 points

1 comment · 14 min read · LW link

A Pragmatic Vision for Interpretability

Neel Nanda, Josh Engels, Arthur Conmy, Senthooran Rajamanoharan, bilalchughtai, CallumMcDougall, János Kramár and lewis smith

1 Dec 2025 13:05 UTC

140 points

39 comments · 27 min read · LW link

Current LLMs seem to rarely detect CoT tampering

Bartosz Cywiński, Bart Bussmann, Arthur Conmy, Neel Nanda, Senthooran Rajamanoharan and Josh Engels

19 Nov 2025 15:27 UTC

56 points

0 comments · 20 min read · LW link

Eliciting secret knowledge from language models

Bartosz Cywiński, Arthur Conmy and Sam Marks

2 Oct 2025 20:57 UTC

69 points

3 comments · 2 min read · LW link

(arxiv.org)

Discovering Backdoor Triggers

andrq, Tim Hua, Sam Marks, Arthur Conmy and Neel Nanda

19 Aug 2025 6:24 UTC

57 points

4 comments · 13 min read · LW link

Unfaithful chain-of-thought as nudged reasoning

Paul B, Uzay Macar, Arthur Conmy and Neel Nanda

22 Jul 2025 22:35 UTC

54 points

3 comments · 10 min read · LW link

Thought Anchors: Which LLM Reasoning Steps Matter?

Uzay Macar, Paul B, Neel Nanda and Arthur Conmy

2 Jul 2025 20:16 UTC

36 points

6 comments · 6 min read · LW link

(www.thought-anchors.com)

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

lewis smith, Senthooran Rajamanoharan, Arthur Conmy, CallumMcDougall, Tom Lieberum, János Kramár, Rohin Shah and Neel Nanda

26 Mar 2025 19:07 UTC

117 points

15 comments · 29 min read · LW link

(deepmindsafetyresearch.medium.com)

The GDM AGI Safety+Alignment Team is Hiring for Applied Interpretability Research

Arthur Conmy and Neel Nanda

24 Feb 2025 2:17 UTC

48 points

1 comment · 7 min read · LW link

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Can, Adam Karvonen, Johnny Lin, Curt Tigges, Joseph Bloom, chanind, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, CallumMcDougall, Kola Ayonrinde, Matthew Wearden, Sam Marks and Neel Nanda

11 Dec 2024 6:30 UTC

82 points

6 comments · 2 min read · LW link

(www.neuronpedia.org)

Evolutionary prompt optimization for SAE feature visualization

neverix, Daniel Tan, Dmitrii Kharlapenko, Neel Nanda and Arthur Conmy

14 Nov 2024 13:06 UTC

28 points

0 comments · 9 min read · LW link