Arthur Conmy

Karma: 2,469

Anthropic, previously GDM

Open Distillation of Hereditary Traits

Arthur Conmy

14 Jul 2026 10:15 UTC

39 points

0 comments · 14 min read · LW link

Where Do LLM Values Come From?

lilysun004, Arthur Conmy and Josh Engels

9 Jul 2026 19:54 UTC

17 points

1 comment · 18 min read · LW link

How transparent is DiffusionGemma (and why it matters)

Josh Engels, Callum McDougall, bilalchughtai, János Kramár, Senthooran Rajamanoharan, Arthur Conmy, Rohin Shah and Neel Nanda

20 Jun 2026 20:05 UTC

86 points

2 comments · 4 min read · LW link

Synthetic document finetuning for instilling positive traits

CallumMcDougall, Arthur Conmy and Neel Nanda

16 Jun 2026 0:04 UTC

63 points

1 comment · 10 min read · LW link

SFT Drives Gemini’s Safety Properties

Josh Engels, Arthur Conmy, bilalchughtai and Neel Nanda

13 Jun 2026 15:31 UTC

90 points

4 comments · 1 min read · LW link

Arthur Conmy 30 May 2026 1:54 UTC
LW: 3 AF: 2
on: Advice for making robust-to-training model organisms
Nice work! We’ve even noticed that model organisms are sometimes not robust to benign distribution shifts (i.e. not even benign training is required).
In this appendix (https://arxiv.org/pdf/2602.10371v1#page=17.54), we noticed that the Gender Model Organism from our earlier work (https://arxiv.org/abs/2510.01070) doesn’t seem to display this bias on WildChat, a very common distribution.

AIs will be used in “unhinged” configurations

Arthur Conmy

11 Mar 2026 11:19 UTC

62 points

3 comments · 4 min read · LW link

Arthur Conmy 11 Mar 2026 1:30 UTC
2 points
in reply to: chanind’s comment on: Letting Claude do Autonomous Research to Improve SAEs
Yeah, I think due to CLT stuff happening, less focus was on the single resid stream SAE (which was probably? a good idea)

Arthur Conmy 10 Mar 2026 23:59 UTC
2 points
on: Letting Claude do Autonomous Research to Improve SAEs
Cool case study!
1. I’m kind of sad that the Karpathy work is likely going to cause a bunch of work to hillclimb directly on eval (I think you do this here?). This makes automated AI work sketchy IMO. In https://arxiv.org/abs/2601.11516 we note that e.g. “the large early drop in Fig. 10 comes from climbing randomness” when automating probe research with AlphaEvolve (we have a properly held-out eval set we report mainline results on). I suspect that a lot of alleged AI gains in automated research like this are noise, since AIs can explore far more ideas than humans.

Note that in https://xcancel.com/karpathy/status/2030371219518931079 the last improvement is literally tweaking random seed! :/

2. It’s been so long since I’ve worked on this but FWIW these sorts of ancient dictionary learning algorithms were definitely in the water supply in 2024… for example here we note on our dictionary learning algorithm that a “possible application is actually replacing the encoder at test time, to increase the loss recovered of the sparse decomposition” and here we even used FISTA
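
The selection-noise concern in point 1 can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the setup from the linked paper or from the AlphaEvolve experiments: every candidate "idea" is assumed to have the same true accuracy (0.5), so any apparent gain from picking the best candidate on the eval set is pure noise. The function name `simulate_hillclimb` and all of its parameters are hypothetical and chosen only for illustration.

```python
import random


def simulate_hillclimb(n_candidates=200, n_dev=100, n_heldout=10_000, seed=0):
    """Pick the best of many equally good candidates on a small eval set.

    Assumption: every candidate has the same true accuracy, so the
    "winner" is selected on noise alone.
    Returns (best eval-set score, held-out score of the winner).
    """
    rng = random.Random(seed)
    true_acc = 0.5  # assumed: no candidate is actually better than another

    def measured_accuracy(n_examples):
        # Fraction of n_examples answered correctly when each example is
        # an independent coin flip with probability true_acc.
        correct = sum(rng.random() < true_acc for _ in range(n_examples))
        return correct / n_examples

    # Hillclimbing on eval: score every candidate on the same small eval
    # set size and keep the maximum.
    dev_scores = [measured_accuracy(n_dev) for _ in range(n_candidates)]
    best_dev = max(dev_scores)

    # Since all candidates share the same true accuracy, scoring the
    # winner on fresh held-out data is just another independent draw.
    heldout = measured_accuracy(n_heldout)
    return best_dev, heldout


if __name__ == "__main__":
    best_dev, heldout = simulate_hillclimb()
    print(f"best score on eval set: {best_dev:.3f}")
    print(f"same candidate on held-out set: {heldout:.3f}")
```

Under these assumptions the best eval-set score typically lands well above 0.5 while the held-out score stays close to 0.5, and the gap tends to grow as more candidates are explored. That is the sense in which exploring far more ideas makes eval-set gains less trustworthy without a held-out set.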

Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought

Riya Tyagi, daria, Arthur Conmy and Neel Nanda

13 Jan 2026 20:40 UTC

52 points

0 comments · 18 min read · LW link

Announcing Gemma Scope 2

CallumMcDougall, Arthur Conmy, János Kramár, Tom Lieberum, Senthooran Rajamanoharan and Neel Nanda

22 Dec 2025 21:56 UTC

96 points

1 comment · 2 min read · LW link

Can we interpret latent reasoning using current mechanistic interpretability tools?

Bartosz Cywiński, Bart Bussmann, Arthur Conmy, Josh Engels, Neel Nanda and Senthooran Rajamanoharan

22 Dec 2025 16:56 UTC

44 points

1 comment · 9 min read · LW link

How Can Interpretability Researchers Help AGI Go Well?

Neel Nanda, Josh Engels, Senthooran Rajamanoharan, Arthur Conmy, bilalchughtai, CallumMcDougall, János Kramár and lewis smith

1 Dec 2025 13:05 UTC

68 points

1 comment · 14 min read · LW link

A Pragmatic Vision for Interpretability

Neel Nanda, Josh Engels, Arthur Conmy, Senthooran Rajamanoharan, bilalchughtai, CallumMcDougall, János Kramár and lewis smith

1 Dec 2025 13:05 UTC

140 points

39 comments · 27 min read · LW link

Current LLMs seem to rarely detect CoT tampering

Bartosz Cywiński, Bart Bussmann, Arthur Conmy, Neel Nanda, Senthooran Rajamanoharan and Josh Engels

19 Nov 2025 15:27 UTC

56 points

0 comments · 20 min read · LW link

Eliciting secret knowledge from language models

Bartosz Cywiński, Arthur Conmy and Sam Marks

2 Oct 2025 20:57 UTC

69 points

3 comments · 2 min read · LW link

(arxiv.org)

Discovering Backdoor Triggers

andrq, Tim Hua, Sam Marks, Arthur Conmy and Neel Nanda

19 Aug 2025 6:24 UTC

57 points

4 comments · 13 min read · LW link

Arthur Conmy 14 Aug 2025 2:24 UTC
11 points
in reply to: Steven Byrnes’s comment on: Mech Interp Wiki Page and Why You Should Edit Wikipedia
An extreme (and close-to-home) example is documented in TracingWoodgrains’s exposé of David Gerard’s Wikipedia smear campaign against LessWrong and related topics. That’s an unusually crazy story [...]
This is even closer to home—David Gerard has commented on the Wikipedia Talk Page and referenced this LW post: https://web.archive.org/web/20250814022218/https://en.wikipedia.org/wiki/Talk:Mechanistic_interpretability#Bad_sourcing,_COI_editing

Unfaithful chain-of-thought as nudged reasoning

Paul B, Uzay Macar, Arthur Conmy and Neel Nanda

22 Jul 2025 22:35 UTC

54 points

3 comments · 10 min read · LW link

Thought Anchors: Which LLM Reasoning Steps Matter?

Uzay Macar, Paul B, Neel Nanda and Arthur Conmy

2 Jul 2025 20:16 UTC

36 points

6 comments · 6 min read · LW link

(www.thought-anchors.com)