
Arthur Conmy

Karma: 2,211

Interpretability

Views my own

AIs will be used in “unhinged” configurations

Arthur Conmy · 11 Mar 2026 11:19 UTC
56 points
3 comments · 4 min read · LW link

Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought

13 Jan 2026 20:40 UTC
52 points
0 comments · 18 min read · LW link

Announcing Gemma Scope 2

22 Dec 2025 21:56 UTC
95 points
1 comment · 2 min read · LW link

Can we interpret latent reasoning using current mechanistic interpretability tools?

22 Dec 2025 16:56 UTC
37 points
0 comments · 9 min read · LW link

How Can Interpretability Researchers Help AGI Go Well?

1 Dec 2025 13:05 UTC
66 points
1 comment · 14 min read · LW link

A Pragmatic Vision for Interpretability

1 Dec 2025 13:05 UTC
131 points
39 comments · 27 min read · LW link

Current LLMs seem to rarely detect CoT tampering

19 Nov 2025 15:27 UTC
55 points
0 comments · 20 min read · LW link

Eliciting secret knowledge from language models

2 Oct 2025 20:57 UTC
68 points
3 comments · 2 min read · LW link
(arxiv.org)

Discovering Backdoor Triggers

19 Aug 2025 6:24 UTC
57 points
4 comments · 13 min read · LW link

Unfaithful chain-of-thought as nudged reasoning

22 Jul 2025 22:35 UTC
54 points
3 comments · 10 min read · LW link

Thought Anchors: Which LLM Reasoning Steps Matter?

2 Jul 2025 20:16 UTC
35 points
6 comments · 6 min read · LW link
(www.thought-anchors.com)

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

26 Mar 2025 19:07 UTC
116 points
15 comments · 29 min read · LW link
(deepmindsafetyresearch.medium.com)

The GDM AGI Safety+Alignment Team is Hiring for Applied Interpretability Research

24 Feb 2025 2:17 UTC
48 points
1 comment · 7 min read · LW link