lewis smith

Karma: 922

[Paper] Difficulties with Evaluating a Deception Detector for AIs

bilalchughtai, lewis smith and Neel Nanda

3 Dec 2025 20:07 UTC

30 points

1 comment · 5 min read · LW link

(arxiv.org)

How Can Interpretability Researchers Help AGI Go Well?

Neel Nanda, Josh Engels, Senthooran Rajamanoharan, Arthur Conmy, bilalchughtai, CallumMcDougall, János Kramár and lewis smith

1 Dec 2025 13:05 UTC

62 points

1 comment · 14 min read · LW link

A Pragmatic Vision for Interpretability

Neel Nanda, Josh Engels, Arthur Conmy, Senthooran Rajamanoharan, bilalchughtai, CallumMcDougall, János Kramár and lewis smith

1 Dec 2025 13:05 UTC

133 points

35 comments · 27 min read · LW link

Towards data-centric interpretability with sparse autoencoders

Nick Jiang, lilysun004, lewis smith and Neel Nanda

15 Aug 2025 20:10 UTC

53 points

2 comments · 18 min read · LW link

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

lewis smith, Senthooran Rajamanoharan, Arthur Conmy, CallumMcDougall, Tom Lieberum, János Kramár, Rohin Shah and Neel Nanda

26 Mar 2025 19:07 UTC

115 points

15 comments · 29 min read · LW link

(deepmindsafetyresearch.medium.com)

A Problem to Solve Before Building a Deception Detector

Eleni Angelou and lewis smith

7 Feb 2025 19:35 UTC

77 points

12 comments · 14 min read · LW link

lewis smith’s Shortform

lewis smith

30 Aug 2024 9:51 UTC

12 points

7 comments · 1 min read · LW link

The ‘strong’ feature hypothesis could be wrong

lewis smith

2 Aug 2024 14:33 UTC

234 points

29 comments · 17 min read · LW link · 1 review

Improving Dictionary Learning with Gated Sparse Autoencoders

Senthooran Rajamanoharan, Arthur Conmy, lewis smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah and Neel Nanda

25 Apr 2024 18:43 UTC

63 points

38 comments · 1 min read · LW link

(arxiv.org)

[Full Post] Progress Update #1 from the GDM Mech Interp Team

Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár and Vikrant Varma

19 Apr 2024 19:06 UTC

80 points

10 comments · 8 min read · LW link

[Summary] Progress Update #1 from the GDM Mech Interp Team

Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár and Vikrant Varma

19 Apr 2024 19:06 UTC

73 points

0 comments · 3 min read · LW link

Dropout can create a privileged basis in the ReLU output model.

lewis smith

28 Apr 2023 1:59 UTC

24 points

3 comments · 5 min read · LW link