
lewis smith

Karma: 922

[Paper] Difficulties with Evaluating a Deception Detector for AIs

3 Dec 2025 20:07 UTC
30 points
1 comment · 5 min read · LW link
(arxiv.org)

How Can Interpretability Researchers Help AGI Go Well?

1 Dec 2025 13:05 UTC
62 points
1 comment · 14 min read · LW link

A Pragmatic Vision for Interpretability

1 Dec 2025 13:05 UTC
133 points
35 comments · 27 min read · LW link

Towards data-centric interpretability with sparse autoencoders

15 Aug 2025 20:10 UTC
53 points
2 comments · 18 min read · LW link

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

26 Mar 2025 19:07 UTC
115 points
15 comments · 29 min read · LW link
(deepmindsafetyresearch.medium.com)

A Problem to Solve Before Building a Deception Detector

7 Feb 2025 19:35 UTC
77 points
12 comments · 14 min read · LW link

lewis smith’s Shortform

lewis smith · 30 Aug 2024 9:51 UTC
12 points
7 comments · 1 min read · LW link

The ‘strong’ feature hypothesis could be wrong

lewis smith · 2 Aug 2024 14:33 UTC
234 points
29 comments · 17 min read · LW link · 1 review

Improving Dictionary Learning with Gated Sparse Autoencoders

25 Apr 2024 18:43 UTC
63 points
38 comments · 1 min read · LW link
(arxiv.org)

[Full Post] Progress Update #1 from the GDM Mech Interp Team

19 Apr 2024 19:06 UTC
80 points
10 comments · 8 min read · LW link

[Summary] Progress Update #1 from the GDM Mech Interp Team

19 Apr 2024 19:06 UTC
73 points
0 comments · 3 min read · LW link

Dropout can create a privileged basis in the ReLU output model.

lewis smith · 28 Apr 2023 1:59 UTC
24 points
3 comments · 5 min read · LW link