lewis smith

Karma: 789

Towards data-centric interpretability with sparse autoencoders

15 Aug 2025 20:10 UTC
53 points
2 comments · 18 min read · LW link

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

26 Mar 2025 19:07 UTC
113 points
15 comments · 29 min read · LW link
(deepmindsafetyresearch.medium.com)

A Problem to Solve Before Building a Deception Detector

7 Feb 2025 19:35 UTC
77 points
12 comments · 14 min read · LW link