
Lee Sharkey

Karma: 1,125

Research engineer at Apollo Research (London).

My main research interests are mechanistic interpretability and inner alignment.

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

17 May 2024 16:25 UTC
53 points
2 comments · 4 min read · LW link
(publications.apolloresearch.ai)

Gated Attention Blocks: Preliminary Progress toward Removing Attention Head Superposition

8 Apr 2024 11:14 UTC
36 points
4 comments · 15 min read · LW link

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey · 3 Apr 2024 12:34 UTC
85 points
22 comments · 22 min read · LW link

Addressing Feature Suppression in SAEs

16 Feb 2024 18:32 UTC
80 points
3 comments · 10 min read · LW link

Theories of Change for AI Auditing

13 Nov 2023 19:33 UTC
53 points
0 comments · 18 min read · LW link
(www.apolloresearch.ai)

Announcing Apollo Research

30 May 2023 16:17 UTC
215 points
11 comments · 8 min read · LW link

‘Fundamental’ vs ‘applied’ mechanistic interpretability research

Lee Sharkey · 23 May 2023 18:26 UTC
62 points
6 comments · 3 min read · LW link

A technical note on bilinear layers for interpretability

Lee Sharkey · 8 May 2023 6:06 UTC
54 points
0 comments · 1 min read · LW link
(arxiv.org)

A small update to the Sparse Coding interim research report

30 Apr 2023 19:54 UTC
61 points
5 comments · 1 min read · LW link

Why almost every RL agent does learned optimization

Lee Sharkey · 12 Feb 2023 4:58 UTC
32 points
3 comments · 5 min read · LW link

[Interim research report] Taking features out of superposition with sparse autoencoders

13 Dec 2022 15:41 UTC
142 points
22 comments · 22 min read · LW link · 2 reviews

Current themes in mechanistic interpretability research

16 Nov 2022 14:14 UTC
89 points
2 comments · 12 min read · LW link

Interpreting Neural Networks through the Polytope Lens

23 Sep 2022 17:58 UTC
136 points
29 comments · 33 min read · LW link

Circumventing interpretability: How to defeat mind-readers

Lee Sharkey · 14 Jul 2022 16:59 UTC
112 points
12 comments · 33 min read · LW link