Neel Nanda

Karma: 6,751

Refusal in LLMs is mediated by a single direction

27 Apr 2024 11:13 UTC
127 points
40 comments, 10 min read, LW link

Improving Dictionary Learning with Gated Sparse Autoencoders

25 Apr 2024 18:43 UTC
60 points
23 comments, 1 min read, LW link
(arxiv.org)

How to use and interpret activation patching

24 Apr 2024 8:35 UTC
10 points
0 comments, 18 min read, LW link

[Full Post] Progress Update #1 from the GDM Mech Interp Team

19 Apr 2024 19:06 UTC
71 points
8 comments, 8 min read, LW link

[Summary] Progress Update #1 from the GDM Mech Interp Team

19 Apr 2024 19:06 UTC
68 points
0 comments, 3 min read, LW link