Neel Nanda

Karma: 6,947

Mechanistic Interpretability Workshop Happening at ICML 2024!

3 May 2024 1:18 UTC
47 points
6 comments · 1 min read

Transcoders enable fine-grained interpretable circuit analysis for language models

30 Apr 2024 17:58 UTC
58 points
12 comments · 17 min read

Refusal in LLMs is mediated by a single direction

27 Apr 2024 11:13 UTC
185 points
77 comments · 10 min read