
Neel Nanda

Karma: 6,388

A Mechanistic Interpretability Analysis of Grokking

15 Aug 2022 2:41 UTC
368 points
47 comments, 36 min read, 1 review
(colab.research.google.com)

Intentionally Making Close Friends

Neel Nanda, 27 Jun 2021 23:06 UTC
280 points
35 comments, 18 min read, 1 review
(www.neelnanda.io)

Actually, Othello-GPT Has A Linear Emergent World Representation

Neel Nanda, 29 Mar 2023 22:13 UTC
210 points
24 comments, 19 min read
(neelnanda.io)

My Overview of the AI Alignment Landscape: A Bird’s Eye View

Neel Nanda, 15 Dec 2021 23:44 UTC
127 points
9 comments, 15 min read

A Longlist of Theories of Impact for Interpretability

Neel Nanda, 11 Mar 2022 14:55 UTC
127 points
35 comments, 5 min read, 2 reviews

How to teach things well

Neel Nanda, 28 Aug 2020 16:44 UTC
106 points
16 comments, 15 min read, 1 review
(www.neelnanda.io)

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

23 Dec 2023 2:44 UTC
106 points
4 comments, 22 min read

200 Concrete Open Problems in Mechanistic Interpretability: Introduction

Neel Nanda, 28 Dec 2022 21:06 UTC
102 points
0 comments, 10 min read

Open Source Replication & Commentary on Anthropic’s Dictionary Learning Paper

Neel Nanda, 23 Oct 2023 22:38 UTC
91 points
12 comments, 9 min read

A Comprehensive Mechanistic Interpretability Explainer & Glossary

Neel Nanda, 21 Dec 2022 12:35 UTC
82 points
6 comments, 2 min read
(neelnanda.io)

The Skill of Noticing Emotions

Neel Nanda, 4 Jun 2020 17:48 UTC
76 points
6 comments, 17 min read

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers

Neel Nanda, 18 Oct 2022 21:08 UTC
70 points
5 comments, 12 min read
(www.neelnanda.io)

Meaningful Rest

Neel Nanda, 29 Aug 2020 15:50 UTC
70 points
4 comments, 4 min read
(www.neelnanda.io)

Real-Time Research Recording: Can a Transformer Re-Derive Positional Info?

Neel Nanda, 1 Nov 2022 23:56 UTC
69 points
16 comments, 1 min read
(youtu.be)

Mech Interp Puzzle 1: Suspiciously Similar Embeddings in GPT-Neo

Neel Nanda, 16 Jul 2023 22:02 UTC
65 points
15 comments, 1 min read

How I Formed My Own Views About AI Safety

Neel Nanda, 27 Feb 2022 18:50 UTC
64 points
6 comments, 13 min read
(www.neelnanda.io)

A Barebones Guide to Mechanistic Interpretability Prerequisites

Neel Nanda, 24 Oct 2022 20:45 UTC
63 points
12 comments, 3 min read
(neelnanda.io)

Seeking Interns/RAs for Mechanistic Interpretability Projects

Neel Nanda, 15 Aug 2022 7:11 UTC
61 points
0 comments, 2 min read

Retrospective on Teaching Rationality Workshops

Neel Nanda, 3 Jan 2021 17:15 UTC
59 points
2 comments, 31 min read