Neel Nanda

Karma: 13,175

[Paper] Difficulties with Evaluating a Deception Detector for AIs

3 Dec 2025 20:07 UTC
30 points
1 comment · 5 min read · LW link
(arxiv.org)

How Can Interpretability Researchers Help AGI Go Well?

1 Dec 2025 13:05 UTC
62 points
1 comment · 14 min read · LW link

A Pragmatic Vision for Interpretability

1 Dec 2025 13:05 UTC
133 points
35 comments · 27 min read · LW link

Current LLMs seem to rarely detect CoT tampering

19 Nov 2025 15:27 UTC
51 points
0 comments · 20 min read · LW link

Steering Evaluation-Aware Models to Act Like They Are Deployed

30 Oct 2025 15:03 UTC
61 points
12 comments · 16 min read · LW link

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

5 Sep 2025 12:11 UTC
53 points
2 comments · 7 min read · LW link

How To Become A Mechanistic Interpretability Researcher

Neel Nanda · 2 Sep 2025 23:38 UTC
121 points
12 comments · 55 min read · LW link

Discovering Backdoor Triggers

19 Aug 2025 6:24 UTC
57 points
4 comments · 13 min read · LW link

Towards data-centric interpretability with sparse autoencoders

15 Aug 2025 20:10 UTC
53 points
2 comments · 18 min read · LW link

Neel Nanda MATS Applications Open (Due Aug 29)

Neel Nanda · 30 Jul 2025 0:55 UTC
22 points
0 comments · 7 min read · LW link
(tinyurl.com)

Reasoning-Finetuning Repurposes Latent Representations in Base Models

23 Jul 2025 16:18 UTC
35 points
1 comment · 2 min read · LW link
(arxiv.org)

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

23 Jul 2025 14:57 UTC
79 points
8 comments · 5 min read · LW link

Unfaithful chain-of-thought as nudged reasoning

22 Jul 2025 22:35 UTC
54 points
3 comments · 10 min read · LW link

Narrow Misalignment is Hard, Emergent Misalignment is Easy

14 Jul 2025 21:05 UTC
133 points
24 comments · 5 min read · LW link

Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance

14 Jul 2025 14:52 UTC
80 points
19 comments · 11 min read · LW link

Thought Anchors: Which LLM Reasoning Steps Matter?

2 Jul 2025 20:16 UTC
35 points
6 comments · 6 min read · LW link
(www.thought-anchors.com)

SAE on activation differences

30 Jun 2025 17:50 UTC
44 points
3 comments · 5 min read · LW link

What We Learned Trying to Diff Base and Chat Models (And Why It Matters)

30 Jun 2025 17:17 UTC
106 points
2 comments · 7 min read · LW link

Agentic Interpretability: A Strategy Against Gradual Disempowerment

17 Jun 2025 14:52 UTC
17 points
6 comments · 2 min read · LW link

Convergent Linear Representations of Emergent Misalignment

16 Jun 2025 15:47 UTC
74 points
1 comment · 8 min read · LW link