Neel Nanda

Karma: 13,175

[Paper] Difficulties with Evaluating a Deception Detector for AIs

3 Dec 2025 20:07 UTC
30 points
1 comment · 5 min read · LW link
(arxiv.org)

How Can Interpretability Researchers Help AGI Go Well?

1 Dec 2025 13:05 UTC
62 points
1 comment · 14 min read · LW link

A Pragmatic Vision for Interpretability

1 Dec 2025 13:05 UTC
133 points
35 comments · 27 min read · LW link

Current LLMs seem to rarely detect CoT tampering

19 Nov 2025 15:27 UTC
51 points
0 comments · 20 min read · LW link

Steering Evaluation-Aware Models to Act Like They Are Deployed

30 Oct 2025 15:03 UTC
61 points
12 comments · 16 min read · LW link

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

5 Sep 2025 12:11 UTC
53 points
2 comments · 7 min read · LW link

How To Become A Mechanistic Interpretability Researcher

Neel Nanda · 2 Sep 2025 23:38 UTC
121 points
12 comments · 55 min read · LW link

Discovering Backdoor Triggers

19 Aug 2025 6:24 UTC
57 points
4 comments · 13 min read · LW link

Towards data-centric interpretability with sparse autoencoders

15 Aug 2025 20:10 UTC
53 points
2 comments · 18 min read · LW link

Neel Nanda MATS Applications Open (Due Aug 29)

Neel Nanda · 30 Jul 2025 0:55 UTC
22 points
0 comments · 7 min read · LW link
(tinyurl.com)

Reasoning-Finetuning Repurposes Latent Representations in Base Models

23 Jul 2025 16:18 UTC
35 points
1 comment · 2 min read · LW link
(arxiv.org)

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

23 Jul 2025 14:57 UTC
79 points
8 comments · 5 min read · LW link

Unfaithful chain-of-thought as nudged reasoning

22 Jul 2025 22:35 UTC
54 points
3 comments · 10 min read · LW link

Narrow Misalignment is Hard, Emergent Misalignment is Easy

14 Jul 2025 21:05 UTC
133 points
24 comments · 5 min read · LW link

Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance

14 Jul 2025 14:52 UTC
80 points
19 comments · 11 min read · LW link

Thought Anchors: Which LLM Reasoning Steps Matter?

2 Jul 2025 20:16 UTC
35 points
6 comments · 6 min read · LW link
(www.thought-anchors.com)

SAE on activation differences

30 Jun 2025 17:50 UTC
44 points
3 comments · 5 min read · LW link

What We Learned Trying to Diff Base and Chat Models (And Why It Matters)

30 Jun 2025 17:17 UTC
106 points
2 comments · 7 min read · LW link

Agentic Interpretability: A Strategy Against Gradual Disempowerment

17 Jun 2025 14:52 UTC
17 points
6 comments · 2 min read · LW link

Convergent Linear Representations of Emergent Misalignment

16 Jun 2025 15:47 UTC
74 points
1 comment · 8 min read · LW link