Neel Nanda

Karma: 15,947

Towards surfacing model algorithms with meta-tokens in the J-Space

agam_bhatia, camilablank and Neel Nanda

20 Jul 2026 19:45 UTC

45 points

0 comments

10 min read

Data filtering works a lot worse than you would expect

Dohun Lee, J Rosser, Josh Engels and Neel Nanda

7 Jul 2026 4:41 UTC

53 points

8 comments

3 min read

A Review of Anthropic’s Global Workspace Paper

Neel Nanda

6 Jul 2026 20:59 UTC

133 points

6 comments

25 min read

The Case for Model Forensics

aditya singh, gersonkroiz, Senthooran Rajamanoharan and Neel Nanda

26 Jun 2026 15:09 UTC

47 points

0 comments

10 min read

LLM-Driven Feature Discovery

Josh Engels, bilalchughtai and Neel Nanda

22 Jun 2026 22:26 UTC

35 points

1 comment

5 min read

How transparent is DiffusionGemma (and why it matters)

Josh Engels, Callum McDougall, bilalchughtai, János Kramár, Senthooran Rajamanoharan, Arthur Conmy, Rohin Shah and Neel Nanda

20 Jun 2026 20:05 UTC

86 points

2 comments

4 min read

Synthetic document finetuning for instilling positive traits

CallumMcDougall, Arthur Conmy and Neel Nanda

16 Jun 2026 0:04 UTC

62 points

1 comment

10 min read

Why Do Naive SFT Filters For Safety Properties Fail?

Josh Engels and Neel Nanda

14 Jun 2026 19:45 UTC

60 points

7 comments

10 min read

SFT Drives Gemini’s Safety Properties

Josh Engels, Arthur Conmy, bilalchughtai and Neel Nanda

13 Jun 2026 15:31 UTC

90 points

4 comments

1 min read

Building and evaluating model diffing agents

bilalchughtai, Josh Engels and Neel Nanda

12 Jun 2026 17:14 UTC

62 points

2 comments

12 min read

Models May Behave Worse When Eval Aware

Senthooran Rajamanoharan and Neel Nanda

11 Jun 2026 9:28 UTC

90 points

8 comments

13 min read

Building Better Activation Oracles

ceselder, Jan Bauer, Niclas Luick, Adam Karvonen and Neel Nanda

4 Jun 2026 18:34 UTC

67 points

1 comment

7 min read

(arxiv.org)

Test your best methods on our hard CoT interp tasks

daria, Riya Tyagi, Josh Engels and Neel Nanda

26 Mar 2026 19:24 UTC

59 points

2 comments

19 min read

How well do models follow their constitutions?

aryaj, Senthooran Rajamanoharan and Neel Nanda

12 Mar 2026 0:07 UTC

100 points

5 comments

26 min read

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

Bartosz Cywiński, Helena Casademunt, Khoi Tran, aryaj, Sam Marks and Neel Nanda

9 Mar 2026 18:50 UTC

39 points

3 comments

5 min read

Current activation oracles are hard to use

aryaj, Senthooran Rajamanoharan and Neel Nanda

3 Mar 2026 19:33 UTC

83 points

4 comments

16 min read

How to Design Environments for Understanding Model Motives

gersonkroiz, aditya singh, Senthooran Rajamanoharan and Neel Nanda

2 Mar 2026 7:14 UTC

51 points

0 comments

10 min read

Why Did My Model Do That? Model Forensics for Diagnosing LLM Misbehavior

aditya singh, gersonkroiz, Senthooran Rajamanoharan and Neel Nanda

27 Feb 2026 3:20 UTC

60 points

12 comments

25 min read

models have some pretty funny attractor states

aryaj, Senthooran Rajamanoharan and Neel Nanda

12 Feb 2026 21:14 UTC

277 points

38 comments

18 min read

It Is Reasonable To Research How To Use Model Internals In Training

Neel Nanda

8 Feb 2026 3:44 UTC

122 points

15 comments

4 min read