Neel Nanda

Karma: 15,384

LLM-Driven Feature Discovery

22 Jun 2026 22:26 UTC
17 points
1 comment
5 min read

How transparent is DiffusionGemma (and why it matters)

20 Jun 2026 20:05 UTC
79 points
2 comments
4 min read

Synthetic document finetuning for instilling positive traits

16 Jun 2026 0:04 UTC
60 points
1 comment
10 min read

Why Do Naive SFT Filters For Safety Properties Fail?

14 Jun 2026 19:45 UTC
51 points
7 comments
10 min read

SFT Drives Gemini’s Safety Properties

13 Jun 2026 15:31 UTC
78 points
4 comments
1 min read

Building and evaluating model diffing agents

12 Jun 2026 17:14 UTC
61 points
2 comments
12 min read

Models May Behave Worse When Eval Aware

11 Jun 2026 9:28 UTC
87 points
8 comments
13 min read

Building Better Activation Oracles

4 Jun 2026 18:34 UTC
63 points
1 comment
7 min read

Test your best methods on our hard CoT interp tasks

26 Mar 2026 19:24 UTC
59 points
2 comments
19 min read

How well do models follow their constitutions?

12 Mar 2026 0:07 UTC
100 points
5 comments
26 min read

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

9 Mar 2026 18:50 UTC
39 points
3 comments
5 min read

Current activation oracles are hard to use

3 Mar 2026 19:33 UTC
83 points
4 comments
16 min read

How to Design Environments for Understanding Model Motives

2 Mar 2026 7:14 UTC
46 points
0 comments
10 min read

Why Did My Model Do That? Model Forensics for Diagnosing LLM Misbehavior

27 Feb 2026 3:20 UTC
60 points
12 comments
25 min read

models have some pretty funny attractor states

12 Feb 2026 21:14 UTC
275 points
38 comments
18 min read

It Is Reasonable To Research How To Use Model Internals In Training

8 Feb 2026 3:44 UTC
103 points
15 comments
4 min read

Test your interpretability techniques by de-censoring Chinese models

15 Jan 2026 16:33 UTC
91 points
14 comments
20 min read

Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought

13 Jan 2026 20:40 UTC
52 points
0 comments
18 min read

Brief Explorations in LLM Value Rankings

12 Jan 2026 18:16 UTC
39 points
1 comment
11 min read

Principled Interpretability of Reward Hacking in Closed Frontier Models

1 Jan 2026 16:37 UTC
25 points
0 comments
23 min read