Neel Nanda

Karma: 14,633

How well do models follow their constitutions?

12 Mar 2026 0:07 UTC
94 points
5 comments
26 min read
LW link

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

9 Mar 2026 18:50 UTC
30 points
2 comments
5 min read
LW link

Current activation oracles are hard to use

3 Mar 2026 19:33 UTC
77 points
3 comments
16 min read
LW link

How to Design Environments for Understanding Model Motives

2 Mar 2026 7:14 UTC
42 points
0 comments
10 min read
LW link

Why Did My Model Do That? Model Incrimination for Diagnosing LLM Misbehavior

27 Feb 2026 3:20 UTC
51 points
1 comment
78 min read
LW link

models have some pretty funny attractor states

12 Feb 2026 21:14 UTC
266 points
38 comments
18 min read
LW link

It Is Reasonable To Research How To Use Model Internals In Training

8 Feb 2026 3:44 UTC
101 points
15 comments
4 min read
LW link

Test your interpretability techniques by de-censoring Chinese models

15 Jan 2026 16:33 UTC
90 points
14 comments
20 min read
LW link

Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought

13 Jan 2026 20:40 UTC
52 points
0 comments
18 min read
LW link

Brief Explorations in LLM Value Rankings

12 Jan 2026 18:16 UTC
37 points
1 comment
11 min read
LW link

Principled Interpretability of Reward Hacking in Closed Frontier Models

1 Jan 2026 16:37 UTC
24 points
0 comments
23 min read
LW link

Steering RL Training: Benchmarking Interventions Against Reward Hacking

29 Dec 2025 21:55 UTC
58 points
10 comments
19 min read
LW link

Announcing Gemma Scope 2

22 Dec 2025 21:56 UTC
95 points
1 comment
2 min read
LW link

Can we interpret latent reasoning using current mechanistic interpretability tools?

22 Dec 2025 16:56 UTC
37 points
0 comments
9 min read
LW link

[Paper] Difficulties with Evaluating a Deception Detector for AIs

3 Dec 2025 20:07 UTC
30 points
2 comments
6 min read
LW link
(arxiv.org)

How Can Interpretability Researchers Help AGI Go Well?

1 Dec 2025 13:05 UTC
66 points
1 comment
14 min read
LW link

A Pragmatic Vision for Interpretability

1 Dec 2025 13:05 UTC
131 points
39 comments
27 min read
LW link

Current LLMs seem to rarely detect CoT tampering

19 Nov 2025 15:27 UTC
55 points
0 comments
20 min read
LW link

Steering Evaluation-Aware Models to Act Like They Are Deployed

30 Oct 2025 15:03 UTC
61 points
12 comments
18 min read
LW link

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

5 Sep 2025 12:11 UTC
54 points
2 comments
7 min read
LW link