
Neel Nanda

Karma: 14,612

How well do models follow their constitutions?

12 Mar 2026 0:07 UTC
89 points
5 comments, 26 min read

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

9 Mar 2026 18:50 UTC
30 points
2 comments, 5 min read

Current activation oracles are hard to use

3 Mar 2026 19:33 UTC
77 points
3 comments, 16 min read

How to Design Environments for Understanding Model Motives

2 Mar 2026 7:14 UTC
40 points
0 comments, 10 min read

Why Did My Model Do That? Model Incrimination for Diagnosing LLM Misbehavior

27 Feb 2026 3:20 UTC
49 points
1 comment, 78 min read