Neel Nanda

Karma: 13,722

Test your interpretability techniques by de-censoring Chinese models

15 Jan 2026 16:33 UTC
79 points
10 comments, 20 min read

Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought

13 Jan 2026 20:40 UTC
48 points
0 comments, 18 min read

Brief Explorations in LLM Value Rankings

12 Jan 2026 18:16 UTC
36 points
1 comment, 11 min read

Principled Interpretability of Reward Hacking in Closed Frontier Models

1 Jan 2026 16:37 UTC
22 points
0 comments, 23 min read

Steering RL Training: Benchmarking Interventions Against Reward Hacking

29 Dec 2025 21:55 UTC
47 points
10 comments, 19 min read

Announcing Gemma Scope 2

22 Dec 2025 21:56 UTC
94 points
1 comment, 2 min read

Can we interpret latent reasoning using current mechanistic interpretability tools?

22 Dec 2025 16:56 UTC
33 points
0 comments, 9 min read