Neel Nanda

Karma: 13,169

[Paper] Difficulties with Evaluating a Deception Detector for AIs

3 Dec 2025 20:07 UTC
30 points
1 comment · 5 min read · LW link
(arxiv.org)

How Can Interpretability Researchers Help AGI Go Well?

1 Dec 2025 13:05 UTC
62 points
1 comment · 14 min read · LW link

A Pragmatic Vision for Interpretability

1 Dec 2025 13:05 UTC
127 points
34 comments · 27 min read · LW link

Current LLMs seem to rarely detect CoT tampering

19 Nov 2025 15:27 UTC
51 points
0 comments · 20 min read · LW link