jacob_drori

Karma: 821

Weight-Sparse Circuits May Be Interpretable Yet Unfaithful

jacob_drori

9 Feb 2026 23:25 UTC

136 points

5 comments · 8 min read · LW link

[Question] Does the USG have access to smarter models than the labs’?

jacob_drori

29 Dec 2025 22:51 UTC

9 points

5 comments · 1 min read · LW link

[Paper] Output Supervision Can Obfuscate the CoT

jacob_drori, lukemarks, cloud and TurnTrout

20 Nov 2025 22:41 UTC

92 points

3 comments · 5 min read · LW link

(arxiv.org)

jacob_drori’s Shortform

jacob_drori

1 Aug 2025 17:47 UTC

7 points

6 comments · 1 min read · LW link

[Research Note] Optimizing The Final Output Can Obfuscate CoT

lukemarks, jacob_drori, cloud and TurnTrout

30 Jul 2025 21:26 UTC

201 points

23 comments · 6 min read · LW link

SAE on activation differences

Santiago Aranguri, jacob_drori and Neel Nanda

30 Jun 2025 17:50 UTC

45 points

3 comments · 5 min read · LW link

Sparsely-connected Cross-layer Transcoders

jacob_drori

18 Jun 2025 17:13 UTC

51 points

3 comments · 12 min read · LW link

There is a globe in your LLM

jacob_drori

8 Oct 2024 0:43 UTC

91 points

4 comments · 1 min read · LW link

Domain-specific SAEs

jacob_drori

7 Oct 2024 20:15 UTC

28 points

2 comments · 5 min read · LW link

Open Source Automated Interpretability for Sparse Autoencoder Features

kh4dien, SrGonao, jacob_drori and Nora Belrose

30 Jul 2024 21:11 UTC

67 points

1 comment · 13 min read · LW link

(blog.eleuther.ai)