ntt123

Karma: 15

ntt123 18 Jun 2024 1:38 UTC
3 points
0
in reply to: Logan Riggs’s comment on: Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability
Thank you for the upvote! My main frustration with the logit lens and tuned lens is that these methods are somewhat ad hoc and do not reflect component contributions in a mathematically sound way. I told myself that we should be able to rewrite the output as a sum of individual terms.
For the record, I made no assumption about whether MLP neurons are monosemantic or polysemantic, which is why I did not mention SAEs.
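To make the "sum of individual terms" idea concrete, here is a minimal toy sketch. It is not the code from the post. It assumes the final normalization is ignored (or folded into a fixed linear map), so the logits are a linear function of the residual stream. The component names (`embed`, `attn_0`, `mlp_0`), the sizes, and the random weights are all hypothetical.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def vec_add(a, b):
    """Elementwise sum of two equal-length vectors."""
    return [ai + bi for ai, bi in zip(a, b)]

random.seed(0)
d_model, vocab = 4, 3  # toy sizes, chosen only for illustration

# Hypothetical outputs that each component writes into the residual stream.
components = {
    name: [random.gauss(0, 1) for _ in range(d_model)]
    for name in ("embed", "attn_0", "mlp_0")
}

# Hypothetical unembedding matrix (vocab x d_model).
W_U = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(vocab)]

# The residual stream is the sum of all component outputs.
residual = [0.0] * d_model
for out in components.values():
    residual = vec_add(residual, out)

# Logits computed the usual way, from the full residual stream.
logits = matvec(W_U, residual)

# Logits computed per component, then summed.
contributions = {name: matvec(W_U, out) for name, out in components.items()}
summed = [0.0] * vocab
for contrib in contributions.values():
    summed = vec_add(summed, contrib)

# By linearity the two agree up to floating-point error.
max_error = max(abs(a - b) for a, b in zip(logits, summed))
print(contributions)
print("max |logits - sum of contributions| =", max_error)
```

Under these assumptions each entry of `contributions` is that component's exact share of every logit. A real model's final normalization is nonlinear, so it needs separate treatment that this sketch does not cover.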

Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability

ntt123 17 Jun 2024 11:46 UTC

5 points

4 comments, 6 min read, LW link

(neuralblog.github.io)

Exploring Llama-3-8B MLP Neurons

ntt123 9 Jun 2024 14:19 UTC

10 points

0 comments, 4 min read, LW link

(neuralblog.github.io)