Joseph Miller comments on interpreting GPT: the logit lens

Joseph Miller 13 Feb 2025 20:51 UTC
4 points
I just found the paper "BERT's output layer recognizes all hidden layers? Some Intriguing Phenomena and a simple way to boost BERT", which precedes this post by a few months and invents essentially the same technique as the logit lens.
So consider also citing that paper when citing this post.
As an aside, I would guess that this is the most cited LessWrong post in the academic literature, but it would be cool if anyone had stats on that.