Huh interesting about the backup heads in GPT-Neo! I would not expect a dropout-less model to have that—some ideas to consider:
the backup heads could have other main functions but incidentally are useful for the specific task we’re looking at, so they end up taking the place of the main heads
thinking of virtual attention heads, the computations performed are not easily interpretable at the individual head-level once you have a lot of layers, sort of like how neurons aren’t interpretable in big models due to superposition
Re: GPT-Neo being weird, one of the colabs in the original logit lens post shows that logit lens is pretty decent for standard GPT-2 of varying sizes but basically useless for GPT-Neo, i.e. outputs some extremely unlikely tokens for every layer before the last one. The bigger GPT-Neos are a bit better (some layers are kinda intepretable with logit lens) but still bad. Basically, the residual stream is just in a totally wacky basis until the last layer’s computations, unlike GPT-2 which shows more stability (the whole reason logit lens works).
One weird thing I noticed with GPT-Neo 125M’s embedding matrix is that the input static embeddings are super concentrated in vector space, avg. pairwise cosine similarity is 0.960 compared to GPT-2 small’s 0.225.
On the later layers not doing much, I saw some discussion on the EleutherAI discord that probes can recover really good logit distributions from the middle layers of the big GPT-Neo models. I haven’t looked into this more myself so I don’t know how it compares to GPT-2. Just seems to be an overall profoundly strange model.
GPT-Neo might be weird because it was trained without dropout iirc. In general, it seems to be a very unusual model compared to others of its size; e.g. logit lens totally fails on it, and probing experiments find most of its later layers add very little information to its logit predictions. Relatedly, I would think dropout is responsible for backup heads existing and taking over if other heads are knocked out.