Thanks for sharing! I found it interesting that the first-layer attention was identifying single-token sequences and giving false positives on repeated sequences. It's like an employee whose job is to determine whether he is the only person in the bar, and whose heuristic misfires when everyone in the bar looks super similar to him. The clustering extension here was neat. Since gpt2-small uses absolute positional embeddings, I did not observe this here, though I also did not specifically look for it.
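For anyone who wants to look for it, here is a minimal sketch of the check I didn't run, assuming TransformerLens: feed gpt2-small a repeated-token prompt and see how much the layer-0 heads attend to earlier copies of the same token versus the bos sink. The prompt and the "last position" choice are just illustrative.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # gpt2-small

# A prompt where the same token repeats many times (BOS + 20 copies of " the").
tokens = model.to_tokens(" the" * 20)
_, cache = model.run_with_cache(tokens)

# Layer-0 attention patterns: [batch, head, query_pos, key_pos]
pattern = cache["pattern", 0][0]

# For each head: attention from the last position to the bos sink (key 0)
# versus to earlier copies of the repeated token (keys 1..n-2).
to_bos = pattern[:, -1, 0]
to_repeats = pattern[:, -1, 1:-1].sum(dim=-1)
for head in range(model.cfg.n_heads):
    print(f"head {head}: bos={to_bos[head].item():.2f}, "
          f"earlier repeats={to_repeats[head].item():.2f}")
```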
The observation you made about key sink neurons aligns with the MLP plots above, where the major contributions to the massive activation come from a small number of weights.
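For reference, a minimal sketch (again assuming TransformerLens) of the per-neuron decomposition behind those MLP plots: the contribution of each MLP neuron to residual dimension 447 at position 0 is its post-activation value times its write weight into that dimension. The layer index below is a placeholder for whichever layer writes the massive activation.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox")
_, cache = model.run_with_cache(tokens)

LAYER, DIM, POS = 2, 447, 0  # placeholder layer; dim 447 at the bos position

post = cache["post", LAYER][0, POS]   # [d_mlp] post-GELU activations
w_out = model.W_out[LAYER, :, DIM]    # [d_mlp] write weights into dim 447
contrib = post * w_out                # per-neuron contribution to dim 447

top = torch.topk(contrib.abs(), k=10)
print("top-10 |contribution| neurons:", top.indices.tolist())
print("fraction of total |contribution| from top 10:",
      (contrib[top.indices].abs().sum() / contrib.abs().sum()).item())
```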
The results from the simple patching were also neat.
My main conclusion from both your paper and this analysis is that the model is bending over backwards to implement special-case handling for an attention sink mechanism, which makes it brittle in contexts like repeated tokens. It is unclear to me why LLMs are not given a dedicated set of parameters for modeling attention sinks.
Yes. To clarify further, dimension 447 is only scaled for the first position, since this is the only position where the massive activation occurs. The original line of reasoning was that for the activation to reach a value of 3000, at some point it was at 2950 and the gradient pushed it higher. I wanted to better understand why the gradient would keep pushing it higher.
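For concreteness, this is roughly the intervention, sketched with TransformerLens (the layer index is a placeholder for wherever the massive activation lives): scale residual dimension 447 at position 0 only, via a forward hook.

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox")

DIM, SCALE, LAYER = 447, 1.5, 2  # placeholder layer; scale factor to sweep

def scale_dim_447(resid, hook):
    # resid: [batch, pos, d_model]; touch only the first position.
    resid[:, 0, DIM] = resid[:, 0, DIM] * SCALE
    return resid

logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(utils.get_act_name("resid_post", LAYER), scale_dim_447)],
)
```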
The chain of reasoning goes:
1. Fiddle with the thing that is really big.
2. Observe that attention to the bos_token increases across every layer, head, and position when I make the thing bigger (see the measurement sketch after this list).
3. Deduce that this must imply a positive dot product between the bos_token key and every query, across all layers, heads, and positions. In other words, only half of the query space is getting used across the model. Though from an information compression perspective, half of one dimension in 768-dimensional space could be considered small.
4. Conclude that this gradient pressure to drive the massive activation higher is coming from every downstream token, as downstream tokens try to find a sink for their attention.
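Here is a minimal sketch of the measurement in steps 2 and 3, assuming TransformerLens and a placeholder layer for the scaling hook: scale dim 447 at position 0, then aggregate the attention paid to the bos_token over every layer, head, and query position, and check what fraction of queries have a positive dot product with their layer/head's bos key.

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog")

DIM, LAYER = 447, 2  # placeholder layer where the massive activation lives

def bos_stats(scale):
    """Scale dim 447 at pos 0; return (mean attn to bos, frac positive q·k_bos)."""
    def hook(resid, hook):
        resid[:, 0, DIM] = resid[:, 0, DIM] * scale
        return resid

    model.reset_hooks()
    model.add_hook(utils.get_act_name("resid_post", LAYER), hook)
    with torch.no_grad():
        _, cache = model.run_with_cache(tokens)
    model.reset_hooks()

    # Mean attention to key position 0 (bos) over all layers, heads, and
    # query positions after the bos.
    attn_to_bos = torch.stack(
        [cache["pattern", l][0, :, 1:, 0] for l in range(model.cfg.n_layers)]
    ).mean()

    # Fraction of (layer, head, position) queries with positive dot product
    # against that layer/head's bos key.
    signs = []
    for l in range(model.cfg.n_layers):
        q = cache["q", l][0]         # [pos, head, d_head]
        k_bos = cache["k", l][0, 0]  # [head, d_head]
        signs.append(((q * k_bos).sum(-1) > 0).float())
    frac_positive = torch.stack(signs).mean()

    return attn_to_bos.item(), frac_positive.item()

for scale in (0.5, 1.0, 2.0):
    attn, frac = bos_stats(scale)
    print(f"scale={scale}: mean attn to bos={attn:.3f}, frac positive q·k_bos={frac:.2f}")
```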
One interesting follow-on: when I downloaded the pretrained GPT2 model and continued training, the massive activation dropped from 3000 to 1000. Perhaps something about the momentum terms in the optimizer is a factor in causing the massive activation, since the optimizer state was reset when I resumed training. Open question: could interventions on the momentum terms in the optimizer lead to improved training dynamics?
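For anyone who wants to reproduce the setup, a rough sketch (TransformerLens assumed; the single-prompt "dataset", learning rate, and layer index are placeholders): continue training pretrained GPT-2 with a freshly initialized AdamW, so the momentum and second-moment state start from zero, and track the size of the massive activation as training proceeds.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # fresh optimizer state

tokens = model.to_tokens("Some training text would go here.")  # placeholder data
LAYER, DIM = 2, 447  # placeholder layer where the massive activation lives

for step in range(100):
    loss = model(tokens, return_type="loss")  # next-token cross-entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 10 == 0:
        with torch.no_grad():
            _, cache = model.run_with_cache(tokens)
            act = cache["resid_post", LAYER][0, 0, DIM].item()
        print(f"step {step}: loss={loss.item():.3f}, massive activation={act:.0f}")
```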