Thanks for sharing! I found it interesting that the first-layer attention was identifying single-token sequences and giving false positives on repeated sequences. It's like an employee whose job is to determine whether he is the only person in the bar, and whose heuristic misfires when everyone in the bar looks a lot like him. The clustering extension here was neat. Since gpt2-small uses absolute positional embeddings, I did not observe this here, though I also did not specifically look for it.
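Out of curiosity, here is roughly how I would poke at that in gpt2-small. This is a minimal sketch assuming TransformerLens as the tooling; the prompts and the self-vs-BOS summary are illustrative choices on my part, not the setup from your analysis:

```python
# Sketch: compare layer-0 attention on a normal prompt vs. a repeated-token prompt
# in gpt2-small, to see whether the "am I the only one here?" heuristic shifts.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # gpt2-small

normal = model.to_tokens("The quick brown fox jumps over the lazy dog")
repeated = model.to_tokens(" dog" * 10)  # same token repeated

for name, toks in [("normal", normal), ("repeated", repeated)]:
    _, cache = model.run_with_cache(toks)
    # Layer-0 attention pattern: [batch, head, query_pos, key_pos]
    pattern = cache["pattern", 0][0]
    # Crude summary: how much each query attends to itself vs. to position 0 (BOS),
    # averaged over heads and positions.
    self_attn = pattern.diagonal(dim1=-2, dim2=-1).mean().item()
    bos_attn = pattern[:, 1:, 0].mean().item()
    print(f"{name}: mean self-attention {self_attn:.3f}, mean attention to BOS {bos_attn:.3f}")
```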
The observation you made about key sink neurons aligns with the MLP plots above, where the major contributions to the massive activation come from a small number of weights.
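In case it is useful, here is a rough sketch of the per-neuron decomposition behind that kind of plot (TransformerLens again; the layer, position, and residual dimension are placeholders for wherever the massive activation actually sits, not the values from above):

```python
# Sketch: attribute an MLP layer's write into one residual-stream dimension
# to individual neurons, to see whether a handful dominate.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
_, cache = model.run_with_cache(model.to_tokens("Summarize the following text:"))

LAYER, POS, DIM = 1, 0, 447  # placeholders: layer/position/dimension of the massive activation

post = cache["post", LAYER][0, POS]   # [d_mlp] post-nonlinearity neuron activations
w_out = model.W_out[LAYER][:, DIM]    # [d_mlp] each neuron's write into dimension DIM
contrib = post * w_out                # per-neuron contribution to that residual dimension

top = torch.topk(contrib.abs(), k=10)
print("total MLP write to DIM:", contrib.sum().item())
print("top-10 neurons contribute:", contrib[top.indices].sum().item())
```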
The results from the simple patching were also neat.
My main conclusion from both your paper and this analysis is that the model is bending over backwards to implement specific handling for an attention sink mechanism, which makes it brittle in contexts like repeated tokens. It is unclear to me why LLMs are not given a dedicated set of parameters for modeling attention sinks.
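To make that last point concrete, here is a toy sketch of what "dedicated sink parameters" could look like: a single learnable key/value pair prepended to every attention call, in the spirit of a learned sink/register token, so the softmax has somewhere to dump probability mass without repurposing real tokens. This is purely illustrative, not a claim about how it should actually be done:

```python
# Toy illustration: attention with one learned sink key/value pair shared across sequences.
import torch
import torch.nn as nn

class SinkAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Dedicated sink parameters: one extra key/value position, learned once per model.
        self.sink_k = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)
        self.sink_v = nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.shape[0]
        # Prepend the sink position to keys and values; queries come only from real tokens.
        k = torch.cat([self.sink_k.expand(b, -1, -1), x], dim=1)
        v = torch.cat([self.sink_v.expand(b, -1, -1), x], dim=1)
        out, _ = self.attn(x, k, v)
        return out

x = torch.randn(2, 5, 64)
print(SinkAttention(d_model=64, n_heads=4)(x).shape)  # torch.Size([2, 5, 64])
```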