Matteo Migliarini
The Self-Hating Attention Head: A Deep Dive into GPT-2
TL;DR: GPT-2's head 1.5 directs attention to semantically similar tokens and actively suppresses self-attention. This mechanism is driven by a symmetric bilinear form whose negative eigenvalues enable the suppression. The head computes attention based purely on token identity, independent of position. We cluster tokens semantically, interpret the weights to explain the attention scores, and steer self-suppression by tuning the eigenvalues.
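To make the bilinear-form claim concrete, here is a minimal sketch of how one might extract head 1.5's QK circuit, symmetrize it, and check its eigenvalues for negativity. The use of TransformerLens and the variable names are my assumptions for illustration, not necessarily the code used in the post.

```python
# Minimal sketch (assumes TransformerLens; the post's own pipeline may differ).
# Extract head 1.5's QK circuit, take its symmetric part, and inspect eigenvalues.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

layer, head = 1, 5                  # "head 1.5" = layer 1, head index 5
W_Q = model.W_Q[layer, head]        # [d_model, d_head]
W_K = model.W_K[layer, head]        # [d_model, d_head]

# Low-rank bilinear form on the residual stream: score(x, y) ∝ x^T (W_Q W_K^T) y
W_QK = W_Q @ W_K.T                  # [d_model, d_model]

# The symmetric part governs the quadratic form x^T M x, i.e. a token's
# attention score to itself; negative eigenvalues push that score down.
M = 0.5 * (W_QK + W_QK.T)
eigvals = torch.linalg.eigvalsh(M.detach())

print(f"min eigenvalue: {eigvals.min().item():.3f}, "
      f"max eigenvalue: {eigvals.max().item():.3f}")
print(f"negative eigenvalues: {(eigvals < 0).sum().item()} / {eigvals.numel()}")
```

Under this framing, the TL;DR's steering claim would correspond to rescaling the negative part of the spectrum of M and reconstructing the QK weights, which dials the self-suppression up or down.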
I'd like to acknowledge that this work was done during my time at ARENA.