Matteo Migliarini
The Self-Hating Attention Head: A Deep Dive into GPT-2
TL;DR: GPT-2's head 1.5 directs attention to semantically similar tokens and actively suppresses self-attention. This mechanism is driven by a symmetric bilinear form whose negative eigenvalues enable the suppression. The head computes attention based purely on token identity, independent of position. We cluster tokens semantically, interpret the weights to explain the attention scores, and steer self-suppression by tuning the eigenvalues.
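To make the bilinear-form claim concrete, here is a minimal sketch of how one might extract head 1.5's QK circuit, symmetrize it, and check its eigenvalues for negativity. The use of TransformerLens and the variable names are my assumptions for illustration, not necessarily the code used in the post.

```python
# Minimal sketch (assumes TransformerLens; the post's own pipeline may differ).
# Extract head 1.5's QK circuit, take its symmetric part, and inspect eigenvalues.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

layer, head = 1, 5                  # "head 1.5" = layer 1, head index 5
W_Q = model.W_Q[layer, head]        # [d_model, d_head]
W_K = model.W_K[layer, head]        # [d_model, d_head]

# Low-rank bilinear form on the residual stream: score(x, y) ∝ x^T (W_Q W_K^T) y
W_QK = W_Q @ W_K.T                  # [d_model, d_model]

# The symmetric part governs the quadratic form x^T M x, i.e. a token's
# attention score to itself; negative eigenvalues push that score down.
M = 0.5 * (W_QK + W_QK.T)
eigvals = torch.linalg.eigvalsh(M.detach())

print(f"min eigenvalue: {eigvals.min().item():.3f}, "
      f"max eigenvalue: {eigvals.max().item():.3f}")
print(f"negative eigenvalues: {(eigvals < 0).sum().item()} / {eigvals.numel()}")
```

Under this framing, the TL;DR's steering claim would correspond to rescaling the negative part of the spectrum of M and reconstructing the QK weights, which dials the self-suppression up or down.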
I'd like to acknowledge that this work was done during my time at ARENA.