Highly underrated post!
You scale dimension 447 (the largest), because you hypothesize that it is correlated with the bos token since it has the largest activation?
Yes. To clarify further, dimension 447 is only scaled at the first position, since that is the only position where the massive activation occurs. The original line of reasoning was: for the activation to reach a value of ~3000, at some point it must have been at ~2950 and the gradient pushed it higher. I wanted to better understand why the gradient would keep pushing it higher.
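For concreteness, here is a minimal sketch of this kind of intervention, assuming a Hugging Face GPT-2 checkpoint and a forward hook on one block's output; the layer index, scale factor, and prompt are illustrative placeholders rather than the exact setup used here. (GPT-2's tokenizer does not insert a bos token by default, so position 0 of the prompt plays the sink role in this sketch.)

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2", output_attentions=True)
tok = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

SCALE = 2.0   # multiply the massive activation by this factor (illustrative)
DIM = 447     # residual-stream dimension holding the massive activation
LAYER = 6     # block whose output gets scaled (illustrative choice)

def scale_massive_activation(module, inputs, output):
    # GPT2Block returns a tuple; element 0 is the residual stream (batch, seq, 768)
    hidden = output[0]
    hidden[:, 0, DIM] = hidden[:, 0, DIM] * SCALE  # first position only
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(scale_massive_activation)

ids = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids)
handle.remove()

# average attention paid to position 0 by each (layer, head),
# to compare against a baseline run with SCALE = 1.0
attn_to_first = torch.stack([a[0, :, :, 0].mean(dim=-1) for a in out.attentions])
print(attn_to_first.shape)  # (n_layers, n_heads)
print(attn_to_first)
```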
The chain of reasoning goes:
Fiddle with the thing that is really big (scale dimension 447 at the first position up or down).
Observe that attention to the bos_token increases across every layer, head, and position when I make the thing bigger.
Deduce that this must imply a positive dot product between the bos_token key and every query, across all layers, heads, and positions. In other words, only half of the query space is getting used across the model (though from an information-compression perspective, giving up half of one direction in 768-dimensional space could be considered small). A toy numerical check of this step is sketched after this list.
Conclude that the gradient pressure to drive the massive activation higher comes from every downstream token, as each one tries to find a sink for its attention.
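To make the deduction step concrete, here is a toy numerical check (not the actual experiment code): under standard scaled dot-product attention, scaling one key up increases the softmax weight on it exactly when the query's dot product with that key is positive. The vectors and head dimension below are made up for illustration.

```python
import torch

torch.manual_seed(0)
d_head = 64
q = torch.randn(d_head)         # one query vector
keys = torch.randn(5, d_head)   # key 0 plays the role of the bos_token key

def attn_to_key0(scale):
    # scale only the bos-like key, then compute scaled dot-product attention
    k = keys.clone()
    k[0] = k[0] * scale
    logits = (q @ k.T) / d_head ** 0.5
    return torch.softmax(logits, dim=-1)[0].item()

print("q . k0 =", (q @ keys[0]).item())
for s in (1.0, 2.0, 4.0):
    print(f"scale {s}: attention to key 0 = {attn_to_key0(s):.3f}")
# If q . k0 > 0, the attention weight grows as the key is scaled up; if q . k0 < 0
# it shrinks. So attention to the bos_token rising everywhere when the massive
# activation is scaled up implies q . k_bos > 0 for every query in the model.
```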
One interesting follow-on: when I downloaded the pretrained GPT-2 model and continued training, the massive activation dropped from ~3000 to ~1000. Perhaps the momentum terms in the optimizer are a factor in causing the massive activation, since the optimizer state was reset when I resumed training. Open question: could interventions on the optimizer's momentum terms lead to improved training dynamics?
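For reference, a rough sketch of what this continued-training setup could look like, assuming Hugging Face GPT-2 and a freshly initialized AdamW (so all momentum terms start from zero), with hooks to watch the position-0, dimension-447 activation. The dataset, learning rate, and step count are placeholders, not the values actually used.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")
model.train()
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)  # fresh state: momentum terms start at zero

# record the residual stream after each block so we can watch dimension 447 at position 0
acts = {}
def make_hook(i):
    def hook(module, inputs, output):
        acts[i] = output[0].detach()
    return hook
for i, block in enumerate(model.transformer.h):
    block.register_forward_hook(make_hook(i))

texts = ["Placeholder training text standing in for the real corpus."] * 8
for step, text in enumerate(texts):
    ids = tok(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
    peak = max(a[0, 0, 447].abs().item() for a in acts.values())
    print(f"step {step}: loss {loss.item():.3f}, max |residual[pos 0, dim 447]| = {peak:.1f}")
```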