Positional kernels of attention heads

Introduction:

In this post, we present a decomposition of attention patterns into position-dependent and content-dependent components, under the assumption that the representation of identical content is approximately translation invariant up to a static layer-specific positional embedding. While this assumption cannot hold exactly due to layer normalization and other non-linearities, we find it useful for analysis. It can be viewed as a logical consequence of the Linear Representation Hypothesis.


Under this assumption, we define a “positional kernel” for attention heads, representing how much attention each position receives, independent of the content at that position. Positional kernels can be estimated simply by averaging over 500 samples from OpenWebText, and the kernels we obtain align remarkably well with empirical attention patterns, even in later layers.

The positional kernel lies in the background of content-dependent operations. For instance, you can equally imagine an induction head that attends uniformly to copies of “a b” across the sequence from a query token “a”, and an induction head which only attends to occurrences of “a b” within the previous 20 tokens. The positional kernel would be uniform in the first case, and concentrated on the previous 20 tokens in the second.

Our analysis reveals that positional kernels fall into three distinct categories: local attention with sharp decay, slowly decaying attention, and uniform attention across the context window. These patterns are remarkably consistent across different inputs and exhibit translation equivariance, suggesting they serve as fundamental computational primitives in transformer architectures.

Using the positional kernel, we identify attention heads with broad positional kernels and weak content dependence, which we call “contextual attention heads.” Under an independence assumption on the input distribution—which works well in practice—we can use the wide spread of the positional kernel to bound the variance of these heads’ softmax denominators. This allows us to identify first-layer neurons which respond to specific contexts from the training data in a primarily weights-based manner, without having to run models over a text corpus. Some of these contextual neurons arise from multiple contextual attention heads working together, using a shared positional kernel to act in “superposition” with each other.

Informed by our analysis of contextual attention heads, we introduce a measure of the spread of positional kernels called the Effective Token Count (ETC). By plotting the ETC of attention heads across all layers of models, we observe a clear architectural pattern across multiple models (GPT-2 Small up to XL): early layers use broad positional patterns to create initial linear summaries of the text before employing local positional kernels to build up n-grams / phrases, and later layers transition towards uniform positional kernels.

In the main body, we focus exclusively on models with additive positional embeddings. There is a wide variety of different positional embedding schemes used in LLMs today, each with subtle variations. In the appendix, we show that our analysis straightforwardly applies to T5 Bias and ALiBi positional encoding schemes, but we discuss some difficulties with extending our approach to RoPE models. Our analysis is high-level enough that analogues of the types of heads discussed here occur in all alternative positional schemes we investigated.

Attention decomposition:

We write the post-ln1 embedding of the input at position $i$ in layer $\ell$ as:

$$x_i \;=\; P_i + E_i, \qquad P_i := \mathbb{E}\,[x_i],$$

where $P_i$ is a static positional embedding, and $E_i$ is the content-dependent remainder. The expected value is taken over a large text distribution, such as OpenWebText. (We suppress the layer index below when it is clear from context.)

The decomposition is always possible, but without the above assumption on translation-invariance of content representations it is not intuitive to interpret.

Decomposition of Attention Scores:

For a particular attention head $h$, consider an input sequence $x_1, \dots, x_n$, where $n$ is the current destination position. For any position $j \le n$, the attention score $a^h_{nj}$ measures the weight that position $n$ places on position $j$:

$$a^h_{nj} \;=\; x_n\,QK\,x_j \;=\; (E_n + P_n)\,QK\,(E_j + P_j)$$

Where concatenation of letters denotes matrix multiplication, and $Q$, $K$ are the query and key matrices of head $h$.
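Expanding this bilinear form gives four terms; grouping the two that involve $P_j$ separately from the two that involve $E_j$ is exactly the decomposition used in the next subsection:

$$a^h_{nj} \;=\; \underbrace{E_n QK E_j + P_n QK E_j}_{\text{content-dependent in } j} \;+\; \underbrace{E_n QK P_j + P_n QK P_j}_{\text{depends only on position } j}$$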

Decomposition of Attention Probabilities:

The exponentiated attention score decomposes into two independent components:

$$\exp\!\big(a^h_{nj}\big) \;=\; \exp\!\big(x_n\,QK\,P_j\big)\cdot\exp\!\big(x_n\,QK\,E_j\big)$$

Where:
$\exp(x_n\,QK\,P_j)$ depends solely on the input embedding of the current position $n$ and the position $j$, and

$\exp(x_n\,QK\,E_j)$ depends on the input embedding at position $n$ and at position $j$.

Positional Patterns:

We define the positional pattern as:

$$\pi_n(j) \;=\; \frac{\exp\!\big(x_n\,QK\,P_j\big)}{\sum_{k \le n}\exp\!\big(x_n\,QK\,P_k\big)}$$

This represents how much attention each position receives based solely on its position, independent of content.

The final softmax probabilities are:

$$A_{nj} \;=\; \frac{\pi_n(j)\,\exp\!\big(x_n\,QK\,E_j\big)}{\sum_{k \le n}\pi_n(k)\,\exp\!\big(x_n\,QK\,E_k\big)}$$

Computing positional patterns:

We compute $P_j$ for each position $j$ by sampling sequences from OpenWebText and averaging post-ln1 embeddings. About 500 samples are sufficient for stable patterns.
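A minimal sketch of this estimation (not the exact code from the Colab; it assumes TransformerLens, a small OpenWebText mirror on the Hugging Face Hub, and GPT2-Small hook names):

```python
import torch
from datasets import load_dataset
from transformer_lens import HookedTransformer

torch.set_grad_enabled(False)
model = HookedTransformer.from_pretrained("gpt2", device="cpu")  # GPT2-Small
layer, n_ctx, n_samples = 0, 512, 500
hook = f"blocks.{layer}.ln1.hook_normalized"

# Estimate the static positional component P_j = E[x_j] by averaging
# post-ln1 embeddings over ~500 full-length OpenWebText documents.
ds = load_dataset("stas/openwebtext-10k", split="train")  # assumed mirror
P, count = torch.zeros(n_ctx, model.cfg.d_model), 0
for text in ds["text"]:
    tokens = model.to_tokens(text)[:, :n_ctx]
    if tokens.shape[1] < n_ctx:
        continue  # only use documents that fill the context window
    _, cache = model.run_with_cache(tokens, names_filter=hook)
    P += cache[hook][0]
    count += 1
    if count >= n_samples:
        break
P /= count

def positional_pattern(head: int, x_n: torch.Tensor, n: int) -> torch.Tensor:
    """Softmax over exp(x_n QK P_j) for j <= n, i.e. the positional pattern."""
    q = x_n @ model.W_Q[layer, head] + model.b_Q[layer, head]
    k = P[: n + 1] @ model.W_K[layer, head] + model.b_K[layer, head]
    return torch.softmax(k @ q / model.cfg.d_head**0.5, dim=-1)
```

The destination embedding $x_n$ can be taken from the post-ln1 embedding at position $n$ of any sample text; as noted below, the resulting pattern is largely insensitive to this choice.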

While positional patterns depend on $x_n$, they remain largely consistent in practice, except when attention heads attend to the <end-of-text> token, causing emphasis on the first sequence position. We handle this in practice by omitting the first couple of sequence positions from our softmax. This is a bit hacky, but the resulting kernels seem to be more consistent, and accurately represent the nature of the attention heads.

Visualization of Positional Patterns:

We identify three common types of positional patterns, shown below. The x-axis represents the key position, and the y-axis shows the query position. We take $x_n$ to be the embedding at the $n$-th position of a chapter from the Bible. As you can see, the patterns are quite consistent across different values of $n$, and for the local positional pattern you can observe the translation equivariance discussed earlier. Similar equivariance emerges for the slowly decaying positional pattern, but the context window required to demonstrate this is too large to show here.

For the rest of this post, when we reference specific heads, they belong to GPT2-Small.

Local positional pattern (Head 0.7)
Slow positional decay (Head 0.9)
Uniform positional pattern (Head 0.5)

Positional kernels of the first layer:

Local position patterns
Slow positional decay
Close to uniform

The observed translation equivariance and weak dependence on $x_n$ make it reasonable to talk about the positional kernel of an attention head, rather than a kernel that is a function of the embedding at position $n$.

Already there are interesting things we can learn from the positional kernels. For instance, in the IOI circuit work, and in subsequent work, Heads 0.1, 0.10, and 0.5 were all identified as duplicate token heads. However, the positional kernels make it clear that Head 0.1 and Head 0.10 will attend most to duplicates occurring locally, whereas Head 0.5 will attend to duplicates close to uniformly across the sequence.

Head 0.1 and Head 0.10 were the duplicate token heads most active in the IOI circuit, suggesting these heads are used for more grammatical tasks requiring local attention. Head 0.5, on the other hand, is perhaps used for detecting repeated tokens far back in the sequence, for example for use by later induction heads.

We show just the first-layer positional kernels not because later layers are particularly different, but because there are too many layers to show them all; the later layers' positional kernels all fall into the same basic categories.

Uses of different positional kernels:

Local positional pattern: Ideal for detecting n-grams in early layers. The equivariant pattern ensures n-grams obtain consistent representations regardless of position. Strong positional decay prevents interference from irrelevant parts of the sequence. Generally useful for “gluing together” adjacent position representations.

Slowly decaying positional pattern: Useful for producing local context summaries by averaging over the sequence. Since there are exponentially many possible sequences within a context window, these heads likely produce linear summaries rather than distinguishing specific sequences. Of course, heads with this kernel can also be used for other tasks, as Head 0.1 (a duplicate token head) shows.

Uniform positional pattern: Used by heads that summarize representations across the entire sequence without positional bias, such as duplicate token heads or induction heads. Also useful for global context processing.

Ruling out superposition:

It’s often hypothesised that attention heads within the same layer may be working in superposition with each other. If attention heads have dramatically different positional kernels, it seems we can immediately rule out superposition between them: it doesn’t feel coherent to talk about superposition between an attention head attending locally and one attending globally.

Contextual attention heads:

We now use the positional kernel to analyze a common type of head with a broad positional kernel and weak dependence on content, which we call a Contextual Attention Head.

Approximation of softmax denominator:

For conciseness, we refer to the content-dependent component at position $j$ by $E_j$, not to be confused with the token embedding of the token at position $j$.

From the softmax probability formula:

$$A_{nj} \;=\; \frac{\pi_n(j)\,\exp\!\big(x_n\,QK\,E_j\big)}{D_n}, \qquad D_n := \sum_{k \le n}\pi_n(k)\,\exp\!\big(x_n\,QK\,E_k\big)$$

This is difficult to analyze because the denominator $D_n$ involves the entire previous sequence.

However, within a fixed context, we can model the sequence as drawn from i.i.d. representations according to some distribution. While nearby representations will correlate, distant representations should have low correlation within a fixed context.

This is a key place where we make use of the assumption that the content terms $E_j$ are translation-invariant up to the static positional embedding. Without this assumption, the representations drawn from a fixed context can’t be modeled as identically distributed across different positions.

Under these assumptions:

$$\mathrm{Var}\big(D_n\big) \;=\; \Big(\sum_{k \le n}\pi_n(k)^2\Big)\,\mathrm{Var}_E\!\big(\exp(x_n\,QK\,E)\big)$$

Two key factors determine this variance:
$\sum_{k \le n}\pi_n(k)^2$: Measures the spread of the positional pattern. For uniform attention across $n$ tokens, it equals $1/n$.
$\mathrm{Var}_E\!\big(\exp(x_n\,QK\,E)\big)$: Quantifies variation in the content-dependent component.
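Spelling out the step: since the $E_k$ are modelled as independent within the context, the variance of the sum is the sum of the variances, and the positional weights $\pi_n(k)$ are constants for a fixed $x_n$, so

$$\mathrm{Var}\big(D_n\big) \;=\; \sum_{k \le n}\mathrm{Var}\Big(\pi_n(k)\,\exp\!\big(x_n\,QK\,E_k\big)\Big) \;=\; \sum_{k \le n}\pi_n(k)^2\,\mathrm{Var}_E\!\big(\exp(x_n\,QK\,E)\big).$$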

Heads with slow-decay / global positional patterns have small values of $\sum_{k \le n}\pi_n(k)^2$ because their attention is spread out. If they also have low content-dependent variance within a context, the softmax denominator will have low variance.

For a fixed $x_n$, this means the softmax denominator will concentrate around its expected value, effectively becoming a context-dependent constant.

If an attention head has low content-dependent variance across almost all contexts and values of $x_n$, and a broad positional pattern (as measured by $\sum_{k \le n}\pi_n(k)^2$), we call it a “contextual attention head.” Some have very small content-dependent components, appearing visually like fixed positional kernels averaging over previous positions, meaning that we can drop the content-dependent factor. Others are less well behaved, and for instance weigh keywords above stopwords, while still preserving an overall low content-dependent variance.

Contextual attention heads within the same layer that have similar positional kernels are natural candidates for attention head superposition. Within each fixed context, each of these heads is effectively computing a positionally weighted, weakly content-modulated, linear summary of the text. We can combine these linear summaries across contextual attention heads with similar positional kernels to form a large “contextual circuit.”

First layer contextual circuit:

Recall the positional kernels of the first-layer heads from above. Heads 0, 1, 2, 6, 8, 9, and 10 all have very similar positional kernels. Head 1 is a duplicate token head, meaning it has high content-dependent variance. The remaining heads empirically have low variance in their softmax denominators. In the first layer we can get a visual sense of the variance in the softmax denominator for a fixed $x_n$ by simply substituting the corresponding token into the residual stream at position $n$ and plotting the position-normalised softmax denominator.
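A sketch of one way to compute such denominator curves, reusing `model`, `P`, `layer`, `n_ctx`, and `hook` from the earlier snippet (normalising by the purely positional denominator is our reading of “position-normalised”, and the details of the original plot may differ):

```python
def denominator_curve(head: int, text: str, sub_token: str = " the") -> list[float]:
    """Compute sum_k pi_n(k) * exp(x_n QK E_k) at each position n, with the
    destination embedding x_n replaced by the embedding of `sub_token`.
    This equals the full softmax denominator divided by its positional part."""
    tokens = model.to_tokens(text)[:, :n_ctx]
    _, cache = model.run_with_cache(tokens, names_filter=hook)
    x = cache[hook][0]                        # actual post-ln1 embeddings
    E = x - P[: x.shape[0]]                   # content components E_j = x_j - P_j
    W_Q, b_Q = model.W_Q[layer, head], model.b_Q[layer, head]
    W_K, b_K = model.W_K[layer, head], model.b_K[layer, head]
    sub_id = model.to_single_token(sub_token)
    curve = []
    for n in range(2, x.shape[0]):            # skip the first couple of positions
        # Substitute `sub_token` into the residual stream at position n, then ln1.
        x_n = model.blocks[layer].ln1(model.W_E[sub_id] + model.W_pos[n])
        q = x_n @ W_Q + b_Q
        pos_scores = (P[: n + 1] @ W_K + b_K) @ q / model.cfg.d_head**0.5
        con_scores = (E[: n + 1] @ W_K) @ q / model.cfg.d_head**0.5
        pi = torch.softmax(pos_scores, dim=-1)
        curve.append((pi * torch.exp(con_scores)).sum().item())
    return curve
```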

Below is a plot of the softmax denominators for a diverse variety of input texts, with $x_n$ substituted with ′ the’ at each position $n$:

Of the heads with broad positional kernels and low content dependence, Head 6 and Head 8 are the worst behaved. They tend to emphasize words according to whether they are keywords or punctuation, meaning input texts with different keyword densities yield distinct softmax denominators. Even within the same input text, keyword density can fluctuate. Head 6 is still far better behaved than Heads 1, 3, 4, or 7, for instance. Head 9 is particularly nicely behaved, as its content-dependent component is close to uniform.

To analyse this contextual circuit we fix $x_n$ to be the embedding of ′ the’. Empirically, the particular choice of $x_n$ doesn’t affect which tokens are emphasized too significantly, so this should give us a good first approximation. Prior work has shown later layers often extract sentiment from attending to stopwords, so understanding how the contextual circuit works for these tokens is useful in its own right.

If we freeze the softmax denominators, and approximate each of Heads 0, 2, 6, 8, 9, and 10 as having an identical positional kernel $\pi$, then we can approximate the combined OV contributions of these heads to the $m$-th MLP neuron as $\sum_{j \le n}\pi(j)\,c_m(t_j)$, where the contributions $c_m(t_j)$ depend only on the token $t_j$ at position $j$. This approximation allows the transformer to perform Naive Bayes classification of the input text.
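One way to write this out explicitly (notation ours, not the post's: $D_h$ is head $h$'s frozen denominator, $Q^hK^h$ and $V^hO^h$ its QK and OV circuits, $W^{\mathrm{in}}_{:,m}$ the MLP input weights for neuron $m$, and $v(t)$ the post-ln1 embedding of token $t$ at a typical position; ln2, bias terms, and the position-dependence of the value vectors are ignored):

$$\sum_{h}\big(\text{OV output of head } h\big)\cdot W^{\mathrm{in}}_{:,m} \;\approx\; \sum_{j \le n}\pi(j)\,c_m(t_j), \qquad c_m(t) \;=\; \sum_{h \in \{0,2,6,8,9,10\}}\frac{\exp\!\big(x_n\,Q^hK^h\,E_t\big)}{D_h}\;v(t)\,V^hO^h\,W^{\mathrm{in}}_{:,m}.$$

Each token in the context then adds a fixed, position-weighted score $c_m(t)$ to the neuron's pre-activation, which is what gives the Naive Bayes flavour.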

There is a wide variance in the softmax denominators of these heads across contexts, as you can see in the table above. But if we restrict ourselves to prose (as opposed to code), and suppose we have an approximately average keyword density, the percentage error should be at most 25% even for the worst behaved heads like Head 6 and Head 8, and significantly better for the other heads. We expect input texts drawn from the same context to have a range of softmax denominators, so contextual neurons should already be robust to moderate variations in the softmax denominator. We should therefore expect to find a wide variety of contextual neurons under this frozen circuit.

Indeed, we are able to find many interesting neurons with significant contextual components:

  • Spanish Text

  • Political Conspiracy (Neuron 2990)

  • Astronomy (Neuron 508)

  • Commonwealth English vs American English (Neuron 704, Positive token contributions are British spellings and references, and negative token contributions are American spellings and references. Note top dataset activations are just spamming the “£” symbol because this strongly distinguishes British from American, but this neuron does not refer to “countries, regions, and large numerals for financial amounts” )

  • Medieval text

  • Cooking

  • Bracket Matching (Neuron 1121,+1 contribution for tokens containing ‘(’, −1 contribution for tokens containing ‘)’)

And many more contextual neurons.

Metric for spread of positional pattern:

The above analysis naturally suggests $\sum_{k}\pi(k)^2$ as a metric for the spread of positional patterns. Although, as previously mentioned, later-layer attention heads often turn themselves off by attending to <end-of-text>, so we should exclude the first few positions and take the softmax over the remaining positions when computing this metric.

We define the Effective Token Count (ETC) of a positional kernel $\pi$ to be $1 / \sum_{k}\pi(k)^2$. If the positional kernel attends uniformly to $m$ tokens, it will have an ETC of $m$, giving us a natural interpretation of this definition.
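A minimal sketch of the computation, reusing `positional_pattern` from earlier (the number of skipped initial positions is our assumption):

```python
def effective_token_count(pi: torch.Tensor, skip: int = 2) -> float:
    """ETC = 1 / sum_k pi(k)^2, after dropping the first `skip` positions
    (to ignore attention parked on <end-of-text>) and renormalising."""
    pi = pi[skip:]
    pi = pi / pi.sum()
    return (1.0 / (pi**2).sum()).item()

# Sanity check: a kernel uniform over 100 positions has ETC = 100.
assert abs(effective_token_count(torch.full((102,), 1 / 102)) - 100.0) < 1e-3
```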

Now we expect local positional patterns to have a low ETC, as they attend to just the previous few tokens. A slow positional decay will have a higher ETC, and a uniform positional kernel will have an ETC close to the full context length.

Language models aren’t very creative with their positional kernels, so the ETC gives a good summary of the type of positional kernel at an attention head.

Reducing the spread to a single summary statistic allows us to produce a single graph giving an idea of the positional kernels across all heads and layers of a language model.

Below are the heatmaps of ETC across all layers and heads of the GPT-2 models and TinyStories, with the colour scale chosen for visualization purposes. Orange-yellow corresponds to uniform positional patterns, pink-purple corresponds to slow positional decay, and blue corresponds to local positional patterns.

GPT2-Small (n=512)
GPT-2 Medium (n=256)
GPT-2 Large (n=128)
GPT-2 XL (n=256)
TinyStories-28M

For reference, compare the first column of the GPT2-Small heatmap with the plots of the positional patterns shown earlier. Heads 3, 4, and 7 are in blue because they have local positional patterns. Heads 0, 1, 2, 6, 8, 9, and 10 are in magenta as they have a slow positional decay. And Heads 5 and 11 are yellow because they are close to uniform.

We can visually observe interesting things about the layers of GPT2-Small. For instance, notice the density of local positional patterns in the third and fourth layers. Potentially this is from the model extracting local grammatical structure from the text.

On the other hand, the second layer has more slow-decay / uniform positional patterns. In fact, on closer inspection, the second layer has many attention heads which act purely as fixed positional kernels, falling into the category of “contextual attention heads” discussed earlier. This suggests the model builds an initial linear summary of the surrounding text, and then begins to build more symbolic representations in the third and fourth layers. We observe a similar pattern in GPT2-Medium, and to some extent in GPT2-XL.

Layer 5 is known for its induction heads: Heads 5.0, 5.1, and 5.5 have all been identified as induction heads. These stick out visually as having close to uniform positional patterns, which validates the intuition that induction heads tend not to care about position.

The fact that there are so many local positional patterns in layers 2-4 gives a potential explanation for the small number of interesting specialized heads found in these layers. Attention heads with uniform positional kernels, like induction heads, feel more likely to be selected for “interesting behaviour” than heads which attend only locally.

Conclusion:

It seems like positional kernels are a useful notion to look at when first assessing attention heads, and they suggest many different lines of inquiry. One interesting piece of future work could be looking at how these positional kernels develop over the course of training.

However, the assumption made at the start of the post has not been validated, and it’d be important to look at this in future work.

This Google Colab contains the code required to reproduce the results found here.

Appendix:

Alternative positional schemes:

There are many different positional schemes used for different LLMs, but the three most common alternatives to additive positional embeddings are:

  • T5: Adds a learnt relative bias to the pre-softmax attention scores based on the relative distance between the positions of the current token and the token being attended to.

  • ALiBi: Similar to T5, adds a linear bias to the pre-softmax attention scores proportional to the relative distance between the current token and the token being attended to. The slope of this bias is hard-coded, not learnt.

  • RoPE: Applies a complex rotation of $e^{i m\theta_k}$ (for a token at position $m$) to the $2k$-th and $(2k+1)$-th dimensions of the query and key vectors, up to some maximum rotary index. The frequency $\theta_k$ varies depending on implementation: some implementations take $\theta_k = 10000^{-2k/d}$, where $d$ is the dimension of the query and key vectors. The most common models which use RoPE in interpretability work are the Pythia line of models, which rotate only the first 25% of the query and key dimensions for efficiency purposes.

T5 and ALiBi are pretty straightforward to handle under this framework. The relative bias terms can simply be added to the $E_nQKP_j + P_nQKP_j$ terms before taking the softmax to compute the positional kernel. We can’t assume that $P_j$ is constant across positions, because models are still able to develop emergent positional embeddings, as we see in NoPE models, which they may use to overcome the limitations of these positional schemes.
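Concretely, for a head with relative bias $b(n-j)$ (a learnt, bucketed bias for T5; $-s\,(n-j)$ for ALiBi with hard-coded slope $s$), the positional pattern simply becomes

$$\pi_n(j) \;=\; \frac{\exp\!\big(x_n\,QK\,P_j + b(n-j)\big)}{\sum_{k \le n}\exp\!\big(x_n\,QK\,P_k + b(n-k)\big)},$$

with the content-dependent factor left unchanged.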

RoPE is more difficult to work with. The main trouble with analyzing RoPE is that while it allows for relative positional indexing, the rotations of queries and keys can potentially disrupt semantic attention (as noted in Round and Round We Go). Round and Round We Go showed that RoPE models in which all dimensions are rotary seem to use the low-frequency dimensions to create a “semantic channel”, via which models can learn content-content interactions while minimising disruption from query and key rotations. However, this semantic channel is not robust, because the low-frequency rotations still disrupt content-content interactions over a sufficiently long context window.

Round and Round We Go suggested p-RoPE as a solution to this problem, where only a fraction $p$ of the highest-frequency query and key dimensions are rotated, with the remaining dimensions left without rotation. They found that with $p = 0.75$, models perform better on long-context tasks than with full RoPE ($p = 1$). Interestingly, they found that $p = 0.25$, corresponding to Pythia, performs worse. The analysis we give here is based on empirical results from Pythia models, but the ideas should apply to p-RoPE models.

The motivation for studying p-RoPE models, then, is that these models can perform better than vanilla RoPE models, and they have concrete semantic channels given by the non-rotating dimensions, allowing us to attempt a separation of content and position.

One might hope that p-RoPE models would use the non-rotating dimensions exclusively as semantic channels, and it seems like they do. But they also use the rotating key dimensions in non-trivial ways that seem like they would require a project of their own to understand.

For now, note simply that if the key bias dominated the content-dependent keys in the rotating dimensions, making the rotating key dimensions close to constant, we could recover a positional kernel. Some attention heads seem to do this, but it’s not that clean.
