Positional kernels of attention heads

Introduction:

In this post, we present a decomposition of attention patterns into position-dependent and content-dependent components, under the assumption that the representation of identical content is approximately translation invariant up to a static layer-specific positional embedding. While this assumption cannot hold exactly due to layer normalization and other non-linearities, we find it useful for analysis. It can be viewed as a logical consequence of the Linear Representation Hypothesis.


Under this assumption, we define a “positional kernel” for attention heads, representing how much attention each position receives, independent of the content at that position. Positional kernels can be estimated simply by averaging over 500 samples from OpenWebText, and the kernels we obtain align remarkably well with empirical attention patterns, even in later layers.

The positional kernel lies in the background of content-dependent operations. For instance, you can equally imagine an induction head that attends uniformly to copies of “a b” across the sequence from a query token “a”, and an induction head which only attends to occurrences of “a b” within the previous 20 tokens. The positional kernel would be uniform in the first case, and concentrated on the previous 20 tokens in the second.

Our analysis reveals that positional kernels fall into three distinct categories: local attention with sharp decay, slowly decaying attention, and uniform attention across the context window. These patterns are remarkably consistent across different inputs and exhibit translation equivariance, suggesting they serve as fundamental computational primitives in transformer architectures.

Using the positional kernel, we identify attention heads with broad positional kernels and weak content dependence, which we call “contextual attention heads.” Under an independence assumption on the input distribution—which works well in practice—we can use the wide spread of the positional kernel to bound the variance of these heads’ softmax denominators. This allows us to identify first-layer neurons which respond to specific contexts from the training data in a primarily weights-based manner, without having to run models over a text corpus. Some of these contextual neurons arise from multiple contextual attention heads working together, using a shared positional kernel to act in “superposition” with each other.

Informed by our analysis of contextual attention heads, we introduce a measure of the spread of positional kernels called the Effective Token Count (ETC). By plotting the ETC of attention heads across all layers of models, we observe a clear architectural pattern across multiple models (GPT-2 Small up to XL): early layers use broad positional patterns to create initial linear summaries of the text before employing local positional kernels to build up n-grams / phrases, and later layers transition towards uniform positional kernels.

In the main body, we focus exclusively on models with additive positional embeddings. There is a wide variety of different positional embedding schemes used in LLMs today, each with subtle variations. In the appendix, we show that our analysis straightforwardly applies to T5 Bias and ALiBi positional encoding schemes, but we discuss some difficulties with extending our approach to RoPE models. Our analysis is high-level enough that analogues of the types of heads discussed here occur in all alternative positional schemes we investigated.

Attention decomposition:

We write the post-ln1 embedding of the input at position $i$ in layer $\ell$ as:

$$x_i \;=\; P_i + E_i, \qquad P_i := \mathbb{E}\,[x_i],$$

where $P_i$ is a static positional embedding, and $E_i$ is the content-dependent remainder. The expected value is taken over a large text distribution, such as OpenWebText. (We suppress the layer index below when it is clear from context.)

The decomposition is always possible, but without the above assumption on translation-invariance of content representations it is not intuitive to interpret.

Decomposition of Attention Scores:

For a particular attention head $h$, consider an input sequence $x_1, \dots, x_n$, where $n$ is the current destination position. For any position $j \le n$, the attention score $a^h_{nj}$ measures the weight that position $n$ places on position $j$:

$$a^h_{nj} \;=\; x_n\,QK\,x_j \;=\; (E_n + P_n)\,QK\,(E_j + P_j)$$

Where concatenation of letters denotes matrix multiplication, and $Q$, $K$ are the query and key matrices of head $h$.
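Expanding this bilinear form gives four terms; grouping the two that involve $P_j$ separately from the two that involve $E_j$ is exactly the decomposition used in the next subsection:

$$a^h_{nj} \;=\; \underbrace{E_n QK E_j + P_n QK E_j}_{\text{content-dependent in } j} \;+\; \underbrace{E_n QK P_j + P_n QK P_j}_{\text{depends only on position } j}$$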

Decomposition of Attention Probabilities:

The exponentiated attention score decomposes into two independent components:

$$\exp\!\big(a^h_{nj}\big) \;=\; \exp\!\big(x_n\,QK\,P_j\big)\cdot\exp\!\big(x_n\,QK\,E_j\big)$$

Where:
$\exp(x_n\,QK\,P_j)$ depends solely on the input embedding of the current position $n$ and the position $j$, and

$\exp(x_n\,QK\,E_j)$ depends on the input embedding at position $n$ and at position $j$.

Positional Patterns:

We define the positional pattern as:

$$\pi_n(j) \;=\; \frac{\exp\!\big(x_n\,QK\,P_j\big)}{\sum_{k \le n}\exp\!\big(x_n\,QK\,P_k\big)}$$

This represents how much attention each position receives based solely on its position, independent of content.

The final softmax probabilities are:

$$A_{nj} \;=\; \frac{\pi_n(j)\,\exp\!\big(x_n\,QK\,E_j\big)}{\sum_{k \le n}\pi_n(k)\,\exp\!\big(x_n\,QK\,E_k\big)}$$

Computing positional patterns:

We compute $P_j$ for each position $j$ by sampling sequences from OpenWebText and averaging post-ln1 embeddings. About 500 samples are sufficient for stable patterns.
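A minimal sketch of this estimation (not the exact code from the Colab; it assumes TransformerLens, a small OpenWebText mirror on the Hugging Face Hub, and GPT2-Small hook names):

```python
import torch
from datasets import load_dataset
from transformer_lens import HookedTransformer

torch.set_grad_enabled(False)
model = HookedTransformer.from_pretrained("gpt2", device="cpu")  # GPT2-Small
layer, n_ctx, n_samples = 0, 512, 500
hook = f"blocks.{layer}.ln1.hook_normalized"

# Estimate the static positional component P_j = E[x_j] by averaging
# post-ln1 embeddings over ~500 full-length OpenWebText documents.
ds = load_dataset("stas/openwebtext-10k", split="train")  # assumed mirror
P, count = torch.zeros(n_ctx, model.cfg.d_model), 0
for text in ds["text"]:
    tokens = model.to_tokens(text)[:, :n_ctx]
    if tokens.shape[1] < n_ctx:
        continue  # only use documents that fill the context window
    _, cache = model.run_with_cache(tokens, names_filter=hook)
    P += cache[hook][0]
    count += 1
    if count >= n_samples:
        break
P /= count

def positional_pattern(head: int, x_n: torch.Tensor, n: int) -> torch.Tensor:
    """Softmax over exp(x_n QK P_j) for j <= n, i.e. the positional pattern."""
    q = x_n @ model.W_Q[layer, head] + model.b_Q[layer, head]
    k = P[: n + 1] @ model.W_K[layer, head] + model.b_K[layer, head]
    return torch.softmax(k @ q / model.cfg.d_head**0.5, dim=-1)
```

The destination embedding $x_n$ can be taken from the post-ln1 embedding at position $n$ of any sample text; as noted below, the resulting pattern is largely insensitive to this choice.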

While positional patterns depend on $x_n$, they remain largely consistent in practice, except when attention heads attend to the <end-of-text> token, causing emphasis on the first sequence position. We handle this in practice by omitting the first couple of sequence positions from our softmax. This is a bit hacky, but the resulting kernels seem to be more consistent, and accurately represent the nature of the attention heads.

Visualization of Positional Patterns:

We identify three common types of positional patterns, shown below. The x-axis represents the key position, and the y-axis shows the query position. We take $x_n$ to be the embedding at the $n$-th position of a chapter from the Bible. As you can see, the patterns are quite consistent across different values of $n$, and for the local positional pattern you can observe the translation equivariance discussed earlier. Similar equivariance emerges for the slowly decaying positional pattern, but the context window required to demonstrate this is too large to show here.

For the rest of this post, when we reference specific heads, they belong to GPT2-Small.

Local positional pattern (Head 0.7)
Slow positional decay (Head 0.9)
Uniform positional pattern (Head 0.5)

Positional kernels of the first layer:

Local position patterns
Slow positional decay
Close to uniform

The observed translation equivariance and weak dependence on $x_n$ make it reasonable to talk about the positional kernel of an attention head, rather than a kernel that is a function of the embedding at position $n$.

Already there are interesting things we can learn from the positional kernels. For instance, in the IOI circuit work, and in subsequent work, Heads 0.1, 0.10, and 0.5 were all identified as duplicate token heads. However, the positional kernels make it clear that Head 0.1 and Head 0.10 will attend most to duplicates occurring locally, whereas Head 0.5 will attend to duplicates close to uniformly across the sequence.

Head 0.1 and Head 0.10 were the duplicate token heads most active in the IOI circuit, suggesting these heads are used for more grammatical tasks requiring local attention. Head 0.5, on the other hand, is perhaps used for detecting repeated tokens far back in the sequence, for example for use by later induction heads.

We show just the first-layer positional kernels not because later layers are particularly different, but because there are too many layers to show them all; the later layers' positional kernels all fall into the same basic categories.

Uses of different positional kernels:

Local positional pattern: Ideal for detecting n-grams in early layers. The equivariant pattern ensures n-grams obtain consistent representations regardless of position. Strong positional decay prevents interference from irrelevant parts of the sequence. Generally useful for “gluing together” adjacent position representations.

Slowly decaying positional pattern: Useful for producing local context summaries by averaging over the sequence. Since there are exponentially many possible sequences within a context window, these heads likely produce linear summaries rather than distinguishing specific sequences. Of course, heads with this kernel can also be used for other tasks, as Head 0.1 (a duplicate token head) shows.

Uniform positional pattern: Used by heads that summarize representations across the entire sequence without positional bias, such as duplicate token heads or induction heads. Also useful for global context processing.

Ruling out superposition:

It’s often hypothesised that attention heads within the same layer may be working in superposition with each other. If attention heads have dramatically different positional kernels, it seems we can immediately rule out superposition between them: it doesn’t feel coherent to talk about superposition between an attention head attending locally and one attending globally.

Contextual attention heads:

We now use the positional kernel to analyze a common type of head with a broad positional kernel and weak dependence on content, which we call a Contextual Attention Head.

Approximation of softmax denominator:

For conciseness, we refer to the content-dependent component at position $j$ by $E_j$, not to be confused with the token embedding of the token at position $j$.

From the softmax probability formula:

$$A_{nj} \;=\; \frac{\pi_n(j)\,\exp\!\big(x_n\,QK\,E_j\big)}{D_n}, \qquad D_n := \sum_{k \le n}\pi_n(k)\,\exp\!\big(x_n\,QK\,E_k\big)$$

This is difficult to analyze because the denominator $D_n$ involves the entire previous sequence.

However, within a fixed context, we can model the sequence as drawn from i.i.d. representations according to some distribution. While nearby representations will correlate, distant representations should have low correlation within a fixed context.

This is a key place where we make use of the assumption that the content terms $E_j$ are translation-invariant up to the static positional embedding. Without this assumption, the representations drawn from a fixed context can’t be modeled as identically distributed across different positions.

Under these assumptions:

$$\mathrm{Var}\big(D_n\big) \;=\; \Big(\sum_{k \le n}\pi_n(k)^2\Big)\,\mathrm{Var}_E\!\big(\exp(x_n\,QK\,E)\big)$$

Two key factors determine this variance:
$\sum_{k \le n}\pi_n(k)^2$: Measures the spread of the positional pattern. For uniform attention across $n$ tokens, it equals $1/n$.
$\mathrm{Var}_E\!\big(\exp(x_n\,QK\,E)\big)$: Quantifies variation in the content-dependent component.
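Spelling out the step: since the $E_k$ are modelled as independent within the context, the variance of the sum is the sum of the variances, and the positional weights $\pi_n(k)$ are constants for a fixed $x_n$, so

$$\mathrm{Var}\big(D_n\big) \;=\; \sum_{k \le n}\mathrm{Var}\Big(\pi_n(k)\,\exp\!\big(x_n\,QK\,E_k\big)\Big) \;=\; \sum_{k \le n}\pi_n(k)^2\,\mathrm{Var}_E\!\big(\exp(x_n\,QK\,E)\big).$$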

Heads with slow-decay / global positional patterns have small values of $\sum_{k \le n}\pi_n(k)^2$ because their attention is spread out. If they also have low content-dependent variance within a context, the softmax denominator will have low variance.

For a fixed $x_n$, this means the softmax denominator will concentrate around its expected value, effectively becoming a context-dependent constant.

If an attention head has low content-dependent variance across almost all contexts and values of $x_n$, and a broad positional pattern (as measured by $\sum_{k \le n}\pi_n(k)^2$), we call it a “contextual attention head.” Some have very small content-dependent components, appearing visually like fixed positional kernels averaging over previous positions, meaning that we can drop the content-dependent factor. Others are less well behaved, and for instance weigh keywords above stopwords, while still preserving an overall low content-dependent variance.

Contextual attention heads within the same layer that have similar positional kernels are natural candidates for attention head superposition. Within each fixed context, each of these heads is effectively computing a positionally weighted, weakly content-modulated, linear summary of the text. We can combine these linear summaries across contextual attention heads with similar positional kernels to form a large “contextual circuit.”

First layer contextual circuit:

Recall the positional kernels of the first-layer heads from above. Heads 0, 1, 2, 6, 8, 9, and 10 all have very similar positional kernels. Head 1 is a duplicate token head, meaning it has high content-dependent variance. The remaining heads empirically have low variance in their softmax denominators. In the first layer we can get a visual sense of the variance in the softmax denominator for a fixed $x_n$ by simply substituting the corresponding token into the residual stream at position $n$ and plotting the position-normalised softmax denominator.
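A sketch of one way to compute such denominator curves, reusing `model`, `P`, `layer`, `n_ctx`, and `hook` from the earlier snippet (normalising by the purely positional denominator is our reading of “position-normalised”, and the details of the original plot may differ):

```python
def denominator_curve(head: int, text: str, sub_token: str = " the") -> list[float]:
    """Compute sum_k pi_n(k) * exp(x_n QK E_k) at each position n, with the
    destination embedding x_n replaced by the embedding of `sub_token`.
    This equals the full softmax denominator divided by its positional part."""
    tokens = model.to_tokens(text)[:, :n_ctx]
    _, cache = model.run_with_cache(tokens, names_filter=hook)
    x = cache[hook][0]                        # actual post-ln1 embeddings
    E = x - P[: x.shape[0]]                   # content components E_j = x_j - P_j
    W_Q, b_Q = model.W_Q[layer, head], model.b_Q[layer, head]
    W_K, b_K = model.W_K[layer, head], model.b_K[layer, head]
    sub_id = model.to_single_token(sub_token)
    curve = []
    for n in range(2, x.shape[0]):            # skip the first couple of positions
        # Substitute `sub_token` into the residual stream at position n, then ln1.
        x_n = model.blocks[layer].ln1(model.W_E[sub_id] + model.W_pos[n])
        q = x_n @ W_Q + b_Q
        pos_scores = (P[: n + 1] @ W_K + b_K) @ q / model.cfg.d_head**0.5
        con_scores = (E[: n + 1] @ W_K) @ q / model.cfg.d_head**0.5
        pi = torch.softmax(pos_scores, dim=-1)
        curve.append((pi * torch.exp(con_scores)).sum().item())
    return curve
```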

Below is a plot of the softmax denominators for a diverse variety of input texts, with $x_n$ substituted with ′ the’ at each position $n$:

Of the heads with broad positional kernels and low content dependence, Head 6 and Head 8 are the worst behaved. They tend to emphasize words according to whether they are keywords or punctuation, meaning input texts with different keyword densities yield distinct softmax denominators. Even within the same input text, keyword density can fluctuate. Head 6 is still far better behaved than Heads 1, 3, 4, or 7, for instance. Head 9 is particularly nicely behaved, as its content-dependent component is close to uniform.

To analyse this contextual circuit we fix $x_n$ to be the embedding of ′ the’. Empirically, the particular choice of $x_n$ doesn’t affect which tokens are emphasized too significantly, so this should give us a good first approximation. Prior work has shown later layers often extract sentiment from attending to stopwords, so understanding how the contextual circuit works for these tokens is useful in its own right.

If we freeze the softmax denominators, and approximate each of Heads 0, 2, 6, 8, 9, and 10 as having an identical positional kernel $\pi$, then we can approximate the combined OV contributions of these heads to the $m$-th MLP neuron as $\sum_{j \le n}\pi(j)\,c_m(t_j)$, where the contributions $c_m(t_j)$ depend only on the token $t_j$ at position $j$. This approximation allows the transformer to perform Naive Bayes classification of the input text.
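One way to write this out explicitly (notation ours, not the post's: $D_h$ is head $h$'s frozen denominator, $Q^hK^h$ and $V^hO^h$ its QK and OV circuits, $W^{\mathrm{in}}_{:,m}$ the MLP input weights for neuron $m$, and $v(t)$ the post-ln1 embedding of token $t$ at a typical position; ln2, bias terms, and the position-dependence of the value vectors are ignored):

$$\sum_{h}\big(\text{OV output of head } h\big)\cdot W^{\mathrm{in}}_{:,m} \;\approx\; \sum_{j \le n}\pi(j)\,c_m(t_j), \qquad c_m(t) \;=\; \sum_{h \in \{0,2,6,8,9,10\}}\frac{\exp\!\big(x_n\,Q^hK^h\,E_t\big)}{D_h}\;v(t)\,V^hO^h\,W^{\mathrm{in}}_{:,m}.$$

Each token in the context then adds a fixed, position-weighted score $c_m(t)$ to the neuron's pre-activation, which is what gives the Naive Bayes flavour.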

There is a wide variance in the softmax denominators of these heads across contexts, as you can see in the table above. But if we restrict ourselves to prose (as opposed to code), and suppose we have an approximately average keyword density, the percentage error should be at most 25% even for the worst behaved heads like Head 6 and Head 8, and significantly better for the other heads. We expect input texts drawn from the same context to have a range of softmax denominators, so contextual neurons should already be robust to moderate variations in the softmax denominator. We should therefore expect to find a wide variety of contextual neurons under this frozen circuit.

Indeed, we are able to find many interesting neurons with significant contextual components:

  • Spanish Text

  • Political Conspiracy (Neuron 2990)

  • Astronomy (Neuron 508)

  • Commonwealth English vs American English (Neuron 704, Positive token contributions are British spellings and references, and negative token contributions are American spellings and references. Note top dataset activations are just spamming the “£” symbol because this strongly distinguishes British from American, but this neuron does not refer to “countries, regions, and large numerals for financial amounts” )

  • Medieval text

  • Cooking

  • Bracket Matching (Neuron 1121,+1 contribution for tokens containing ‘(’, −1 contribution for tokens containing ‘)’)

And many more contextual neurons.

Metric for spread of positional pattern:

The above analysis naturally suggests $\sum_{k}\pi(k)^2$ as a metric for the spread of positional patterns. Although, as previously mentioned, later-layer attention heads often turn themselves off by attending to <end-of-text>, so we should exclude the first few positions and take the softmax over the remaining positions when computing this metric.

We define the Effective Token Count (ETC) of a positional kernel $\pi$ to be $1 / \sum_{k}\pi(k)^2$. If the positional kernel attends uniformly to $m$ tokens, it will have an ETC of $m$, giving us a natural interpretation of this definition.
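A minimal sketch of the computation, reusing `positional_pattern` from earlier (the number of skipped initial positions is our assumption):

```python
def effective_token_count(pi: torch.Tensor, skip: int = 2) -> float:
    """ETC = 1 / sum_k pi(k)^2, after dropping the first `skip` positions
    (to ignore attention parked on <end-of-text>) and renormalising."""
    pi = pi[skip:]
    pi = pi / pi.sum()
    return (1.0 / (pi**2).sum()).item()

# Sanity check: a kernel uniform over 100 positions has ETC = 100.
assert abs(effective_token_count(torch.full((102,), 1 / 102)) - 100.0) < 1e-3
```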

Now we expect local positional patterns to have a low ETC, as they attend to just the previous few tokens. A slow positional decay will have a higher ETC, and a uniform positional kernel will have an ETC close to the full context length.

Language models aren’t very creative with their positional kernels, so the ETC gives a good summary of the type of positional kernel at an attention head.

Reducing the spread to a single summary statistic allows us to produce a single graph giving an idea of the positional kernels across all heads and layers of a language model.

Below are the heatmaps of ETC across all layers and heads of the GPT-2 models and TinyStories, with the colour scale chosen for visualization purposes. Orange-yellow corresponds to uniform positional patterns, pink-purple corresponds to slow positional decay, and blue corresponds to local positional patterns.

GPT2-Small (n=512)
GPT-2 Medium (n=256)
GPT-2 Large (n=128)
GPT-2 XL (n=256)
TinyStories-28M

For reference, compare the first column of the GPT2-Small heatmap with the plots of the positional patterns shown earlier. Heads 3, 4, and 7 are in blue because they have local positional patterns. Heads 0, 1, 2, 6, 8, 9, and 10 are in magenta as they have a slow positional decay. And Heads 5 and 11 are yellow because they are close to uniform.

We can visually observe interesting things about the layers of GPT2-Small. For instance, notice the density of local positional patterns in the third and fourth layers. Potentially this is from the model extracting local grammatical structure from the text.

On the other hand, the second layer has more slow-decay / uniform positional patterns. In fact, on closer inspection, the second layer has many attention heads which act purely as fixed positional kernels, falling into the category of “contextual attention heads” discussed earlier. This suggests the model builds an initial linear summary of the surrounding text, and then begins to build more symbolic representations in the third and fourth layers. We observe a similar pattern in GPT2-Medium, and to some extent in GPT2-XL.

Layer 5 is known for its induction heads: Heads 5.0, 5.1, and 5.5 have all been identified as induction heads. These stick out visually as having close to uniform positional patterns, which validates the intuition that induction heads tend not to care about position.

The fact that there are so many local positional patterns in layers 2-4 gives a potential explanation for the small number of interesting specialized heads found in these layers. Attention heads with uniform positional kernels, like induction heads, feel more likely to be selected for “interesting behaviour” than heads which attend only locally.

Conclusion:

It seems like positional kernels are a useful notion to look at when first assessing attention heads, and they suggest many different lines of inquiry. One interesting piece of future work could be looking at how these positional kernels develop over the course of training.

However, the assumption made at the start of the post has not been validated, and it’d be important to look at this in future work.

This Google Colab contains the code required to reproduce the results found here.

Appendix:

Alternative positional schemes:

There are many different positional schemes used for different LLMs, but the three most common alternatives to additive positional embeddings are:

  • T5: Adds a learnt relative bias to the pre-softmax attention scores based on the relative distance between the positions of the current token and the token being attended to.

  • ALiBi: Similar to T5, adds a linear bias to the pre-softmax attention scores proportional to the relative distance between the current token and the token being attended to. The slope of this bias is hard-coded, not learnt.

  • RoPE: Applies a complex rotation of $e^{i m\theta_k}$ (for a token at position $m$) to the $2k$-th and $(2k+1)$-th dimensions of the query and key vectors, up to some maximum rotary index. The frequency $\theta_k$ varies depending on implementation: some implementations take $\theta_k = 10000^{-2k/d}$, where $d$ is the dimension of the query and key vectors. The most common models which use RoPE in interpretability work are the Pythia line of models, which rotate only the first 25% of the query and key dimensions for efficiency purposes.

T5 and ALiBi are pretty straightforward to handle under this framework. The relative bias terms can simply be added to the $E_nQKP_j + P_nQKP_j$ terms before taking the softmax to compute the positional kernel. We can’t assume that $P_j$ is constant across positions, because models are still able to develop emergent positional embeddings, as we see in NoPE models, which they may use to overcome the limitations of these positional schemes.
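Concretely, for a head with relative bias $b(n-j)$ (a learnt, bucketed bias for T5; $-s\,(n-j)$ for ALiBi with hard-coded slope $s$), the positional pattern simply becomes

$$\pi_n(j) \;=\; \frac{\exp\!\big(x_n\,QK\,P_j + b(n-j)\big)}{\sum_{k \le n}\exp\!\big(x_n\,QK\,P_k + b(n-k)\big)},$$

with the content-dependent factor left unchanged.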

RoPE is more difficult to work with. The main trouble with analyzing RoPE is that while it allows for relative positional indexing, the rotations of queries and keys can potentially disrupt semantic attention (as noted in Round and Round We Go). Round and Round We Go showed that RoPE models in which all dimensions are rotary seem to use the low-frequency dimensions to create a “semantic channel”, via which models can learn content-content interactions while minimising disruption from query and key rotations. However, this semantic channel is not robust, because the low-frequency rotations still disrupt content-content interactions over a sufficiently long context window.

Round and Round We Go suggested p-RoPE as a solution to this problem, where only a fraction $p$ of the highest-frequency query and key dimensions are rotated, with the remaining dimensions left without rotation. They found that with $p = 0.75$, models perform better on long-context tasks than with full RoPE ($p = 1$). Interestingly, they found that $p = 0.25$, corresponding to Pythia, performs worse. The analysis we give here is based on empirical results from Pythia models, but the ideas should apply to p-RoPE models.

The motivation for studying p-RoPE models, then, is that these models can perform better than vanilla RoPE models, and they have concrete semantic channels given by the non-rotating dimensions, allowing us to attempt a separation of content and position.

One might hope that p-RoPE models would use the non-rotating dimensions exclusively as semantic channels, and it seems like they do. But they also use the rotating key dimensions in non-trivial ways that seem like they would require a project of their own to understand.

For now, note simply that if the key bias dominated the content-dependent keys in the rotating dimensions, making the rotating key dimensions close to constant, we could recover a positional kernel. Some attention heads seem to do this, but it’s not that clean.
