How transformers can compute distances along a curve locally.

Overview:

There is an interesting mechanism that GPT-2 appears to use to measure distances between duplicate tokens. The mechanism reminds me a lot of twisted-pair cabling in communications.

The mechanism is fiddly to explain in context, so I’ve tried to abstract away most of the details and give a clean toy version of it. I think some of the structures behind this mechanism could be used in contexts other than computing distances.

Setup:

We have the set of points $\{f(1), f(2), \ldots, f(n)\} \subset \mathbb{R}^d$, where $f : [0, n] \to \mathbb{R}^d$ is a smooth function. We take as input a pair $(f(a), f(b))$, with $a, b \in \{1, \ldots, n\}$. We want to construct a transformer which can estimate $b - a$ given only $f(a)$ and $f(b)$. For this transformer, we additionally assume that $n$ is relatively small compared to the embedding dimension $d$, so that the curve spans only a low-dimensional subspace of the residual stream.
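To make the setup concrete, here is a minimal sketch of a toy instance (the particular curve, $n = 50$, and $d = 16$ are my own illustrative choices, not anything taken from GPT-2):

```python
import numpy as np

# A toy instance of the setup: a smooth, constant-norm, non-self-intersecting
# curve living in the first 4 of d = 16 embedding dimensions, with "token"
# points at the integer parameters 1..n.
n, d = 50, 16

def f(t):
    t = np.asarray(t, dtype=float)
    comps = [np.cos(2 * np.pi * t / n), np.sin(2 * np.pi * t / n),
             np.cos(np.pi * t / n),     np.sin(np.pi * t / n)]
    return np.stack(comps + [np.zeros_like(t)] * (d - 4), axis=-1) / np.sqrt(2)

# The task: given only the two embeddings f(a) and f(b), estimate b - a.
a, b = 17, 21
print(f(a).shape, f(b).shape, b - a)   # (16,) (16,) 4
```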

The mechanism:

We set up $M + 1$ “sentry points” $f(s_0), f(s_1), \ldots, f(s_M)$ uniformly along the curve, so that consecutive sentries are a parameter distance of $\frac{n}{M+1}$ apart. And we define $\sigma(t)$ as sending $t$ to the index of the closest sentry point (equivalently, since $f$ does not come close to intersecting itself, the index of the sentry point closest to $f(t)$ in embedding space).

Then we have,

$$t = s_{\sigma(t)} + \epsilon(t),$$

where $\epsilon(t) := t - s_{\sigma(t)}$ is such that $|\epsilon(t)| \le \frac{n}{2(M+1)}$ for all $t$.

So, Taylor expanding $f$ around the nearby sentry, $f(t) \approx f(s_{\sigma(t)}) + \epsilon(t)\, f'(s_{\sigma(t)})$, and hence for two points $a, b$ lying near the same sentry $s_j$ we have $f(b) - f(a) \approx (b - a)\, f'(s_j)$.

Therefore if we can identify the relevant sentry index $j$ and approximate $(f(b) - f(a)) \cdot \frac{f'(s_j)}{\lVert f'(s_j) \rVert^2}$, then we can approximate $b - a$.
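A quick numerical check of this decomposition and of the local tangent estimate, under the same toy assumptions as above (the curve, the placement of sentries at the centres of their regions, and the finite-difference tangents are all my own choices):

```python
import numpy as np

# Check that |eps(t)| <= n / (2(M+1)) and that projecting f(b) - f(a) onto
# the tangent at the nearest sentry approximates b - a for nearby points.
n, M = 50, 6                                     # M + 1 = 7 sentries, gcd(7, 50) = 1

def f(t):
    t = np.asarray(t, dtype=float)
    return np.stack([np.cos(2 * np.pi * t / n), np.sin(2 * np.pi * t / n),
                     np.cos(np.pi * t / n),     np.sin(np.pi * t / n)],
                    axis=-1) / np.sqrt(2)

s = (np.arange(M + 1) + 0.5) * n / (M + 1)       # uniformly spaced sentry parameters

def sigma(t):                                    # index of the closest sentry point
    return int(np.argmin(np.abs(s - t)))

def eps(t):                                      # t = s[sigma(t)] + eps(t)
    return t - s[sigma(t)]

def local_estimate(a, b, h=1e-4):
    """(f(b) - f(a)) . f'(s_j) / |f'(s_j)|^2 with j = sigma(b)."""
    j = sigma(b)
    fp = (f(s[j] + h) - f(s[j] - h)) / (2 * h)   # finite-difference tangent at the sentry
    return float((f(b) - f(a)) @ fp / (fp @ fp))

for t in np.linspace(0, n, 11):
    assert abs(eps(t)) <= n / (2 * (M + 1)) + 1e-9   # the offset bound holds

for a, b in [(17.0, 21.0), (30.0, 33.0), (8.0, 9.5)]:
    print(b - a, round(local_estimate(a, b), 3))     # true vs locally estimated distance
```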

Attention mechanism:

If we are given a two-token input $(f(a), f(b))$ to the transformer, then, assuming the attention patterns can be fixed by position (each head attends to one specific token), two attention heads are sufficient to compute $f(b) - f(a)$ at the second position (have one head which outputs $-f(a)$, and the other $f(b)$). We write $f(b) - f(a)$ to a subspace orthogonal to the span of the curve, so that the MLP can cleanly access it later.
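Here is a minimal sketch of what the two heads’ OV circuits would do. The attention-pattern computation is abstracted away (which token each head attends to is simply assumed), and the use of dimensions 4..7 as the orthogonal subspace, via the matrix `W_out`, is my own bookkeeping:

```python
import numpy as np

n, d = 50, 16

def f(t):  # the same toy curve, lying in the first 4 of the d dimensions
    t = np.asarray(t, dtype=float)
    comps = [np.cos(2 * np.pi * t / n), np.sin(2 * np.pi * t / n),
             np.cos(np.pi * t / n),     np.sin(np.pi * t / n)]
    return np.stack(comps + [np.zeros_like(t)] * (d - 4), axis=-1) / np.sqrt(2)

a, b = 17, 21
resid = np.stack([f(a), f(b)])                 # residual stream at the two positions

W_out = np.zeros((d, d))                       # copies the curve subspace (dims 0..3)
W_out[4:8, 0:4] = np.eye(4)                    # into an orthogonal subspace (dims 4..7)

head_1 = -W_out @ resid[0]                     # head 1 attends to the first token
head_2 =  W_out @ resid[1]                     # head 2 attends to the second token

x = resid[1] + head_1 + head_2                 # residual stream at position 2, post attention
assert np.allclose(x[0:4], f(b)[0:4])          # the token's own embedding is untouched
assert np.allclose(x[4:8], (f(b) - f(a))[0:4]) # f(b) - f(a) sits in the orthogonal subspace
```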

MLP mechanism:

The MLP mechanism consists of $2(M+1)$ neurons, with a pair of two neurons associated with each sentry point.

For each sentry point $f(s_j)$, we define a neuron pair:

$$n_j^{\pm}(x) = \mathrm{ReLU}\big(w_j \cdot x + \beta_j \pm u_j \cdot x\big),$$

where the gate weights $w_j$ and bias $\beta_j$ are tuned so that $w_j \cdot f(t) + \beta_j > 0$ when $\sigma(t) = j$, and $w_j \cdot f(t) + \beta_j < 0$ otherwise. Additionally we set up the signal weights $u_j$ (which read the subspace where attention wrote $f(b) - f(a)$) so that $u_j \cdot x$ has a magnitude of less than $|w_j \cdot x + \beta_j|$ for every input $x$ the layer will actually see.

Example sentry neuron activations along the curve with $M = 5$. Each different colour corresponds to the activation of a different sentry neuron. We can pick $M + 1$ coprime to $n$ so that the sentries don’t vanish at $f(i)$ for any integer $i$, and so that at those points they are bounded away from zero by some margin $\delta$. A signal of magnitude less than $\delta$ can be encoded in the difference between the activations of pairs of these sentry neurons.
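As a sanity check on the coprimality point, here is a small numerical sketch. The constant-norm toy curve, the dot-product gate, and the exact slack used for the threshold are assumptions I’m making to have something concrete to run:

```python
import numpy as np

# With gcd(M+1, n) = 1, every integer position lands strictly inside one
# sentry's "on" region, so exactly one gate is active there and it is
# bounded away from zero.
n, d, M = 50, 16, 6                              # gcd(M + 1, n) = gcd(7, 50) = 1

def f(t):
    t = np.asarray(t, dtype=float)
    comps = [np.cos(2 * np.pi * t / n), np.sin(2 * np.pi * t / n),
             np.cos(np.pi * t / n),     np.sin(np.pi * t / n)]
    return np.stack(comps + [np.zeros_like(t)] * (d - 4), axis=-1) / np.sqrt(2)

s     = (np.arange(M + 1) + 0.5) * n / (M + 1)   # sentries at the centres of their regions
half  = n / (2 * (M + 1))                        # half-width of a sentry region
delta = 1 / (2 * (M + 1))                        # slack, smaller than the 1/(M+1) gap that
                                                 # coprimality keeps between integers and boundaries

# f(s_j) . f(t) depends only on |t - s_j| for this curve, so one threshold
# (its value at parameter distance half + delta) works for every sentry.
theta = float(f(0.0) @ f(half + delta))

margins = []
for i in range(1, n + 1):
    gates = f(s) @ f(i) - theta                  # w_j . f(i) + beta_j for every sentry j
    assert (gates > 0).sum() == 1                # exactly one sentry gate is on at position i
    margins.append(gates.max())

print("smallest active-gate margin over integer positions:", round(min(margins), 4))
```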

Setting up these sentries relies on $f$ not coming close to intersecting itself, so that the dot product with $w_j$ is only high on a connected interval of the curve.

We wrote $f(b) - f(a)$ in a subspace orthogonal to the curve so that the sentry neurons can all be written in the standard form $\mathrm{ReLU}(w \cdot x + \beta)$, with $w = w_j \pm u_j$ and $\beta = \beta_j$, where $x$ is the residual stream post attention.

We then output $\tfrac{1}{2}\big(n_j^+ - n_j^-\big)$ from the $j$th neuron pair.

Under this construction $n_j^+$ and $n_j^-$ always activate at the same time as each other, so the output of the $j$th neuron pair is $u_j \cdot x$ if $\sigma(b) = j$, and $0$ if $\sigma(b) \ne j$.

Since only a single neuron pair activates at once, the complete output of this MLP layer is $u_{\sigma(b)} \cdot x = u_{\sigma(b)} \cdot \big(f(b) - f(a)\big)$ (recall that $u_j$ reads only the subspace where attention wrote $f(b) - f(a)$).

Then setting $u_j$ proportional to $\frac{f'(s_j)}{\lVert f'(s_j) \rVert^2}$, we get an output proportional to $b - a$ whenever $a$ and $b$ both lie close to the sentry $s_{\sigma(b)}$.
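Putting the pieces together, here is an end-to-end sketch of the toy construction under the same assumptions as above. The scale $\gamma$ on the signal weights is an extra knob I’ve introduced to keep the signal term below the gate margin, as required earlier, and it is undone by the output scaling:

```python
import numpy as np

# Token embeddings -> two attention heads write f(b) - f(a) into an
# orthogonal subspace -> sentry neuron pairs -> differenced readout ~ b - a.
n, d, M = 50, 16, 6                              # gcd(M + 1, n) = gcd(7, 50) = 1

def f(t):                                        # constant-speed toy curve in dims 0..3
    t = np.asarray(t, dtype=float)
    comps = [np.cos(2 * np.pi * t / n), np.sin(2 * np.pi * t / n),
             np.cos(np.pi * t / n),     np.sin(np.pi * t / n)]
    return np.stack(comps + [np.zeros_like(t)] * (d - 4), axis=-1) / np.sqrt(2)

s     = (np.arange(M + 1) + 0.5) * n / (M + 1)   # sentry parameters
half  = n / (2 * (M + 1))
delta = 1 / (2 * (M + 1))                        # slack < the coprimality gap 1/(M+1)
theta = float(f(0.0) @ f(half + delta))          # shared gate threshold
gamma = 1e-5                                     # keeps |u_j . x| below the gate margin
h     = 1e-4                                     # finite-difference step for f'

W_plus, W_minus = [], []
for j in range(M + 1):
    w_gate = f(s[j])                                   # w_j: reads the token's own embedding
    fp = (f(s[j] + h) - f(s[j] - h)) / (2 * h)         # f'(s_j)
    u = np.zeros(d)
    u[4:8] = gamma * fp[:4] / (fp @ fp)                # u_j: reads the orthogonal copy of f(b) - f(a)
    W_plus.append(w_gate + u)
    W_minus.append(w_gate - u)
W_plus, W_minus = np.array(W_plus), np.array(W_minus)

def estimate_distance(a, b):
    """Toy forward pass at the second token; returns the estimate of b - a."""
    x = f(b).copy()
    x[4:8] = (f(b) - f(a))[:4]                         # what the two attention heads wrote
    act_plus  = np.maximum(W_plus  @ x - theta, 0.0)   # the 2(M+1) ReLU neurons
    act_minus = np.maximum(W_minus @ x - theta, 0.0)
    return float((act_plus - act_minus).sum() / (2 * gamma))

for a, b in [(17, 21), (30, 33), (8, 9), (40, 47)]:
    print((a, b), "true:", b - a, "estimate:", round(estimate_distance(a, b), 2))
# The first three pairs sit close to a single sentry, so the estimates are close;
# the last pair straddles two sentry regions, so the estimate degrades.
```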

Extensions of the mechanism:

The above mechanism is clean, and captures the key ideas. A nice thing about the mechanism is that the common gating term $w_j \cdot x + \beta_j$ can be noisy, but it doesn’t matter, because the common-mode noise gets cancelled out when the pair is differenced, similar to twisted-pair encoding.
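A tiny standalone illustration of that cancellation (all the numbers here are arbitrary):

```python
import numpy as np

# The shared gate term is noisy, but it enters both neurons of a pair with
# the same sign, so it cancels when the pair is differenced.
rng = np.random.default_rng(0)
signal = 0.003                                     # what the pair is meant to carry

for _ in range(5):
    gate = 0.05 + rng.uniform(-0.02, 0.02)         # noisy common-mode term, always > |signal|
    n_plus  = max(gate + signal, 0.0)              # ReLU(gate + signal)
    n_minus = max(gate - signal, 0.0)              # ReLU(gate - signal)
    print(round((n_plus - n_minus) / 2, 6))        # 0.003 every time, whatever the noise was
```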

However, there can still be potential issues caused by noise at the transitions between sentries. It is also not robust to $f$ intersecting itself, and the number of sentries required grows with $n$.

There are cases where this kind of two-neuron interference cancellation could come in useful outside of computing distance. For example, if you want to distinguish between British and Canadian English, you could have a pair of neurons that both read a strong shared “this is English text” component, with a weak British-versus-Canadian direction added to one and subtracted from the other.

And then take the difference between the two. The interference that would usually make it difficult to distinguish between the two very similar dialects gets cancelled out.
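A hypothetical sketch of what that pair could look like; the directions `w_english` and `w_dialect`, the read-out scale `eps`, and the toy inputs are entirely made up for illustration:

```python
import numpy as np

# Two ReLU neurons sharing a large "English" gate term, with a weak dialect
# direction added to one and subtracted from the other; differencing the pair
# cancels the shared component and keeps only the dialect signal.
rng = np.random.default_rng(1)
dim = 32
w_english = rng.normal(size=dim); w_english /= np.linalg.norm(w_english)
w_dialect = rng.normal(size=dim)
w_dialect -= (w_dialect @ w_english) * w_english   # make it orthogonal to w_english
w_dialect /= np.linalg.norm(w_dialect)

def pair_difference(x, eps=0.01):
    """Read x with ReLU(w_english.x +/- eps * w_dialect.x) and difference the pair."""
    n_plus  = max(w_english @ x + eps * (w_dialect @ x), 0.0)
    n_minus = max(w_english @ x - eps * (w_dialect @ x), 0.0)
    return (n_plus - n_minus) / (2 * eps)

british  = (5.0 + rng.uniform(-1, 1)) * w_english + 0.2 * w_dialect
canadian = (5.0 + rng.uniform(-1, 1)) * w_english - 0.2 * w_dialect
print(round(pair_difference(british), 3), round(pair_difference(canadian), 3))
# Prints 0.2 and -0.2: the variable shared English component cancels exactly,
# leaving only the weak dialect direction.
```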

Though this is probably just PCA??