GPT-2’s positional embedding matrix is a helix

In the context of transformer models, the “positional embedding matrix” is the thing that encodes the meaning of positions within a prompt. For example, given the prompt:

Hello my name is Adam

the prompt would generally be broken down into tokens as follows:

['<|endoftext|>', 'Hello', ' my', ' name', ' is', ' Adam']

(For whatever reason prompts to GPT-2 generally have an <|endoftext|> token prepended to them before being fed through, to match how the model was trained.)

For this prompt, the mapping of tokens to positions would be as follows:

'<|endoftext|>': 0

'Hello': 1

' my': 2

' name': 3

' is': 4

' Adam': 5

The positional embedding maps the positions of those tokens (0, 1, 2, 3, 4, and 5) to the meanings of those positions in vector space. More concretely, the positional embedding matrix maps each of those six numbers to a 768-dimensional vector of floating-point numbers, and that 768-dimensional vector gets added to a different vector that represents the semantic meaning of the token. But the first vector comes directly from the positional embedding matrix, and it is the only way the transformer has of identifying where in the prompt a given token was. So we should expect that each row of the positional embedding matrix is unique. Otherwise, two different positions would be mapped to the same vector, and the transformer would have no way of knowing which of those two positions a given token was in!
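As a rough illustration of the lookup-and-add step described above, here is a minimal sketch using random stand-in matrices rather than GPT-2’s real weights. The names W_E and W_pos and the six-entry toy vocabulary are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ctx, d_model = 1024, 768  # sizes quoted in the text for gpt2-small

tokens = ['<|endoftext|>', 'Hello', ' my', ' name', ' is', ' Adam']

# Toy stand-ins: random numbers, NOT the trained GPT-2 weights.
W_E = rng.normal(size=(len(tokens), d_model))  # one row per token in a toy vocabulary
W_pos = rng.normal(size=(n_ctx, d_model))      # one row per possible position

token_ids = np.arange(len(tokens))  # toy ids: token i has id i
positions = np.arange(len(tokens))  # 0, 1, 2, 3, 4, 5

# Each token's input vector is its token embedding plus the row of W_pos
# selected by its position.
x = W_E[token_ids] + W_pos[positions]

# Position is only recoverable if no two rows of W_pos coincide.
n_unique_rows = len(np.unique(W_pos, axis=0))
```

Here x has one 768-dimensional row per token, and n_unique_rows counts the distinct rows of the stand-in positional matrix.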

There are 1024 rows in the positional embedding matrix; this is because there are 1024 possible positions in the prompt, and each possible position gets its own row.

We should also expect that the different 768-dimensional vectors live in a low-rank linear subspace, which is just a fancy way of saying a line or a plane or something like that. After all, if you were a human engineer designing a transformer from scratch, you might devote just one of the 768 entries in each vector to encoding the position. For example, you might use the first entry of each vector for this purpose, and make it 0 if the token was in position 0, 1 if it was in position 1, 2 if it was in position 2, and so on. Then you’d have the other 767 entries to encode the semantic meaning of each token, without interfering with your encoding of the token’s position. In vector space, the whole positional embedding matrix would then lie on a single line: the line of points where the first coordinate of each point was an integer between 0 and 1023, and every other coordinate was 0. To instead use half the vector (for example) to encode the token’s position would be very wasteful, since you don’t need that many entries just to encode a single integer between 0 and 1023.
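The hand-designed scheme from this thought experiment can be written down directly. This is only a sketch of the hypothetical design, not anything GPT-2 actually does; the name W_pos_naive is made up for the example.

```python
import numpy as np

n_ctx, d_model = 1024, 768

# Hypothetical hand-designed positional matrix: the first entry of each row
# stores the position as a plain number, and the other 767 entries stay zero.
W_pos_naive = np.zeros((n_ctx, d_model))
W_pos_naive[:, 0] = np.arange(n_ctx)

# Every row is a multiple of the first basis vector, so all rows lie on one line.
rank = np.linalg.matrix_rank(W_pos_naive)
```

A rank of 1 is the linear-algebra way of saying that all 1024 rows sit on a single line through the origin.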

The model, of course, was produced by a training algorithm, and so does something weirder than what a human would do. The 768-dimensional vectors mostly live in a low-rank linear subspace; a little bit of each vector lives outside that subspace, but a three-dimensional subspace is enough to explain 90% of the variance over vectors in gpt2-small. We can use a technique called Principal Component Analysis (PCA) to find that three-dimensional subspace. When we graph each vector in the positional embedding matrix as a single point in the three-dimensional subspace, we get a helix:

In this plot and all plots that follow, I omit position 0, because it’s always an outlier; that position is the position of the <|endoftext|> token that gets prepended to every prompt and it has its own weird stuff going on. The dark blue end of the helix is the vectors of the first positions in the prompt; the dark red end of the helix represents the vectors of the last positions in the prompts. Interestingly, we can see that the very last position, position 1023, is an outlier as well; I don’t understand why that would be.
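The PCA projection described above can be sketched roughly as follows. This is not the code used to make the plots: the data is a synthetic noisy helix embedded in 768 dimensions, and the helper name pca_project, the number of turns, and the noise level are all assumptions chosen for illustration.

```python
import numpy as np

def pca_project(X, k=3):
    """Project the rows of X onto their top-k principal components.

    Returns the k-dimensional coordinates and the fraction of total
    variance that those k components explain.
    """
    Xc = X - X.mean(axis=0)                       # PCA works on centered data
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = Xc @ Vt[:k].T                        # coordinates in the top-k subspace
    explained = (S[:k] ** 2).sum() / (S ** 2).sum()
    return coords, explained

# Synthetic stand-in for the positional embedding matrix (NOT real GPT-2 weights).
rng = np.random.default_rng(0)
n_ctx, d_model = 1024, 768
t = np.arange(1, n_ctx)                           # skip position 0, as the plots do
theta = 2 * np.pi * 1.5 * t / n_ctx               # 1.5 turns: an arbitrary choice
helix3d = np.stack([np.cos(theta), np.sin(theta), t / n_ctx], axis=1)

Q, _ = np.linalg.qr(rng.normal(size=(d_model, 3)))  # random orthonormal 3D basis
X = helix3d @ Q.T + 0.005 * rng.normal(size=(len(t), d_model))

coords, explained = pca_project(X, k=3)
```

Plotting the three columns of coords against each other, colored by t, would give a picture of the same kind as the one described here. To try this on a real model, one option (assuming TransformerLens is installed) is to replace X with the model's W_pos matrix converted to a NumPy array.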

This result is pretty consistent across the various different GPT-2 models, including the Stanford ones; here’s the equivalent plot for all 9 GPT-2 models on TransformerLens:

We get a clear helix in all but gpt2-xl (the middle-left plot). I’m not sure why that one is so different; it’s still a little helix-like, but a lot less than the 8 others. Interestingly, there are some qualitative differences between OpenAI’s four GPT-2 models and the five GPT-2 models from Stanford. More of the variance in the positional embedding matrix can be attributed to a three-dimensional subspace in OpenAI’s models than in Stanford’s; Stanford’s helices are “shakier”; and Stanford’s helices have more loops in them than OpenAI’s. I don’t know where these qualitative differences come from; presumably something about how they were trained. I know that Stanford’s were trained on OpenWebText whereas OpenAI’s were trained on a different, private-but-reproducible dataset, but I don’t know of other differences.

We can see some amount of “fraying” in the OpenAI helices on the blue end; this implies that maybe the earliest vectors in the positional embedding matrix are relying on a different subspace than all the others. To check this, I tried truncating out the first 100 vectors (so just looking at the 924 vectors corresponding to positions 100 through 1023).

This makes a big difference to the OpenAI models, getting rid of all the “fraying” and substantially increasing the percentage of variance explained by the 3D subspace found. It makes very little difference to the Stanford models.

In contrast, if we look at a PCA over just the first 100 positions, this is what we see:

(Note that the 3D space we’re looking at isn’t the same each time; every time we pick some subset of the vectors to look at, we’re finding the “best” 3D subspace for just that set of vectors. That’s why this plot doesn’t just look like a “zoomed-in” version of the previous plots.)
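To make the point about refitting concrete, here is a small sketch on synthetic data. It is not the real embedding matrix: the extra early-position component along a fourth direction is an assumption invented to mimic “fraying”, and top_k_basis is a hypothetical helper name.

```python
import numpy as np

def top_k_basis(X, k=3):
    """Return the top-k principal directions of X and the variance fraction they explain."""
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k], (S[:k] ** 2).sum() / (S ** 2).sum()

rng = np.random.default_rng(0)
n_ctx, d_model = 1024, 768
Q, _ = np.linalg.qr(rng.normal(size=(d_model, 4)))  # four orthonormal directions
t = np.arange(n_ctx)
theta = 2 * np.pi * 1.5 * t / n_ctx

# Synthetic helix in the span of the first three directions...
X = (np.outer(np.cos(theta), Q[:, 0])
     + np.outer(np.sin(theta), Q[:, 1])
     + np.outer(t / n_ctx, Q[:, 2]))
# ...plus an invented component that only matters for early positions.
X += np.outer(3.0 * np.exp(-t / 20), Q[:, 3])
X += 0.005 * rng.normal(size=X.shape)

_, ev_all = top_k_basis(X[1:])               # positions 1..1023
basis_tail, ev_tail = top_k_basis(X[100:])   # positions 100..1023
basis_head, ev_head = top_k_basis(X[1:100])  # positions 1..99

# Cosines of the principal angles between the two fitted 3D subspaces;
# all three would be 1 if the subspaces were identical.
overlap = np.linalg.svd(basis_tail @ basis_head.T, compute_uv=False)
```

Comparing ev_tail with ev_all shows how dropping early positions can raise the variance explained, and a value in overlap well below 1 shows that the head and tail fits pick out different 3D subspaces.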

We can see from these that the OpenAI models really seem to care a lot about the first few tokens and “differentiate” those tokens more by spacing them out. (Who even knows what’s going on with the Stanford models here! I’m confused by those plots.)

Some people observed parts of this helical structure over the positional embedding matrices previously; for example, this Reddit post notes that many of the entries of the vectors in the pos-embed matrix, taken in isolation, make something that looks kind of like a sine wave. And Lukas Finnveden noticed a periodic pattern in the cosine similarity between rows of the positional embedding matrix (thanks to Arthur Conmy for pointing this out!).

Both of these observations are explained by the helical structure of the positional embedding matrix.
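One way to see why a helix would produce both effects is to build an idealized one and check. The sketch below uses an exact synthetic helix, with an arbitrary number of turns, pitch, and mixing weights; none of these values are taken from GPT-2.

```python
import numpy as np

n_ctx = 1024
turns = 4                      # arbitrary choice for the illustration
period = n_ctx // turns        # positions per full loop
t = np.arange(n_ctx)
theta = 2 * np.pi * t / period
helix = np.stack([np.cos(theta), np.sin(theta), 0.5 * t / n_ctx], axis=1)

# A single entry of the embedding is a fixed linear combination of the helix
# coordinates, so it looks like a sine wave plus a slow linear drift.
w = np.array([0.7, -0.2, 0.1])           # arbitrary mixing weights
entry = helix @ w
drift_per_period = entry[period:] - entry[:-period]  # constant if sinusoid + line

# Cosine similarity between one reference row and every row.
ref = helix[512]
sims = helix @ ref / (np.linalg.norm(helix, axis=1) * np.linalg.norm(ref))
```

In this toy setup, drift_per_period is constant, which is what a sinusoid plus a straight line gives, and sims rises and falls once per loop as the offset from the reference position grows.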

It remains unexplained why this helical structure is the most natural way for GPT-2 to express position. Maybe something about the structure of transformers makes it very easy to encode information as circular patterns? Neel Nanda’s Grokking Modular Arithmetic found that a trained network learned to use trigonometric functions to perform modular arithmetic—but that could just be a coincidence, since modular arithmetic seems more clearly related to periodic functions than the positions of tokens in a prompt.