Understanding the tensor product formulation in Transformer Circuits
I was trying to understand the tensor product formulation in transformer circuits, and I had basically forgotten all I ever knew about tensor products, if I ever knew anything. This very brief post is aimed at me from Wednesday the 22nd, when I didn't understand why that formulation of attention was true. It basically just gives a bit more background and includes a few more steps. I hope it will be helpful to someone else, too.
Tensor product
To understand this, we first need tensor products. Given two finite-dimensional vector spaces $V, W$, we can construct the tensor product space $V \otimes W$ as the span[1] of all matrices $v \otimes w$, where $v \in V, w \in W$, with the property $(v \otimes w)_{ij} = v_i w_j$.[2] We can equivalently define it as the vector space with basis elements $e^V_i \otimes e^W_j$, where $e^V_i$ and $e^W_j$ are the basis elements of $V$ and $W$ respectively.
But we can define tensor products not only between vectors, but also between linear maps from one vector space to another (i.e. matrices!):
Given two linear maps (matrices) $A: V \to X$ and $B: W \to Y$, we can define $A \otimes B: V \otimes W \to X \otimes Y$, where each map simply operates on its own vector space without interacting with the other:

$$(A \otimes B)(v \otimes w) = A(v) \otimes B(w)$$
For more information on the tensor product, I recommend this intuitive explanation and the Wikipedia entry.
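As a quick sanity check of this defining property, here is a numpy sketch (the dimensions are made up for illustration): `np.outer` gives $v \otimes w$ as a matrix, and `np.kron` gives $A \otimes B$ acting on the row-major-flattened tensor.

```python
import numpy as np

rng = np.random.default_rng(0)

# v ∈ V = R^3, w ∈ W = R^4; their tensor product as a matrix: (v⊗w)_ij = v_i w_j
v, w = rng.normal(size=3), rng.normal(size=4)
vw = np.outer(v, w)

# Linear maps A: V → X = R^2 and B: W → Y = R^5
A = rng.normal(size=(2, 3))
B = rng.normal(size=(5, 4))

# Check (A⊗B)(v⊗w) = A(v) ⊗ B(w), comparing both sides as 2x5 matrices.
# On row-major-flattened tensors, A⊗B is the Kronecker product np.kron(A, B).
lhs = (np.kron(A, B) @ vw.reshape(-1)).reshape(2, 5)
rhs = np.outer(A @ v, B @ w)
assert np.allclose(lhs, rhs)
```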
How does this connect to the attention-only transformer?
In the "attention-only" formulation of the transformer, we can write the "residual" of a fixed head as $A X W_V W_O$, with the attention matrix $A$, the current embeddings at each position $X$, the value weight matrix $W_V$, and the output weight matrix $W_O$.
Let $E$ be the embedding dimension, $L$ the total context length, and $D$ the dimension of the values. Then:
$X$ is an $L \times E$ matrix,
$A$ is an $L \times L$ matrix,
$W_V$ is an $E \times D$ matrix, and
$W_O$ is a $D \times E$ matrix.
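To keep the shapes straight, here is a minimal numpy sketch; the concrete values of $L$, $E$, $D$ are assumptions for illustration only.

```python
import numpy as np

L, E, D = 8, 16, 4  # context length, embedding dim, value dim (illustrative)
rng = np.random.default_rng(0)

X = rng.normal(size=(L, E))    # embeddings at each position
A = rng.normal(size=(L, L))    # attention pattern
W_V = rng.normal(size=(E, D))  # value weights
W_O = rng.normal(size=(D, E))  # output weights

# The head's residual contribution A X W_V W_O is again an L x E matrix,
# i.e. it lives in the same space as X.
residual = A @ X @ W_V @ W_O
assert residual.shape == (L, E)
```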
Let’s identify the participating vector spaces:
$A$ maps from the "position" space back to the "position" space, which we will call $P$ (and which is isomorphic to $\mathbb{R}^L$). Similarly, we have the "embedding" space $E \cong \mathbb{R}^E$ and the "value" space $V \cong \mathbb{R}^D$.
It might become clear now that we can identify $X$ with an element of $P \otimes E$, i.e. that we can write $X = \sum_{ij} X_{ij} \, (e^P_i \otimes e^E_j)$.
From that lens, we can see that right-multiplying $X$ with $W_V$ is equivalent to applying $\mathrm{Id} \otimes W_V$, which maps an element of $P \otimes E$ to an element of $P \otimes V$ by applying $W_V$ to the $E$-part of the tensor:[3]
$$(\mathrm{Id} \otimes W_V)(X) = (\mathrm{Id} \otimes W_V) \sum_{ij} X_{ij} \, e^P_i \otimes e^E_j = \sum_{ij} X_{ij} \, e^P_i \otimes W_V(e^E_j) = \sum_{ij} X_{ij} \, e^P_i \otimes \sum_k (W_V)_{jk} \, e^V_k = \sum_{ik} \Big( \sum_j X_{ij} (W_V)_{jk} \Big) \, e^P_i \otimes e^V_k = \sum_{ik} (X W_V)_{ik} \, e^P_i \otimes e^V_k = X W_V$$
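We can also check this numerically. Under row-major flattening, $\mathrm{Id} \otimes W_V$ becomes the Kronecker product of the $L \times L$ identity with $W_V^T$; the transpose appears because, as footnote [3] notes, the underlying linear map is right multiplication by $W_V$. A sketch with illustrative dimensions:

```python
import numpy as np

L, E, D = 8, 16, 4  # illustrative dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(L, E))
W_V = rng.normal(size=(E, D))

# Id ⊗ W_V on the row-major-flattened tensor: identity on the P-part,
# W_V^T on the E-part (right multiplication, viewed as a linear map).
id_tensor_wv = np.kron(np.eye(L), W_V.T)            # shape (L*D, L*E)
lhs = (id_tensor_wv @ X.reshape(-1)).reshape(L, D)  # (Id ⊗ W_V)(X)
rhs = X @ W_V                                       # plain right multiplication
assert np.allclose(lhs, rhs)
```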
Identical arguments hold for $W_O$ and $A$, so that we get the formulation from the paper:

$$A X W_V W_O = (A \otimes W_O W_V) \cdot X$$
Note that there is nothing special here about what these matrices represent. So a takeaway message is that whenever you have a matrix product of the form $ABC$, you can rewrite it as $(A \otimes C) \cdot B$, where the $C$ factor acts by right multiplication (sorry to everyone who thought that was blatantly obvious from the get-go ;P).[4]
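Concretely, for any conformable matrices, flattening $B$ row-major turns $A B C$ into a single matrix-vector product; since the $C$ factor acts by right multiplication, it shows up as $C^T$ in the Kronecker product. A quick sketch with made-up shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(4, 5))

# (A ⊗ C) · B, with the C factor acting by right multiplication:
# under row-major flattening this is the Kronecker product kron(A, C^T).
lhs = (np.kron(A, C.T) @ B.reshape(-1)).reshape(2, 5)
rhs = A @ B @ C
assert np.allclose(lhs, rhs)
```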
A previous edition of this post said that it was the space of all such matrices, which is inaccurate: the span of a set of vectors/matrices is the space of all linear combinations of elements of that set.
I'm limiting myself to finite-dimensional spaces because that's what is relevant to the transformer circuits paper. The actual formal definition is more general/stricter, but imo doesn't add much to understanding the application in this paper.
Note that the "linear map" we use here is really right multiplication with $W_V$, so that it maps $e^E_k \mapsto W_V^T e^E_k$.
I should note that this is also mentioned in the paper's introduction to tensor products, but it didn't click with me, whereas going through the above steps did.