Understanding the tensor product formulation in Transformer Circuits

I was trying to understand the tensor product formulation in transformer circuits and I had basically forgotten all I ever knew about tensor products, if I ever knew anything. This very brief post is aimed at me from Wednesday 22nd when I didn’t understand why that formulation of attention was true. It basically just gives a bit more background and includes a few more steps. I hope it will be helpful to someone else, too.

Tensor product

To understand this, we first need the tensor product. Given two finite-dimensional vector spaces $V$ and $W$ we can construct the tensor product space $V \otimes W$ as the span[1] of all matrices $v \otimes w := v w^\top$, where $v \in V$ and $w \in W$, with the property that $\otimes$ is linear in each argument[2]. We can equivalently define it as a vector space with basis elements $e_i \otimes f_j$, where we used the basis elements $e_i$ of $V$ and $f_j$ of $W$ respectively.
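If you prefer to see this in code, here is a minimal numpy sketch (the spaces, vectors, and numbers are made up for illustration) of the two pictures of $v \otimes w$: as the rank-1 matrix $v w^\top$, and as a vector written in the basis $e_i \otimes f_j$.

```python
import numpy as np

# Made-up example: V = R^2, W = R^3
v = np.array([1.0, 2.0])        # element of V
w = np.array([3.0, 4.0, 5.0])   # element of W

# v ⊗ w pictured as the rank-1 matrix v w^T ...
outer = np.outer(v, w)          # shape (2, 3)

# ... or, flattened row by row, as a vector in the basis e_i ⊗ f_j,
# which is exactly the Kronecker product
assert np.allclose(outer.reshape(-1), np.kron(v, w))

# dim(V ⊗ W) = dim(V) * dim(W) = 6
print(np.kron(v, w).shape)      # (6,)
```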

But we can define tensor products not only between vectors but also between linear maps from one vector space to another (i.e. matrices!):

Given two linear maps (matrices) $f: V \to V'$ and $g: W \to W'$ we can define $f \otimes g: V \otimes W \to V' \otimes W'$, where each map simply operates on its own vector space, not interacting with the other:

$$(f \otimes g)(v \otimes w) = f(v) \otimes g(w).$$
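Again a small numpy sanity check (with made-up shapes): np.kron builds the matrix of $f \otimes g$, and applying it to $v \otimes w$ is the same as applying $f$ and $g$ separately to their own factors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up maps f: R^2 -> R^4 and g: R^3 -> R^5, as matrices
F = rng.normal(size=(4, 2))
G = rng.normal(size=(5, 3))

v = rng.normal(size=2)
w = rng.normal(size=3)

# (f ⊗ g)(v ⊗ w) == f(v) ⊗ g(w): each map only touches "its" factor
lhs = np.kron(F, G) @ np.kron(v, w)
rhs = np.kron(F @ v, G @ w)
assert np.allclose(lhs, rhs)
```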

For more information on the tensor product, I recommend this intuitive explanation and the Wikipedia entry.

How does this connect to the attention-only transformer?

In the “attention-only” formulation of the transformer we can write the “residual” of a fixed head as $r = A\, x\, W_V^\top W_O^\top$, with the value weight matrix $W_V$, the attention matrix $A$, the output weight matrix $W_O$, and the current embeddings at each position stacked into $x$.

Let $d$ be the embedding dimension, $n$ the total context length, and $d_v$ the dimension of the values. Then we have that (a small shape check in code follows the list):

  • $x$ is an $n \times d$ matrix,

  • $W_V$ is a $d_v \times d$ matrix,

  • $A$ is an $n \times n$ matrix, and

  • $W_O$ is a $d \times d_v$ matrix.
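Here is that shape check, with made-up sizes ($n=5$, $d=8$, $d_v=4$), confirming that the residual $r = A\, x\, W_V^\top W_O^\top$ is again an $n \times d$ matrix:

```python
import numpy as np

rng = np.random.default_rng(1)

n, d, d_v = 5, 8, 4                # context length, embedding dim, value dim

x   = rng.normal(size=(n, d))      # embeddings, one row per position
W_V = rng.normal(size=(d_v, d))    # value weights
A   = rng.normal(size=(n, n))      # attention pattern (not normalised here)
W_O = rng.normal(size=(d, d_v))    # output weights

# Residual of the head: attention mixes positions, W_O W_V acts on each embedding
r = A @ x @ W_V.T @ W_O.T
assert r.shape == (n, d)
```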

Let’s identify the participating vector spaces:

$A$ maps from the “position” space back to the “position” space, which we will call $P$ (and which is isomorphic to $\mathbb{R}^n$). Similarly, we have the “embedding” space $E \cong \mathbb{R}^d$ and the “value” space $V \cong \mathbb{R}^{d_v}$.

It might become clear now that we can identify $x$ with an element from $P \otimes E$, i.e. that we can write $x = \sum_{i=1}^{n} p_i \otimes x_i$, where $p_i$ is the $i$-th basis vector of $P$ and $x_i$ is the embedding at position $i$.

From that lens, we can see that right-multiplying $x$ with $W_V^\top$ is equivalent to multiplying with $\mathrm{Id} \otimes W_V$, which maps an element from $P \otimes E$ to an element from $P \otimes V$ by applying $W_V$ to the $E$-part of the tensor[3]:

$$(\mathrm{Id} \otimes W_V) \cdot x = \sum_{i=1}^{n} p_i \otimes W_V x_i = x\, W_V^\top.$$
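In coordinates this is easy to check numerically: with $x$ flattened row by row (a convention I'm choosing here, and with the same made-up sizes as before), the Kronecker matrix $\mathrm{Id}_n \otimes W_V$ does exactly the same thing as right-multiplying by $W_V^\top$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, d_v = 5, 8, 4

x   = rng.normal(size=(n, d))      # element of P ⊗ E, one embedding per row
W_V = rng.normal(size=(d_v, d))    # value weights, maps E -> V

# (Id ⊗ W_V) acting on x, viewed as a flat vector in P ⊗ E (row-major)
lhs = np.kron(np.eye(n), W_V) @ x.reshape(-1)   # lands in P ⊗ V, length n * d_v

# Right-multiplying x by W_V^T, i.e. applying W_V to every position's embedding
rhs = (x @ W_V.T).reshape(-1)

assert np.allclose(lhs, rhs)
```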

Identical arguments hold for $A$ and $W_O$, so that we get the formulation from the paper:

$$r = (\mathrm{Id} \otimes W_O) \cdot (A \otimes \mathrm{Id}) \cdot (\mathrm{Id} \otimes W_V) \cdot x = (A \otimes W_O W_V) \cdot x.$$
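A quick numpy check of this (same made-up sizes as before): the three Kronecker factors compose to $A \otimes W_O W_V$, and applying the result to the flattened $x$ reproduces $A\, x\, W_V^\top W_O^\top$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, d_v = 5, 8, 4

x   = rng.normal(size=(n, d))      # embeddings
W_V = rng.normal(size=(d_v, d))    # value weights
A   = rng.normal(size=(n, n))      # attention pattern
W_O = rng.normal(size=(d, d_v))    # output weights

# (Id ⊗ W_O) · (A ⊗ Id) · (Id ⊗ W_V) equals A ⊗ (W_O W_V)
composed = np.kron(np.eye(n), W_O) @ np.kron(A, np.eye(d_v)) @ np.kron(np.eye(n), W_V)
assert np.allclose(composed, np.kron(A, W_O @ W_V))

# Applying it to x (flattened row by row) reproduces the matrix expression
lhs = composed @ x.reshape(-1)
rhs = (A @ x @ W_V.T @ W_O.T).reshape(-1)
assert np.allclose(lhs, rhs)
```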

Note that there is nothing special about this in terms of what these matrices represent. So it seems that a takeaway message is that whenever you have a matrix product of the form $B\, x\, C^\top$ you can re-write it as $(B \otimes C) \cdot x$ (Sorry to everyone who thought that was blatantly obvious from the get-go ;P).[4]
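The general identity is just as easy to check with arbitrary (made-up) matrices of compatible shapes, again flattening $x$ row by row:

```python
import numpy as np

rng = np.random.default_rng(4)

B = rng.normal(size=(3, 5))   # acts on the left ("position-like") factor
C = rng.normal(size=(2, 7))   # acts on the right ("embedding-like") factor
x = rng.normal(size=(5, 7))

# B x C^T, flattened row by row, equals (B ⊗ C) applied to the flattened x
assert np.allclose((B @ x @ C.T).reshape(-1),
                   np.kron(B, C) @ x.reshape(-1))
```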


  1. ↩︎

    A previous edition of this post said that it was the space of all such matrices, which is inaccurate. The span of a set of vectors/matrices is the space of all linear combinations of elements from that set.

  2. ↩︎

    I’m limiting myself to finite-dimensional spaces because that’s what is relevant to the transformer circuits paper. The actual formal definition is more general/stricter but imo doesn’t add much to understanding the application in this paper.

  3. ↩︎

    Note that the ‘linear map’ that we use here is basically right-multiplying with $W_V^\top$, so that it maps $x \mapsto x\, W_V^\top$.

  4. ↩︎

    I should note that this is also what is mentioned in the paper’s introduction on tensor products, but it didn’t click with me, whereas going through the above steps did.