Toward A Mathematical Framework for Computation in Superposition

Jan 18, 2024, 9:06 PM

LW: 205 AF: 90

Interpretability (ML & AI)Superposition AI

Author order randomized. Authors contributed roughly equally — see attribution section for details.

Update as of July 2024: we have collaborated with @LawrenceC to expand section 1 of this post into an arXiv paper, which culminates in a formal proof that computation in superposition can be leveraged to emulate sparse boolean circuits of arbitrary depth in small neural networks.

What kind of document is this?

What you have in front of you is so far a rough writeup rather than a clean text. As we realized that our work is currently highly relevant to recent questions posed by interpretability researchers, we put together a lightly edited version of private notes we’ve written over the last ~4 months. If you’d be interested in writing up a cleaner version, get in touch, or just do it. We’re making these notes public before we’re done with the project because of some combination of (1) seeing others think along similar lines and wanting to make it less likely that people (including us) spend time duplicating work, (2) providing a frame which we think provides plenty of concrete immediate problems for people to independently work on^[1] (3) seeking feedback to decrease the chance we spend a bunch of time on nonsense.

1 minute summary

Superposition is a mechanism that might allow neural networks to represent the values of many more features than they have neurons, provided that those features are present sparsely in the dataset. However, until now, an understanding of how computation can be done in a compressed way directly on these stored features has been limited to a few very specific tasks (for example here). The goal of this post is to lay the groundwork for a picture of how computation in superposition can be done in general. We hope this will enable future research to build interpretability techniques for reverse engineering circuits that are manifestly in superposition.

Our main contributions are:

Formalisation of some tasks performed by MLPs and attention layers in terms of computation on boolean features stored in superposition.
A family of novel constructions which allow a single layer MLP to compute a large number of boolean functions of features entirely in superposition.
Discussion of how these constructions could be leveraged:
1. to emulate arbitrary large sparse boolean circuits entirely in superposition
2. to allow the QK circuit of an attention head to dynamically choose a boolean expression and attend to past token positions where this expression is true.
3. To explain the tentative observation that transformers may store arbitrary XORs of features
A construction which allows the QK circuit of an attention head to check for the presence of surprisingly many query-key feature pairs simultaneously in superposition, on the order of one pair per parameter^[2].

10 minute summary

Thanks to Nicholas Goldowsky-Dill for producing an early version of this summary/diagrams and generally for being instrumental in distilling this post.

Central to our analysis of MLPs is the Universal-AND (U-AND) problem:

Given $m$ input boolean features $f_{1}, f_{2}, \dots, f_{m}$ . These features are sparse, meaning on most inputs only a few features are true, and encoded as directions in the input space $R^{d}$ .
We want to compute all $(\frac{m}{2})$ possible binary conjunctions of these inputs ( $f_{1} \land f_{2}, f_{1} \land f_{3}, \dots$ ), and output them in different linear directions. Some small bounded error in these output values is tolerated.
We want to compute this in a single MLP layer ( $R R e L U (W \to x + b)$ ) with as few neurons as possible, for weight matrix $W$ with shape , bias $B$ , and ‘readoff’ matrix $R$

This problem is central to understanding computation in superposition because:

Many features that people think of are boolean in nature, and reverse engineering the circuits that are involved in constructing them consists of understanding how simpler boolean features are combined to make them. For example, in a vision model, the feature which is 1 if there is a car in the image may be computed by combing the ‘wheels at the bottom of the image’ feature AND the ‘windows at the top’ feature ^[3].
We will be focusing on the part of the network before the readoff with the matrix $R$ . In an analogous way to the toy model of superposition, we consider the first two layers to represent If we can do this task with an MLP with fewer than $(\frac{m}{2})$ neurons, then in a sense we have computed more boolean functions than we have neurons, and the values of these functions will be stored in superposition in the MLP activation space.
Any boolean function can be written as a linear combination of ANDs with different numbers of inputs. For example $XOR (A, B, C) = A + B + C - 2 A \land B - 2 A \land C - 2 B \land C + 4 A \land B \land C$
Therefore, if we can compute and linearly represent all the ANDs in superposition, then we can do so for any boolean function.

If $m = d_{0}$ (the dimension of the input space), then we can store the input features using an orthonormal basis such as the neuron basis. A naive solution in this case would be to have one neuron per pair which is active if both inputs are true and 0 otherwise. This requires $(\frac{m}{2}) = Θ (d_{0}^{2})$ neurons, and involves no superposition:

On this input $x_{1}, x_{2}$ and $x_{5}$ are true, and all other inputs are false.

We can do much better than this, computing all the pairwise ANDs up to a small error with many fewer neurons. To achieve this, we have each neuron care about a random subset of inputs, and we choose the bias such that each neuron is activated when at least two of them are on. This requires $d = Θ (polylog (d))$ neurons:

Importantly:

A modified version works even when the input features are in superposition. In this case we cannot compute all ANDs of potentially exponentially many features. Instead, we must pick up to $~ Θ (d^{2})$ logical gates to calculate at each stage.
A solution to U-AND can be generalized to compute many ANDs of more than two inputs, and therefore to compute arbitrary boolean expressions involving a small number of input variables, with surprisingly efficient asymptotic performance (superpolynomially many functions computed at once). This can be done simply by increasing the density of connections between inputs and neurons, which comes at the cost of interference terms going to zero more slowly.
It may be possible to stack multiple of these constructions in a row and therefore to emulate a large boolean circuit, in which each layer computes boolean functions on the outputs of the previous layer. However, if the interference is not carefully managed, the errors are likely to propagate and eventually become unmanageable. The details of how the errors propagate and how to mitigate this are beyond the scope of this work.
We study the performance of our constructions asymptotically in $d$ , and expect that insofar as real models implement something like them, they will likely be importantly different in order to have low error at finite $d$ .
If the ReLU is replaced by a quadratic activation function, we can provide a construction that is much more efficient in terms of computations per neuron. We suspect that this implies the existence of similarly efficient constructions with ReLU, and constructions that may perform better at finite $d$ .

Our analysis of the QK part of an attention head centers on the task of skip feature-bigram checking:

Given residual stream vectors ${\to a}_{1}, \dots, {\to a}_{T}$ (for sequence length $T$ ) storing boolean features in superposition .
Given a set $B$ of skip feature-bigrams (SFBs) which specify which keys to attend to from each query in terms of features present in the query and key. A skip feature-bigram is a pair of features such as $({\to f}_{6}, {\to f}_{13})$ , and we say that an SFB is present in a query key pair if the first feature is present in the key and the second in the query.
We want to compute an attention score which contains, in each entry, the number of SFBs in $B$ present in the query and key that correspond to that entry. To do so, we look for a suitable choice of the parameters in the weight matrix $W_{Q K}$ , a $d_{resid} \times d_{resid}$ matrix of rank $d_{head}$ . Some small bounded error is tolerated.

This framing is valuable for understanding the role played by the attention mechanism in superposed computation because:

It is a natural modification of the ‘attention head as information movement’ story that engages with the many independent features stored in residual stream vectors in parallel, rather than treating the vectors as atomic units. Each SFB can be thought of as implementing an operation corresponding to statements like ‘if feature ${\to f}_{13}$ is present in the query, then attend to keys for which feature ${\to f}_{6}$ is present’.
The stories normally given for the role played by a QK circuit can be reproduced as particular choices of $B$ . For example, consider the set of ‘identity’ skip feature-bigrams: $B_{Id} = {$ ‘if feature ${\to f}_{i}$ is present in the query, then attend to keys for which feature ${\to f}_{i}$ is also present’ $| \forall i}$ . Checking for the presence of all SFBs in $B_{Id}$ corresponds to attending to keys which are the same as the query.
There are also many sets $B$ which are most naturally thought of in terms of performing each check in $B$ individually.

A nice way to construct $W_{Q K}$ is as a sum of terms for each skip feature-bigram, each of which is a rank one matrix equal to outer product of the two feature vectors in the SFB. In the case that all feature vectors are orthogonal (no superposition) you should be thinking of something like this:

where each of the rank one matrices, when multiplied by a residual stream vector on the right and left, performs a dot product on each side:

${\to a}_{s}^{T} W_{Q K} {\to a}_{t} = \sum i ({\to a}_{s} \cdot {\to f}_{i}^{k}) ({\to f}_{i}^{q} \cdot {\to a}_{t})$

where $(f_{1}^{k}, f_{1}^{q}), \dots, (f_{| B |}^{k}, f_{| B |}^{q})$ are the feature bigrams in $B$ with feature directions $({\to f}_{i}^{k}, {\to f}_{i}^{q})$ , and ${\to a}_{s}$ is a residual stream vector at sequence position $s$ . Each of these rank one matrices contributes a value of $1$ to the value of ${\to a}_{s}^{T} W_{Q K} {\to a}_{t}$ if and only if the corresponding SFB is present. Since the matrix cannot be higher rank than $d_{head}$ , typically we can only check for up to $~ Θ (d_{head})$ SFBs this way.

In fact we can check for many more SFBs than this, if we tolerate some small error. The construction is straightforward once we think of $W_{Q K}$ as this sum of tensor products: we simply add more rank one matrices to the sum, and then approximate the sum as a rank $d_{head}$ matrix, using the SVD or even a random projection matrix $P$ . This construction can be easily generalised to the case that the residual stream stores features in superposition (provided we take care to manage the size of the interference terms) in which case $W_{Q K}$ can be thought of as being constructed like this:

$W_{Q K}$ is a sum of outer products of feature direction vectors, with the vectors on the right projected down to a $d_{head}$ -dimensional subspace

When multiplied by a residual stream vector on the right and left, this expression is ${\to a}_{s}^{T} W_{Q K} {\to a}_{t} = \sum i ({\to a}_{s} \cdot {\to f}_{i}^{k}) (P {\to f}_{i}^{q} \cdot {\to a}_{t})$

Importantly:

It turns out that the interference becomes larger than the signal when roughly one SFB has been checked for per parameter: $| B | = ~ Θ (d_{resid} d_{head})$
When there is structure to the set of SFBs that are being checked for, we can exploit this to check for even more SFBs with a single attention head.
If there is a particular linear structure to the geometric arrangement of feature vectors in the residual stream, many more SFBs can be checked for at once, but this time the story of how this happens isn’t the simplest to describe in terms of a list of SFBs. This suggests that our current description of what the QK circuit does is lacking. In fact, this example exemplifies computation performed by neural nets that we don’t think is best described by our current sparse boolean picture. It may be a good starting point for building a broader theory than we have so far that takes into account other structures.

Indeed, there are many open directions for improving our understanding of computation in superposition, and we’d be excited for others to do future research (theoretical and empirical) in this area.

Some theoretical directions include:

Fitting the OV circuit into the boolean computation picture
Studying error propagation when U-AND is applied sequentially
Finding constructions with better interference at finite $d$
Making the story of boolean computation in transformers more complete by studying things that have not been captured by our current tasks
Generalisations to continuous variables

Empirical directions include:

Training toy models to understand if NNs can learn U-AND and related tasks, and how learned algorithms differ.
Throwing existing interp techniques at NNs trained on these tasks and trying to study what we find. Which techniques can handle the superposition adequately?
Trying to find instances of computation in superposition happening in small language models.

Structure of the Post

In Section 1, we define the U-AND task precisely, and then walk through our construction and show that it solves the task. Then we generalise the construction in 2 important ways: in Section 1.1, we modify the construction to compute ANDs of input features which are stored in superposition, allowing us to stack multiple U-AND layers together to simulate a boolean circuit. In Section 1.2 we modify the construction to compute ANDs of more than 2 variables at the same time, allowing us to compute all sufficiently small^[4] boolean functions of the inputs with a single MLP. Then in Section 1.3 we explore efficiency gains from replacing the ReLU with a quadratic activation function, and explore the consequences.

In Section 2 we explore a series of questions around how to interpret the maths in Section 1, in the style of FAQs. Each part of Section 2 is standalone and can be skipped, but we think that many of the concepts discussed there are valuable and frequently misunderstood.

In section 3 we turn to the QK circuit, carefully introducing the skip feature-bigram checking task, and we explain our construction. We also discuss two scenarios that allow for more SFBs to be checked for than the simplest construction would allow.

We discuss the relevance of our constructions to real models in Section 4, and conclude in Section 5 with more discussion on Open Directions.

Notation and Conventions

$d$ is the dimension of some activation space. $d_{0}$ may also be used for the dimension of the input space, and $d$ for the number of neurons in an MLP
$m$ is the number of input features. If the input features are stored in superposition, $m > d$ , otherwise $m = d$
${\to e}_{1}, {\to e}_{2}, \dots, {\to e}_{d}$ denotes an orthogonal basis of vectors. The standard basis refers to the neuron basis.
All vectors are denoted with arrows on top like this: ${\to f}_{i}$
We use single lines to denote the size of a set like this: $| S_{i} |$ or the $L^{2}$ norm of a vector like this: $| {\to f}_{i} |$
We say that a boolean function $g$ has been computed $ϵ$ -accurately for some small parameter $ϵ$ if the computed output never differs from $g$ by more than $ϵ$ . That is, whenever the function has the output $1$ , the computation outputs a number between $1 \pm ϵ$ and whenever the function outputs $0$ , the computation outputs a number between $\pm ϵ$ .
We say that a pair of unit vectors is $ϵ$ -almost orthogonal (for a fixed parameter $ϵ$ ) if their dot product is $< ϵ$ (equivalently, if they are orthogonal to $ϵ$ -accuracy). We say that a collection of unit vectors is $ϵ$ -almost-orthogonal if they are pairwise almost orthogonal. We assume $ϵ$ to be a fixed small number throughout the paper (unless specified otherwise).
It is known that for fixed $ϵ,$ one can fit exponentially (in $d$ ) many almost orthogonal vectors in a $d$ -dimensional Euclidean space. Throughout this paper, we will assume present in each NN activation space a suitably “large” collection of almost-orthogonal vectors, which we call an overbasis.
Vectors in this overbasis will be called f-vectors^[5], and denoted ${\to f}_{1}, {\to f}_{2}, \dots, {\to f}_{m}$ . We assume they correspond to binary properties of inputs relevant to a neural net (such as “Does this picture contain a cat?”). When convenient, we will assume these f-vectors are generated in a suitably random way: it is known that a random collection of vectors is, with high probability, an almost orthogonal overbasis, so long as the number of vectors is not superexponentially large in $d$ ^[6].

In this post we make extensive use of Big-O notation and its variants, little o, $Θ, Ω, ω$ . See wikipedia for definitions. We also make use of tilde notation, which means we ignore log factors. For example, by saying a function $f (n)$ is $Θ (g (n))$ , we mean that there are nonzero constants $c_{1}, c_{2} > 0$ and a natural number $N$ such that for all $n > N$ , we have $c_{1} g (n) \leq f (n) \leq c_{2} g (n)$ . By saying a quantity is $~ Θ (f (d))$ , we mean that this is true up to a factor that is a polynomial of $log d$ — i.e., that it is asymptotically between $f (d) / polylog (d)$ and $f (d) polylog (d)$ .

1 The Universal AND

We introduce a simple and central component in our framework, which we call the Universal AND component or U-AND for short. We start by introducing the most basic version of the problem this component solves. We then provide our solution to the simplest version of this problem. We later discuss a few generalizations: to inputs which store features in superposition, and to higher numbers of inputs to each AND gate. More elaboration on U-AND — in particular, addressing why we think it’s a good question to ask — is provided in Section 2.

1.1 The U-AND task

The basic boolean Universal AND problem: Given an input vector which stores an orthogonal set of boolean features, compute a vector from which can be linearly read off the value of every pairwise AND of input features, up to a small error. You are allowed to use only a single-layer MLP and the challenge is to make this MLP as narrow as possible.

More precisely: Fix a small parameter $ϵ > 0$ and let $d_{0}$ and $ℓ$ be integers with $d_{0} \geq ℓ$ ^[7]. Let ${\to e}_{1}, \dots, {\to e}_{d_{0}}$ be the standard basis in $R^{d_{0}}$ , i.e. ${\to e}_{i}$ is the vector whose $i$ th component is $1$ and whose other components are $0$ . Inputs are all at most $ℓ$ -composite vectors, i.e., for each index set $I \subseteq [d]$ with $| I | \leq ℓ$ , we have the input ${\to x}_{I} = \sum_{i \in I} {\to e}_{i} \in R^{d_{0}}$ . So, our inputs are in bijection with binary strings that contain at most $ℓ$ ones^[8]. Our task is to compute all $(\frac{d_{0}}{2})$ pairwise ANDs of these input bits, where the notion of ‘computing’ a property is that of making it linearly represented in the output activation vector $\to a (\to x) \in R^{d}$ . That is, for each pair of inputs $i, j$ , there should be a linear function $r_{i, j} : R^{d} \to R$ , or more concretely, a vector ${\to r}_{i, j} \in R^{d}$ , such that ${\to r}_{i, j}^{T} \to a (x) \approx_{ϵ} {AND}_{i, j} (x)$ . Here, the $\approx_{ϵ}$ indicates equality up to an additive error $ϵ$ and ${AND}_{i, j}$ is $1$ iff both bits $i$ and $j$ of $x$ are 1. We will drop the subscript $ϵ$ going forward.

We will provide a construction that computes these $Θ (d_{0}^{2})$ features with a single $d$ -neuron ReLU layer, i.e., a $d_{0} \times d$ matrix $W$ and a vector $\to b \in R^{d}$ such that $\to a (x) = ReLU (W \to x + \to b)$ , with $d ≪ d_{0}$ . Stacking the readoff vectors ${\to r}_{i, j}$ we provide as the rows of a readout matrix $R$ , you can also see us as providing a parameter setting solving $- --- \to ANDs (\to x) \approx_{ϵ} R (ReLU (W \to x + \to b))$ , where $- --- \to ANDs (\to x)$ denotes the vector of all $(\frac{d_{0}}{2})$ pairwise ANDs. But we’d like to stress that we don’t claim there is ever something like this large, size $(\frac{d_{0}}{2})$ , layer present in any practical neural net we are trying to model. Instead, these features would be read in by another future model component, like how the components we present below (in particular, our U-AND construction with inputs in superposition and our QK circuit) do.

There is another kind of notion of a set of features having been computed, perhaps one that’s more native to the superposition picture: that of the activation vector (approximately) being a linear combination of f-vectors — we call these vectors f-vectors— corresponding to these properties, with coefficients that are functions of the values of the features. We can also consider a version of the U-AND problem that asks for output vectors which represent the set of all pairwise ANDs in this sense, maybe with the additional requirement that the f-vectors be almost orthogonal. Our U-AND construction solves this problem, too — it computes all pairwise ANDs in both senses. See the appendix for a discussion of some aspects of how the linear readoff notion of stuff having been computed, the linear combination notion of something having been computed, and almost orthogonality hang together.

1.2 The U-AND construction

We now present a solution to the U-AND task, computing $(\frac{d_{0}}{2})$ new features with an MLP width that can be much smaller than $(\frac{d_{0}}{2})$ . We will go on to show how our solution can be tweaked to compute ANDs of more than 2 features at a time, and to compute ANDs of features which are stored in superposition in the inputs.

To solve the base problem, we present a random construction: $W$ (with shape $d_{0} \times d$ ) has entries that are iid random variables which are $1$ with probability $p (d) ≪ 1$ , and each entry in the bias vector is $- 1$ . We will pin down what $p$ should be later.

We will denote by $S_{i}$ the set of neurons that are ‘connected’ to the $i$ th input, in the sense that elements of the set are neurons for which the $i$ th entry of the row of the weight vector that connects to that neuron is $1$ . $_{i}$ is used to denote the indicator set of $S_{i}$ : the vector which is $1$ for every neuron in $S_{i}$ and $0$ otherwise. So $_{i}$ is also the $i$ th column of $W$ .

Then we claim that for this choice of weight matrix, all the ANDs are approximately linearly represented in the MLP activation space with readoff vectors (and feature vectors, in the sense of Appendix B) given by

$v_{(x_{i} \land x_{j})} = v_{i j} = \frac{- ---- \to S_{i} \cap S_{j}}{| S_{i} \cap S_{j} |}$

for all $i, j$ , where we continue our abuse of notation to write $S_{i} \cap S_{j}$ as shorthand for the vector which is an indicator for the intersection set, and $| S_{i} \cap S_{j} |$ is the size of the set.

We preface our explanation of why this works with a technical note. We are going to choose $d$ and $p$ (as functions of $d_{0}$ ) so that with high probability, all sets we talk about have size close to their expectation. To do this formally, one first shows that the probability of each individual set having size far from its expectation is smaller than any $1 / poly (d_{0})$ using the Chernoff bound (Theorem 4 here), and one follows this by a union bound over all only $poly (d_{0})$ sets to say that with probability $1 - o (1)$ , none of these events happen. For instance, if a set $S_{i} \cap S_{j}$ has expected size ${log}^{4} d_{0}$ , then the probability its size is outside of the range ${log}^{4} d_{0} \pm {log}^{3} d_{0}$ is at most $2 e^{- μ δ^{2} / 3} = 2 e^{- {log}^{2} d_{0}} = 2 e d_{0}^{- log d_{0}}$ (following these notes, we let $μ$ denote the expectation and $δ$ denote the number of $μ$ -sized deviations from the expectation — this bound works for $δ < 1$ which is the case here). Technically, before each construction to follow, we should list our parameters $d, p$ and all the sets we care about (for this first construction, these are the double and triple intersections between the $S_{i}$ ) and then argue as described above that with high probability, they all have sizes that only deviate by a factor of $1 + o (1)$ from their expected size and always carry these error terms around in everything we say, but we will omit all this in the rest of the U-AND section.

So, ignoring this technicality, let’s argue that the construction above indeed solves the U-AND problem (with high probability). First, note that $| S_{i} \cap S_{j} | \sim Bin (d, p^{2})$ . We require that $p$ is big enough to ensure that all intersection sets are non-empty with high probability, but subject to that constraint we probably want $p$ to be as small as possible to minimise interference^[9]. We’ll choose $p = {log}^{2} d_{0} / \sqrt{d}$ , such that the intersection sets have size $| S_{i} \cap S_{j} | \approx {log}^{4} d_{0}$ . We split the check that the readoff works out into a few cases:

Firstly, if input features $i$ , $j$ , and at most $ℓ - 2$ other input features are present (recall that we are working with $ℓ$ -composite inputs), then letting $\to a$ denote the post-ReLU activation vector, we have ${\to f}_{{AND}_{i j}} \cdot \to a = 1$ plus an error that is at most $ℓ$ times [the sum of sizes of triple intersections involving $i, j$ and each of the $k - 2$ other features which are on, divided by the size of the $S_{i} \cap S_{j}$ ]. This is very likely less than $O (1 / {log}^{2} d_{0})$ for all polynomially many pairs and sets of $ℓ - 2$ other inputs at once^[10], at least assuming $d = ω ({log}^{8} d_{0})$ . The expected value of this error is ${log}^{2} d_{0} / \sqrt{d}$ .
Secondly, if only one of $i, j$ is present together with some at most $ℓ - 1$ other features, then we get nonzero terms in the sum that expanding the dot product ${\to f}_{{AND}_{i j}} \cdot \to a$ precisely for neurons in a triple intersection of $i, j$ , and one of the $ℓ - 1$ other features, so the readoff $\approx 0$ — more precisely, $O (1 / {log}^{2} d_{0})$ (again, assuming $d = ω ({log}^{8} d_{0})$ ), and $\frac{{log}^{2} d_{0}}{\sqrt{d}}$ in expectation).
Finally, if neither of $i, j$ is present, then the error corresponds to quadruple intersections, so it is even more likely at most $O (1 / {log}^{2} d_{0})$ (still assuming $d = ω ({log}^{8} d_{0})$ ), and $\frac{{log}^{4} d_{0}}{d}$ in expectation.

So we see that this readoff is indeed the AND of $i$ and $j$ up to error $ϵ = O (1 / {log}^{2} d_{0})$ .

To finish, we note without much proof that everything is also computed in the sense that ‘the activation vector is a linear combination of almost orthogonal features’ (defined in Appendix B). The activation vector being an approximate linear combination of pairwise intersection indicator vectors with coefficients being given by the ANDs follows from triple intersections being small, as does the almost-orthogonality of these feature vectors.

U-AND allows for arbitrary XORs to be efficiently calculated

A consequence of the precise (up to $ϵ$ ) nature of our universal AND is the existence of a universal XOR, in the sense of every XOR of features being computed. In this post by Sam Marks, it is tentatively observed that real-life transformers linearly compute XOR of arbitrary features in the weak sense of being able to read off tokens where XOR of two tokens is true using a linear probe (not necessarily with $ϵ$ accuracy). This weak readoff behavior for AND would be unsurprising, as the residual stream already has this property (using the readoff vector ${\to f}_{i} + {\to f}_{j}$ which has maximal value if and only if $f_{i}$ and $f_{j}$ are both present). However, as Sam Marks observes, it is not possible to read off XOR in this weak way from the residual stream. We can however see that such a universal XOR (indeed, in the strong sense of $ϵ$ -accuracy) can be constructed from our strong (i.e., $ϵ$ -accurate) universal AND. To do so, assume that in addition to the residual stream containing feature vectors ${\to f}_{i}$ and ${\to f}_{j}$ , we’ve also already almost orthogonally computed universal AND features ${\to f}_{{AND}_{i, j}}$ into the residual stream. Then we can weakly (and in fact, $ϵ$ -accurately) read off XOR from this space by taking the dot product with the vector ${\to f}_{{XOR}_{i, j}} := {\to f}_{i} + {\to f}_{j} - 2 {\to f}_{{AND}_{i, j}}$ . Then we see that if we had started with the two-hot pair ${\to f}_{i^{'}} + {\to f}_{j^{'}},$ the result of this readoff will be, up to a small error $O (ϵ),$

$⎧ ⎨ ⎩ \begin{matrix} 0 = 0 - 0, & | {i, j} \cap {i^{'}, j^{'}} | = 0 & (neither coefficient agrees) 1 = 1 - 0, & | {i, j} \cap {i^{'}, j^{'}} | = 1 & (one coefficient agrees) 0 = 2 - 2, & {i, j} = {i^{'}, j^{'}} & (both coefficients agree) \end{matrix}$

This gives a theoretical feasibility proof of an efficiently computable universal XOR circuit, something Sam Marks believed to be impossible.

1.3 Handling inputs in superposition: sparse boolean computers

Any boolean circuit can be written as a sequence of layers executing pairwise ANDs and XORs^[11] on the binary entries of a memory vector. Since our U-AND can be used to compute any pairwise ANDs or XORs of features, this suggests that we might be able to emulate any boolean circuit by applying something like U-AND repeatedly. However, since the outputs of U-AND store features in superposition, if we want to pass these outputs as inputs to a subsequent U-AND circuit, we need to work out the details of a U-AND construction that can take in features in superposition. In this section we explore the subtleties of modifying U-AND in this way. In so doing, we construct an example of a circuit which acts entirely in superposition from start to finish — nowhere in the construction are there as many dimensions as features! We consider this to be an interesting result in its own right.

U-ANDs ability to compute many boolean functions of inputs features stored in superposition provides an efficient way to use all the parameters of the neural net to compute (up to a small error) a boolean circuit with a memory vector that is wider than the layers of the NN^[12]. We call this emulating a ‘boolean computer’. However, three limitations prevent any boolean circuit from being computed:

An injudicious choice of a layer executing XORs applied to a sparse input can fail to give a sparse output vector. Since U-AND only works on inputs with sparse features, this means that we can only emulate circuits with the property than on sparse inputs, their memory vector is sparse throughout the computation. We call these circuits ‘sparse boolean circuits’.
Even if the outputs of the circuit remain sparse at every layer, the $ϵ$ errors involved in the boolean read-offs compound from layer to layer. We hope that it is possible to manage this interference (perhaps via subtle modifications to the constructions) enough to allow multiple steps of sequential computation, although we leave an exploration of error propagation to future work.
We can’t compute an unbounded number of new features with a finite-dimensional hidden layer. As we will see in this section, when input features are stored in superposition (which is true for outputs of U-AND and therefore certainly true for all but possibly the first layer of an emulated boolean circuit), we cannot compute more than $~ Θ (d_{0} d)$ (number of parameters in the layer) many new boolean functions at a time.

Therefore, the boolean circuits we expect can be emulated in superposition (1) are sparse circuits (2) have few layers (3) have memory vectors which are not larger than the square of the activation space dimension.

Construction details for inputs in superposition

Now we generalize U-AND to the case where input features can be in superposition. With f-vectors ${\to f}_{1}, \dots, {\to f}_{m} \in R^{d_{0}}$ , we give each feature a random set of neurons to map to, as before. After coming up with such an assignment, we set the $i$ th row of $W$ to be the sum of the f-vectors for features which map to the $i$ th neuron. In other words, let $F$ be the $m \times d_{0}$ matrix with $i$ th row given by the components of ${\to f}_{i}$ in the neuron basis:

$F = ⎛ ⎜ ⎜ ⎜ ⎝ \begin{matrix} {\to f}_{1} \to ⋮ {\to f}_{m} \to \end{matrix} ⎞ ⎟ ⎟ ⎟ ⎠$

Now let \hat{W} be a sparse matrix (with shape $d \times m$ ) with entries that are iid Bernoulli random variables which are $1$ with probability $p (d) ≪ 1$ . Then:

$W =^W F$

Unfortunately, since the ${\to f}_{1}, \dots, {\to f}_{m}$ are random vectors, their inner product will have a typical size of $1 / \sqrt{d_{0}}$ . So, on an input which has no features connected to neuron $i$ , the preactivation for that neuron will not be zero: it will be a sum of these interference terms, one for each feature that is connected to the neuron. Since the interference terms are uncorrelated and mean zero, they start to cause neurons to fire incorrectly when $Θ (d_{0})$ neurons are connected to each neuron. Since each feature is connected to each neuron with probability $p = \frac{{log}^{2} d_{0}}{\sqrt{d}})$ this means neurons start to misfire when $m = ~ Θ (d_{0} \sqrt{d})$ ^[13]. At this point, the number of pairwise ANDs we have computed is $(\frac{m}{2}) = ~ Θ (d_{0}^{2} d)$ .

This is a problem, if we want to be able to do computation on input vectors storing potentially exponentially many features in superposition, or even if we want to be able to do any sequential boolean computation at all:

Consider an MLP with several layers, all of width $d_{MLP}$ , and assume that each layer is doing a U-AND on the features of the previous layer. Then if the features start without superposition, there are initially $d_{MLP}$ features. After the first U-AND, we have $Θ (d_{MLP}^{2})$ new features, which is already too many to do a second U-AND on these features!

Therefore, we will have to modify our goal when features are in superposition. That said, we’re not completely sure there isn’t any modification of the construction that bypasses such small polynomial bounds. But e.g. one can’t just naively make $^W$ sparser — $p$ can’t be taken below $d^{- 1 / 2}$ without the intersection sets like $| S_{i} \cap S_{j} |$ becoming empty. When features were not stored in superposition, solving U-AND corresponded to computing $d_{0}^{2}$ many new features. Instead of trying to compute all pairwise ANDs of all (potentially exponentially many) input features in superposition, perhaps we should try to compute a reasonably sized subset of these ANDs. In the next section we do just that.

A construction which computes a subset of ANDs of inputs in superposition

Here, we give a way to compute ANDs of up to $d_{0} d$ particular feature pairs (rather than all $(\frac{m}{2})$ ANDs) that works even for $m$ that is superpolynomial in $d_{0}$ ^[14]. (We’ll be ignoring $log$ factors in much of what follows.)

In U-AND, we take $^W$ to be a random matrix with iid ⁰⁄₁ entries with probability $p = \frac{{log}^{2} d_{0}}{\sqrt{d}}$ . If we only need/want to compute a subset of all the pairwise ANDs — let $E$ be this set of all pairs of inputs ${i, j}$ for which we want to compute the AND of $i$ and $j$ — then whenever ${i, j} \in E$ , we might want each pair of corresponding entries in the corresponding columns $i$ and $j$ of the adjacency matrix $^W$ , i.e., each pair ${(^W)}_{k i}$ , ${(^W)}_{k j}$ to be a bit more correlated than an analogous pair in column $i^{'}$ and $j^{'}$ with ${i^{'}, j^{'}} \notin E$ . Or more precisely, we want to make such pairs of columns ${i, j}$ have a surprisingly large intersection for the general density of the matrix — this is to make sure that we get some neurons which we can use to read off the AND of ${i, j}$ , while choosing the general density in $^W$ to be low enough that we don’t cross the density threshold at which a neuron needs to care about too many input features.

One way to do this is to pick a uniformly random set of ${log}^{4} d_{0}$ neurons for each ${i, j} \in E$ , and to set the column of $^W$ corresponding to input $i$ to be the indicator vector of the union of these sets (i.e., just those assigned to gates involving $i$ ). This way, we can compute up to around $| E | = ~ Θ (d_{0} d)$ pairwise ANDs without having any neuron care about more than $d_{0}$ input features, which is the requirement from the previous section to prevent neurons misfiring when input f-vectors are random vectors in superposition with typical interference size $Θ (1 / \sqrt{d_{0}})$ .

1.4 ANDs with many inputs: computation of small boolean circuits in a single layer

It is known that any boolean circuit with $k$ inputs can be written as a linear combination (with possibly exponential in $k$ terms, which is a substantial caveat) ANDs with up to $k$ inputs (fan-in up to $k$ )^[15]. This means that, if we can compute not just pairwise ANDs, but ANDs of all fan-ins up to $k$ , then we can write down a ‘universal’ computation that computes (simultaneously, in a linearly-readable sense) all possible circuits that depend on some up to $k$ inputs.

The U-AND construction for higher fan-in

We will modify the standard, non-superpositional U-AND construction to allow us to compute all ANDs of a specific fan-in $k$ .

We’ll need two modifications:

We’re now interested in $k$ -wise intersections between the $S_{i}$ . The size of these intersections is smaller than double intersections, so we need to increase $p$ to guarantee they are nonempty. A sensible choice for fan-in $k$ is $p = \frac{{log}^{2} d_{0}}{d^{1 / k}}$ .
We only want neurons to fire when $k$ of the features that connect to them are present at the same time, so we require the bias to be $- k + 1$ .

Now we read off the AND of a set $I$ of input features along the vector $⋂_{i \in I} S_{i}$ .

We can straightforwardly simultaneously compute all ANDs of fan-ins ranging from $2$ to $k$ by just evenly partitioning the $d$ neurons into $k - 1$ groups — let’s label these $2, 3, \dots, k$ — and setting the weights into group $i$ and the biases of group $i$ as in the fan-in $i$ U-AND construction.

A clever choice of density can give us all the fan-ins at once

Actually, we can calculate all ANDs of up to some constant fan-in $k$ in a way that feels more symmetric than the option involving a partition above^[16] by reusing the fan-in $2$ U-AND with (let’s say) $d = d_{0}$ and a careful choice of $p = \frac{1}{{log}^{2} d_{0}}$ . This choice of $p$ is larger than $\frac{{log}^{2} d_{0}}{d^{1 / k}}$ for any $k$ , ensuring that every intersection set is non-empty. Then, one can read off ${AND}_{i, j}$ from $S_{i} \cap S_{j}$ as usual, but one can also read off ${AND}_{i, j, k}$ with the composite vector

$- \frac{S_{i} \cap S_{j} \cap S_{k}}{| S_{i} \cap S_{j} \cap S_{k} |} + \frac{S_{i} \cap S_{j}}{| S_{i} \cap S_{j} |} + \frac{S_{i} \cap S_{k}}{| S_{i} \cap S_{k} |} + \frac{S_{j} \cap S_{k}}{| S_{j} \cap S_{k} |}$ In general, one can read off the AND of an index set $I$ with the vector $\sum I^{'} \subseteq I s.t. | I^{'} | \geq 2 (- 1)^{| I | - | I^{'} | + 1} v_{I^{'}}$ where $v_{I^{'}} = \frac{⋂_{i \in I^{'}} S_{i}}{∣ ∣ ⋂_{i \in I^{'}} S_{i} ∣ ∣}$ One can show that this inclusion-exclusion style formula works by noting that if the subset of indices of $I$ which are on is $J$ , then the readoff will be approximately $\sum I^{'} \subseteq I s.t. | I^{'} | \geq 2 (- 1)^{| I | - | I^{'} | + 1} max (0, | I^{'} \cap J | - 1)$ . We’ll leave it as an exercise to show that this is $0$ if $J \neq I$ and $1$ if $J = I$ .

Extending the targeted superpositional AND to other fan-ins

It is also fairly straightforward to extend the construction for a subset of ANDs when inputs are in superposition to other fan-ins, doing all fan-ins on a common set of neurons. Instead of picking a set for each pair that we need to AND as above, we now pick a set for each larger AND gate that we care about. As in the previous sparse U-AND, each input feature gets sent to the union of the sets for its gates, but this time, we make the weights depend on the fan-in. Letting $K$ denote the max fan-in over all gates, for a fan-in $k$ gate, we set the weight from each input to $K / k$ , and set the bias to $- K + 1$ . This way, still with at most about $~ Θ (d^{2})$ gates, and at least assuming inputs have at most some constant number of features active, we can read the output of a gate off with the indicator vector of its set.

1.5 Improved Efficiency with a Quadratic Nonlinearity

It turns out that, if we use quadratic activation functions $x \mapsto x^{2}$ instead of ReLU’s $x \mapsto ReLU (x),$ we can write down a much more efficient universal AND construction. Indeed, the ReLU universal AND we constructed can compute the universal AND of up to $~ Θ (d^{3 / 2})$ features in a $d$ -dimensional residual stream. However, in this section we will show that with a quadratic activation, for $ℓ$ -composite vectors, we can compute all pairwise ANDs of up to $m = Ω (exp (\frac{1}{2 ℓ} ϵ^{2} \sqrt{d}))$ ^[17] features stored in superposition (this is exponential in $\sqrt{d}$ , so superpolynomial in $d$ (!)) that admit a single-layer universal AND circuit.

The idea of the construction is that, on the large space of features $R^{m},$ the AND of the boolean-valued feature variables $f_{i}, f_{j}$ can be written as a quadratic function $q_{i, j} : {0, 1}^{m} \mapsto {0, 1}$ ; explicitly, $q_{i, j} (f_{1}, \dots, f_{m}) = f_{i} \cdot f_{j} .$ Now if we embed feature space $R^{m}$ onto a smaller $R^{r}$ in an $ϵ$ -almost-orthogonal way, it is possible to show that the quadratic function $q_{i, j}$ on $R^{m}$ is well-approximated on sparse vectors by a quadratic function on $R^{r}$ (with error bounded above by $2 ϵ$ on 2-sparse inputs in particular). Now the advantage of using quadratic functions is that any quadratic function on $R^{r}$ can be expressed as a linear read-off of a special quadratic function $Q : R^{r} \to R^{r^{2}}$ given by the composition of a linear function $R^{r} \to R^{r^{2}}$ and a quadratic element-wise activation function on $R^{r^{2}}$ which creates a set of neurons which collectively form a basis for all quadratic functions. Now we can set $d = r^{2}$ to be the dimension of the residual stream and work with an $r$ -dimensional subspace $V$ of the residual stream, taking the almost-orthogonal embedding $R^{m} \to V$ . Then the map $V Q \to R^{d}$ provides the requisite universal AND construction. We make this recipe precise in the following section

Construction Details

In this section we use slightly different notation to the rest of the post, dropping overarrows for vectors, and we drop the distinction between features and f-vectors.

Let $V = R^{r}$ be as above. There is a finite-dimensional space of quadratic functions on $R^{r}$ , with basis $q_{i j} = x_{i} x_{j}$ of size $r^{2}$ (such that we can write every quadratic function as a linear combination of these basis functions); alternatively, we can write $q_{i j} (v) = (v \cdot e_{i}) (v \cdot e_{j}),$ for $e_{i}, e_{j}$ the basis vectors. We note that this space is spanned by a set of functions which are squares of linear functions of ${x_{i}}$ :

$\begin{matrix} L_{i}^{(1)} (x_{1}, \dots, x_{r}) & = x_{i} L_{i, j}^{(2)} (x_{1}, \dots, x_{r}) & = x_{i} + x_{j} L_{i, j}^{(3)} (x_{1}, \dots, x_{r}) & = x_{i} - x_{j} \end{matrix}$

The squares of these functions are a valid basis for the space of quadratic functions on $R^{r}$ since $q_{i i} = (L_{i}^{(1)})^{2}$ and for $i \neq j,$ we have $q_{i j} = \frac{(L_{i, j}^{(2)})^{2} - (L_{i, j}^{(3)})^{2}}{4} .$ There are $m$ distinct functions of type $(1)$ , and $(\frac{m}{2})$ functions each of type $(2)$ and $(3)$ , for a total of $r^{2}$ basis functions as before. Thus there exists a single-layer quadratic-activation neural net $Q : x \mapsto y$ from $R^{r} \to R^{r^{2}}$ such that any quadratic function on $R^{r}$ is realizable as a “linear read-off”, i.e., given by composing $Q$ with a linear function $R^{r^{2}} \to R .$ In particular, we have linear “read-off” functions $Λ_{i j} : R^{r^{2}} \to R$ such that $L_{i j} (Q (x)) = q_{i j} (x) .$

Now suppose that $f_{1}, \dots, f_{m}$ is a collection of f-vectors which are $ϵ$ -almost-orthogonal, i.e., such that $| f_{i} | = 1$ for any $i$ and $| f_{i} \cdot f_{j} | < ϵ \forall i < j \leq m .$ Note that (for fixed $ϵ < 1$ ), there exist such collections with exponential (in $r$ ) number of vectors $m$ . We can define a new collection of symmetric bilinear functions (i.e., functions in two vectors $v, w \in R^{n}$ which are linear in each input independently and symmetric to switching $v, w$ ), $ϕ_{i, j},$ for a pair of (not necessarily distinct) indices $0 < i \leq j \leq m,$ defined by $ϕ_{i, j} (v) = (v \cdot f_{i}) (v \cdot f_{j})$ (this is a product of two linear functions, hence quadratic). We will use the following result:

Proposition 1 Suppose $ϕ_{i, j}$ is as above and $0 < i^{'} \leq j^{'} < m$ is another pair of (not necessarily distinct) indices associated to feature vectors $v_{i}, v_{j} .$ Then

$ϕ_{i, j} (v_{i^{'}}, v_{j^{'}}) ⎧ ⎨ ⎩ \begin{matrix} = 1, & i = i^{'} and j = j^{'} \in (- ϵ, ϵ), & (i, j) \neq (i^{'}, j^{'}) \in (- ϵ^{2}, ϵ^{2}), & {i, j} \cap {i^{'}, j^{'}} = \emptyset (i.e., no indices in common) \end{matrix}$

This proposition follows immediately from the definition of $ϕ_{k, ℓ}$ and the almost orthogonality property. $□$

Now define the single-valued quadratic function $ϕ_{i, j}^{single} (v) := \frac{1}{2} ϕ_{i, j} (v, v),$ by applying the bilinear form to two copies of the same vector and dividing by 2. Then the proposition above implies that, for two pairs of distinct indices $0 < i < j \leq m$ and $0 < i^{'} < j^{'} \leq m$ we have the following behavior on the sum of two features (the superpositional analog of a two-hot vector):

$ϕ_{i, j}^{single} (v_{i^{'}} + v_{j^{'}}) = \frac{ϕ_{i, j} (v_{i^{'}}, v_{i^{'}}) + 2 ϕ_{i, j} (v_{i^{'}}, v_{j^{'}}) + ϕ_{i, j} (v_{j^{'}}, v_{j^{'}})}{2} = ϕ_{i, j} (v_{i^{'}}, v_{j^{'}}) + O (ϵ) .$

The first formula follows from bilinearity (which is equivalent to the statement that the two entries in $ϕ_{i, j}$ behave distributively) and the last formula follows from the proposition since we assumed $(i, j)$ are distinct indices, hence cannot match up with a pair of identical indices $(i^{'}, i^{'})$ or $(j^{'}, j^{'}) .$ Moreover, $O (ϵ)$ term in the formula above is bounded in absolute value by $\frac{2 ϵ}{2} = ϵ$ .

Combining this formula with Proposition 1, we deduce:

Proposition 2

$ϕ_{i, j}^{single} (v_{i^{'}} + v_{j^{'}}) = ⎧ ⎨ ⎩ \begin{matrix} 1 + O (ϵ), & i = i^{'} and j = j^{'} O (ϵ), & (i, j) \neq (i^{'}, j^{'}) O (ϵ^{2}), & i \neq i^{'} . \end{matrix}$

Moreover, by the triangle inequality, the linear constants inherent in the $O (. . .)$ notation are $\leq 2.$ $□$

Corollary $ϕ_{i, j} (v_{i^{'}} + v_{j^{'}}) = δ_{(i, j), (i^{'}, j^{'})} + O (ϵ),$ where the $δ$ notation returns $1$ when the two pairs of indices are equal and $0$ otherwise.

We can now write down the universal AND function by setting $d = r^{2}$ above. Assume we have $m < exp (\frac{ϵ^{2}}{2} r) .$ This guarantees (with probability approaching $1$ ) that $m$ random vectors in $V ≅ R^{r}$ are ( $ϵ$ -)almost orthogonal, i.e., have dot products $< ϵ .$ We assume the vectors $v_{1}, \dots, v_{m}$ are initially embedded in $V \subset R^{d} .$ (Note that we can instead assume they were initially randomly embedded in $R^{d},$ then re-embedded in $R^{r}$ by applying a random projection and rescaling appropriately.) Let $Q : R^{r} \to R^{d = r^{2}}$ be the universal quadratic map as above; we let $q_{i j} : R^{d} \to R$ be the quadratic functions as above. Now we claim that $Q$ is a universal AND with respect to the feature vectors $v_{1}, \dots, v_{N} .$ Note that, since the function $ϕ_{i, j}^{single} (v)$ is quadratic on $R^{r},$ it can be factorized as $ϕ_{i, j}^{single} (x) = Φ_{i, j} (Q (x)),$ for $Φ_{i, j}$ some linear function on $R^{r^{2}}$ ^[18]. We now see that the linear maps $Φ_{i, j}$ are valid linear read-offs for ANDs of features: indeed,

$Φ_{i, j} (Q (v_{i^{'}} + v_{j^{'}})) = ϕ_{i, j}^{single} (v_{i^{'}}, v_{j^{'}}) = δ_{(i, j), (i^{'}, j^{'})} + O (ϵ) = AND (b_{i}^{i^{'}, j^{'}}, b_{j}^{i^{'}, j^{'}}),$

where $b^{i^{'}, j^{'}}$ is the two-hot boolean indicator vector with $1$ s in positions $i^{'}$ and $j^{'}$ . Thus the AND of any two indices $i, j$ can be computed via the readout linear function $Φ_{i, j}$ on any two-hot input $b^{i^{'}, j^{'}}$ . Moreover, applying the same argument to a larger sparse sum gives $Φ_{i, j} (Q (\sum_{k = 1}^{m} b_{k} v_{k})) = AND (b_{i}, b_{j}) + O (s^{2} ϵ),$ where $s = \sum_{k = 1}^{m} b_{k}$ is the sparsity^[19].

Scaling and comparison with ReLU activations

It is surprising that the universal AND circuit we wrote down for quadratic activations is so much more expressive than the one we have for ReLU activations, since the conventional wisdom for neural nets is that the expressivity of different (suitably smooth) activation functions does not increase significantly when we replace arbitrary activations by quadratic ones. We do not know if this is a genuine advantage of quadratic activations over others (and indeed might be implemented in transformers in some sophisticated way involving attention nonlinearities), or whether there is some yet-unknown reason that (perhaps assuming nice properties of our features), ReLU’s can give more expressive universal AND circuits than we have been able to find in the present work. We list this discrepancy as an interesting open problem that follows from our work.

Generalizations

Note that the nonlinear function $Q$ above lets us read off not only the AND of two sparse boolean vectors, but more generally the sum of products of coordinates of any sufficiently sparse linear combination of feature vectors $v_{i}$ (not necessarily boolean). More generally, if we replace quadratic activations with cubic or higher, we can get cubic expressions, such as the sum of triple ANDs (or, more generally, products of triples of coordinates). A similar effect can be obtained by chaining $l$ sequential levels of quadratic activations to get polynomial nonlinearities with exponent $e = 2^{l}$ . Then so long as we can fit $O (r^{e})$ ^[20] features in the residual stream in an almost-orthogonal way (corresponding to a basis of monomials of degree $d$ on $r$ -dimensional space), we can compute sums of any degree- $e$ monomial over features, and thus any boolean circuit of degree $e$ , up to $O (ϵ),$ where the linear constant implicit in the $O$ depends on the exponent $e$ . This implies that for any value $e,$ there is a dimension $d$ universal nonlinear map $R^{d} \to R^{d}$ with $⌈ {log}_{2} (e) ⌉$ quadratic activations such that any sparse boolean circuit involving $\leq e$ elements is linearly represented (via an appropriate readoff vector). Moreover, keeping $e$ fixed, $d$ grows only as $O (log (n))^{e} .$ However, the constant associated with the big-O notation might grow quite quickly as the exponent $e$ increases. It would be interesting to analyse this scaling behavior more carefully, but that is outside the scope of the present work.

1.6 Universal Keys: an application of parallel boolean computation

So far, we have used our universal boolean computation picture to show that superpositional computation in a fully-connected neural network can be more efficient (specifically, compute roughly as many logical gates as there are parameters rather than non-superpositional implementations, which are bounded by number of neurons). This does not fully use the universality of our constructions: i.e., we must at every step read a polynomial (at most quadratic) number of features from a vector which can (in either the fan-in-k or quadratic-activation contexts) compute a superpolynomial number of boolean circuits. At the same time, there is a context in transformers where precisely this universality can give a remarkable (specifically, superpolynomial in certain asymptotics) efficiency improvement. Namely, recall that the attention mechanism of a transformer can be understood as a way for the last-token residual stream to read information from past tokens which pass a certain test associated to the query-key component. In our simplified boolean model, we can conceptualize this as follows:

Each token possesses a collection of “key features” which indicate bits of information about contexts where reading information from this token is useful. These can include properties of grammar, logic, mood, or context (food, politics, cats, etc.)
The current token attends to past tokens whose key features have a certain combination of features, which we conceptualize as tokens on whose features a certain boolean “relevance” function, $g_{last token}$ returns $1$ . For example, the current token may ‘want’ to attend to all keys which have feature 1 and feature 4 but not feature 9, or exactly one of feature 2 and feature 8. This corresponds to the boolean function $g = (f_{1} \land f_{4} \land \neg f_{9}) \lor (f_{2} \otimes f_{8})$ . Importantly, the choice of $g$ varies from token to token. We abstract away the question of generating this relevance function as some (possibly complicated) nonlinear computation implemented in previous layers.
Each past token generates a key vector in a certain vector space (associated with an attention head) which is some (possibly nonlinear) function of the key features; the last token then generates a query vector which functions as a linear read-off, and should return a high value on past tokens for which the relevance formula evaluates to True. Note that the key vector is generated before the query vector, and before the choice of which $g$ to use is made.

Importantly, there is an information asymmetry between the “past” tokens (which contribute the key) and the last token that implements the linear read-off via query: in generating the boolean relevance function, the past token can use information that is not accessible to the token generating the key (as it is in its “future” – this is captured e.g. by the attention mask). One might previously have assumed that in generating a key vector, tokens need to “guess” which specific combinations of key features may be relevant to future tokens, and separately generate some read-off for each; this limits the possible expressivity of choosing the relevance function g to a small (e.g. linear in parameter number) number of possibilities.

However, our discovery of circuits that implement universal calculation suggests a surprising way to resolve this information asymmetry: namely, using a universal calculation, the key can simultaneously compute, in an approximately linearly-readable way, ALL possible simple circuits of up to $O log (d_{resid})$ inputs. This increases the number of possibilities of the relevance function $g$ to allow all such simple circuits; this can be significantly larger than the number of parameters and asymptotically (for logarithmic fan-ins) will in fact be superpolynomial^[21]. As far as we are aware, this presents a qualitative (from a complexity-theoretic point of view) update to the expressivity of the attention mechanism compared to what was known before.

Sam Marks’ discovery of the universal XOR was done in this context: he observed using a probe that it is possible for the last token of a transformer to attend to past tokens that return True as the XOR of an arbitrary pair of features, something that he originally believed was computationally infeasible.

We speculate that this will be noticeable in real-life transformers, and can partially explain the observation that transformers tend to implement more superposition than fully-connected neural networks.

2 U-AND: discussion

We discuss some conceptual matters broadly having to do with whether the formal setup from the previous section captures questions of practical interest. Each of these subsections is standalone, and you needn’t read any to read Section 3.

Aren’t the ANDs already kinda linearly represented in the U-AND input?

This subsection refers to the basic U-AND construction from Section 1.1, with inputs not in superposition, but the objection we consider here could also be raised against other U-AND variants. The objection is this: aren’t ANDs already linearly present in the input, so in what sense have we computed them with the U-AND? Indeed, if we take the dot product of a particular 2-hot input with $({\to e}_{i} + {\to e}_{j}) / 2$ , we get 0 if neither the $i$ th nor the $j$ th features are present, $1 / 2$ if 1 of them is present, and 1 if they are both present. If we add a bias of $- 1 / 4$ , then without any nonlinearity at all, we get a way to read off pairwise U-AND for $ϵ = 1 / 4$ . The only thing the nonlinearity lets us do is to reduce this “interference” $ϵ = 1 / 4$ to a smaller $ϵ .$ Why is this important?

In fact, one can show that you can’t get more accurate than $ϵ = 1 / 4$ without a nonlinearity, even with a bias, and $ϵ = 1 / 4$ is not good enough for any interesting boolean circuit. Here’s an example to illustrate the point:

Suppose that I am interested in the variable $z = \land (x_{i}, x_{j}) + \land (x_{k}, x_{l})$ . $z$ takes on a value in ${0, 1, 2}$ depending on whether both, one, or neither of the ANDs are on. The best linear approximation to $z$ is $1 / 2 (x_{i} + x_{j} + x_{k} + x_{l} - 1)$ , which has completely lost the structure of $z$ . In this case, we have lost any information about which way the 4 variables were paired up in the ANDs.

In general, computing a boolean expression with $k$ terms without the signal being drowned out by the noise will require $ϵ < 1 / k$ if the noise is correlated, and $ϵ < 1 / k^{2}$ if the noise is uncorrelated. In other words, noise reduction matters! The precision provided by $ϵ$ -accuracy allows us to go from only recording ANDs to executing more general circuits in an efficient or universal way. Indeed, linear combinations of linear combinations just give more linear combinations – the noise reduction is the difference between being able to express any boolean function and being unable to express anything nonlinear at all. The XOR construction (given above) is another example that can be expressed as a linear combination involving the U-AND and would not work without the nonlinearity.

Aren’t the ANDs already kinda nonlinearly represented in the U-AND input?

This subsection refers to the basic U-AND construction from Section 1.1, with inputs not in superposition, but the objection we consider here could also be raised against other U-AND variants. While one cannot read off the ANDs linearly before the ReLU, except with a large error, one could certainly read them off with a more expressive model class on the activations. In particular, one can easily read ${AND}_{i, j}$ off with a ReLU probe, by which we mean $ReLU (r^{T} x + b$ ), with $r = e_{i} + e_{j}$ and $b = - 1$ . We think there’s some truth to this: we agree that if something can be read off with such a probe, it’s indeed at least almost already there. And if we allowed multi-layer probes, the ANDs would be present already when we only have some pre-input variables (that our input variables are themselves nonlinear functions of). To explore a limit in ridiculousness: if we take stuff to be computed if it is recoverable by a probe that has the architecture of GPT-3 minus the embed and unembed and followed by a projection on the last activation vector of the last position residual stream, then anything that is linearly accessible in the last layer of GPT-3 is already ‘computed’ in the tuple of input embeddings. And to take a broader perspective: any variable ever computed by a deterministic neural net is in fact a function of the input, and is thus already ‘there in the input’ in an information-theoretic sense (anything computed by the neural net has zero conditional entropy given the input). The information about the values of the ANDs is sort of always there, but we should think of it as not having been computed initially, and as having been computed later^[22].

Anyway, while taking something to be computed when it is affinely accessible seems natural when considering reading that information into future MLPs, we do not have an incredibly strong case that it’s the right notion. However, it seems likely to us that once one fixes some specific notion of stuff having been computed, then either exactly our U-AND construction or some minor variation on it would still compute a large number of new features (with more expressive readoffs, these would just be more complex properties — in our case, boolean functions of the inputs involving more gates). In fact, maybe instead of having a notion of stuff having been computed, we should have a notion of stuff having been computed for a particular model component, i.e. having been represented such that a particular kind of model component can access it to ‘use it as an input’. In the case of transformers, maybe the set of properties that have been computed as far as MLPs can tell is different than the set of properties that have been computed as far as attention heads (or maybe the QK circuit and OV circuit separately) can tell. So, we’re very sympathetic to considering alternative notions of stuff having been computed, but we doubt U-AND would become much less interesting given some alternative reasonable such notion.

If you think all this points to something like it being weird to have such a discrete notion of stuff having been computed vs not at all, and that we should maybe instead see models as ‘more continuously cleaning up representations’ rather than performing computation: while we don’t at present know of a good quantitative notion of ‘representation cleanliness’, so we can’t at present tell you that our U-AND makes amount $x$ of representation cleanliness progress and $x$ is sort of large compared to some default, it does seem intuitively plausible to us that it makes a good deal of such progress. A place where linear read-offs are clearly qualitatively important and better than nonlinear read-offs is in application to the attention mechanism of a transformer.

Does our U-AND construction really demonstrate MLP superposition?

This subsection refers to the basic U-AND construction from Section 1.1, with inputs not in superposition, but the objection we consider here could also be raised against other U-AND variants. One could try to tell a story that interprets our U-AND construction in terms of the neuron basis: we can also describe the U-AND as approximately computing a family of functions each of which record whether at least two features are present out of a particular subset of features^[23]. Why should we see the construction as computing outputs into superposition, instead of seeing it as computing these different outputs on the neurons? Perhaps the ‘natural’ units for understanding the NN is in terms of these functions, as unintuitive as they may seem to a human.

In fact, there is a sense in which if one describes the sampled construction in the most natural way it can be described in the superposition picture, one needs to spend more bits than if one describes it in the most natural way it can be described in this neuron picture. In the neuron picture, one needs to specify a subset of size $~ Θ (d_{0} / \sqrt{d})$ for each neuron, which takes $d {log}_{2} (\frac{d_{0}}{~ Θ (d_{0} / \sqrt{d})}) \leq ~ Θ (d_{0}^{2} \sqrt{d})$ bits to specify. In the superpositional picture, one needs to specify $(\frac{d_{0}}{2})$ subsets of size $~ Θ (1)$ , which takes about $~ Θ (d_{0}^{2})$ bits to specify^[24]. If, let’s say, $d = d_{0}$ , then from the point of view of saving bits when representing such constructions, we might even prefer to see them in a non-superpositional manner!

We can imagine cases (of something that looks like this U-AND showing up in a model) in which we’d agree with this counterargument. For any fixed U-AND construction, we could imagine a setup where for each neuron, the inputs feeding into it form some natural family — slightly more precisely, that whether two elements of this family are present is a very natural property to track. In fact, we could imagine a case where we perform future computation that is best seen as being about these properties computed by the neurons — for instance, our output of the neural net might just be the sum of the activations of these neurons. For instance, perhaps this makes sense because having two elements of one of these families present is necessary and sufficient for an image to be that of a dog. In such a case, we agree it would be silly to think of the output as a linear combination of pairwise AND features.

However, we think there are plausible contexts in which such a circuit would show up in which it seems intuitively right to see the output as a sparse sum of pairwise ANDs: when the families tracked by particular neurons do not seem at all natural and/or when it is reasonable to see future model components as taking these pairwise AND features as inputs. Conditional on thinking that superposition is generic, it seems fairly reasonable to think that these latter contexts would be generic.

Is universal calculation generic?

The construction of the universal AND circuit in the “quadratic nonlinearity” section above can be shown to be stable to perturbations; a large family of suitably “random” circuits in this paradigm contain all AND computations in a linearly-readable way. This updates us to suspect that at least some of our universal calculation picture might be generic: i.e., that a random neural net, or a random net within some mild set of conditions (that we can’t yet make precise), is sufficiently expressive to (weakly) compute any small circuit. Thus linear probe experiments such as Sam Marks’ identification of the “universal XOR” in a transformer may be explainable as a consequence of sufficiently complex, “random-looking” networks. This means that the correct framing for what happens in a neural net executing superposition might not be that the MLP learns to encode universal calculation (such as the U-AND circuit), but rather that such circuits exist by default, and what the neural network needs to learn is, rather, a readoff vector for the circuit that needs to be executed. While we think that this would change much of the story (in particular, the question of “memorization” vs. “generalization” of a subset of such boolean circuit features would be moot if general computation generically exists), this would not change the core fact that such universal calculation is possible, and therefore likely to be learned by a network executing (or partially executing) superposition. In fact, such an update would make it more likely that such circuits can be utilized by the computational scheme, and would make it even more likely that such a scheme would be learned by default.

We hope to do a series of experiments to check whether this is the case: whether a random network in a particular class executes universal computation by default. If we find this is the case, we plan to train a network to learn an appropriate read-off vector starting from a suitably random MLP circuit, and, separately, to check whether existing neural networks take advantage of such structure (i.e., have features – e.g. found by dictionary learning methods – which linearly read off the results of such circuits). We think this would be particularly productive in the attention mechanism (in the context of “universal key” generation, as explained above).

What are the implications of using $ϵ$ -accuracy? How does this compare to behavior found by minimizing some loss function?

A specific question here is:

Are algorithms that are $ϵ$ -accurate at U-AND the same as algorithms which minimize the MSE or some other loss function we might write down for training a neural net on the task?

The answer is that sometimes they are not going to be the same. In particular, our algorithm may not be given a low loss by MSE. Nevertheless, we think that $ϵ$ -accuracy is a better thing to study for understanding superposition than MSE or other commonly considered loss functions (cross entropy would be much less wise than either!) This point is worth addressing properly, because it has implications for how we think about superposition and how we interpret results from the toy models of superposition paper and from sparse autoencoders, both of which typically use MSE.

For our U-AND task, we ask for a construction $\to f (\to x)$ that approximately equals a 1-hot target vector $\to y$ , with each coordinate allowed to differ from its target value by at most epsilon. A loss function which would correspond to this task would look like a cube well with vertical sides (the inside of the region $L^{\infty} (\to f (\to x), \to y) < ϵ$ ). This non-differentiable loss function would be useless for training. Let’s compare this choice to alternatives and defend it.

If we know that our target is always a 1-hot vector, then maybe we should have a softmax at the end of the network and use cross-entropy loss. We purposefully avoid this, because we are trying to construct a toy model of the computation that happens in intermediate layers of a deep neural network, taking one activation vector to a subsequent activation vector. In the process there is typically no softmax involved. Also, we want to be able to handle datapoints in which more than 1 AND is present at a time: the task is not to choose which AND is present, but *which of the ANDs* are present.

The other ubiquitous choice of loss function is MSE. This is the loss function used to evaluate model performance in two tasks that are similar to U-AND: the toy model of superposition and SAEs. Two reasons why this loss function might be principled are

If there is reason to think of the model as a Gaussian probability model
If we would like our loss function to be basis independent.

We see no reason to assume the former here, and while the latter is a nice property to have, we shouldn’t expect basis independence here: we would like the ANDs to be computed in a particular basis and are happy with a loss function that privileges that basis.

Our issue with MSE (and $L^{p}$ in general for finite $p$ ) can be demonstrated with the following example:

Suppose the target is $y = (1, 0, 0, \dots)$ . Let $^y = (0, 0, \dots)$ and $~ y = (1 + ϵ, ϵ, ϵ, \dots)$ , where all vectors are $(\frac{d_{0}}{2})$ -dimensional. Then $| | y -^y | |_{p} = 1$ and $| | y - ~ y | |_{p} = {(\frac{d_{0}}{2})}^{1 / p} ϵ$ . For large enough $(\frac{d_{0}}{2}) > ϵ^{- p}$ , the latter loss is larger than 1^[25]. Yet intuitively, the latter model output is likely to be a much better approximation to the target value, from the perspective of the way the activation vector will be used for subsequent computation. Intuitively, we expect that for the activation vector to be good enough to trigger the right subsequent computation, it needs to be unambiguous whether a particular AND is present, and the noise in the value needs to be below a certain critical scale that depends on the way the AND is used subsequently, to avoid noise drowning out signal. To understand this properly we’d like a better model of error propagation.

It is no coincidence that our U-AND algorithm may be $ϵ$ -accurate for small $ϵ$ , but is not a minimum of the MSE. In general, $ϵ$ -accuracy permits much more superposition than minimising the MSE, because it penalises interference less.

For a demonstration of this, consider a simplified toy model of superposition with hidden dimension $d$ and inputs which are all 1-hot unit vectors. We consider taking the limit as the number of input features goes to infinity and ask: what is the optimum number $N (d)$ of inputs that the model should store in superposition, before sending the rest to the zero vector?

If we look for $ϵ$ -accurate reconstruction, then we know how to answer this: a random construction allows us to fit at least $N_{ϵ} (d) = C exp ϵ^{2} d$ vectors into $d$ -dimensional space.

As for the algorithm that minimises the MSE reconstruction loss (ie not sent to the zero vector in the hidden space), consider that we have already put $n$ of the inputs into superposition, and we are trying to decide whether it is a good idea to squeeze another one in there. Separating the loss function into reconstruction terms and interference terms (as in the original paper):

The $n + 1$ th input being stored subtracts a term of order $1$ from the reconstruction loss
Storing this input will also lead to an increase in the interference loss. As for how much, let us write $δ (n)^{2}$ for the average mean squared dot product between the $n + 1$ th feature vector and one of the $n$ feature vectors that were already there. Since the $n + 1$ th feature has $n$ distinct features to interfere with, storing it will contribute a term of order $n δ (n)^{2}$ to the interference loss.

So, the optimum number of features to store can be found by asking when the contribution to the loss $ℓ (n + 1) \sim n δ (n)^{2} - 1$ switches from negative to positive, so we need an estimate of $δ (n)$ . If feature vectors are chosen randomly, then $δ (n)^{2} = O (1 / d)$ and we find that the optimal number of features to store is $O (d)$ . In fact, feature vectors are chosen to minimise interference, which allows us to fit a few more feature vectors in (the advantage this gives us is most significant at small $n$ ) before the accumulating interferences become too large, and empirically we observe that the optimal number of features to store is $N_{L^{2}} (d) = O (d log d)$ . This is much much less superposition that we are allowed with $ϵ$ -accurate reconstruction!

See the figure below for experimental values of $N_{L^{p}} (d)$ for a range of $p, d$ . We conjecture that for each $p, N_{L^{p}} (d)$ is the minimum of an exponential function which is independent of $p$ and something like a polynomial which depends on $p$ .

3 The QK part of an attention head can check for many skip feature-bigrams, in superposition

In this section, we present a story for the QK part of an attention head which is analogous to the MLP story from the previous section. Note that although both focus on the QK component, this is a different (though related) story to the story about universal keys from section 1.4.

We begin by specifying a simple task that we think might capture a large fraction of the role performed by the QK part of an attention head. Roughly, the task (analogous to the U-AND task for the MLP) is to check for the presence of one in a large set of ‘skip bigrams’^[26] of features^[27].

We’ll then provide a construction of the QK part of an attention head that can perform this task in a superposed manner — i.e., a specification of a low-rank matrix $W_{Q K} = W_{K}^{T} W_{Q}$ that checks for a given set of skip feature-bigrams. A naive construction could only check for $d_{head}$ feature bigrams; ours can check for $~ Θ (d_{head} d_{resid})$ feature bigrams. This construction is analogous to our construction solving the targeted superpositional AND from the previous sections.

3.1 The skip feature-bigram checking task

Let $B$ be a set of ‘skip feature-bigrams’; each element of $B$ is a pair of features $({\to f}_{i}, {\to f}_{j}) \in R^{d_{resid}} \times R^{d_{resid}}$ . Let’s define what we mean by a skip feature-bigram being present in a pair of residual stream positions. Looking at residual stream activation vectors just before a particular attention head (after layernorm is applied), we say that the activation vectors ${\to a}_{s}, {\to a}_{t} \in R^{d_{resid}}$ at positions $s, t$ contain the skip feature-bigram $({\to f}_{i}, {\to f}_{j})$ if feature ${\to f}_{i}$ is present in ${\to a}_{t}$ and feature ${\to f}_{j}$ is present in ${\to a}_{s}$ . There are two things we could mean by the feature ${\to f}_{i}$ being present in an activation vector $\to a$ . The first is that ${\to f}_{i} \cdot^{'}$ is always either $\approx 0$ or $\approx 1$ for any $a^{'}$ in some relevant data set of activation vectors, and ${\to f}_{i} \cdot \to a = 1$ . The second notion assumes the existence of some background set ${\to f}_{1}, {\to f}_{2}, \dots, {\to f}_{m}$ in terms of which each activation vector $a$ has a given background decomposition, $a = \sum_{i = 1}^{m} c_{i} {\to f}_{i}$ . In fact, we assume that all $c_{i} \in {0, 1}$ , with at most some constant number of $c_{i} = 1$ for any one activation vector, and we also assume that the ${\to f}_{i}$ are random vectors (we need them to be almost orthogonal). The second notion guarantees the first but with better control on the errors, so we’ll run with the second notion for this section^[28].

Plausible candidates for skip feature-bigrams $({\to f}_{i}, {\to f}_{j})$ to check for come from cases where if the query residual stream vector has feature ${\to f}_{j}$ , then it is helpful to do something with the information at positions where ${\to f}_{i}$ is present. Here are some examples of checks this can capture:

If the query is a first name, then the key should be a surname.
If the query is a preposition associated with an indirect object, then the key should be a noun/name (useful for IOI).
If the query is token T, then the key should also be token T (useful for induction heads, if we can do this for all possible tokens).
If the query is ‘Jorge Luis Borges’’, then the key should be ‘Tlön, Uqbar, Orbis Tertius’.
If the mood of the paragraph before the query is solemn, then the topic of the paragraph before the key should be statistical mechanics.
If the query is the end of a true sentence, then the key should be the end of a false sentence.
If the query is a type of pet, then the key should be a type of furniture.

The task is to use the attention score $S$ (the attention pattern pre-softmax) to count how many of these conditions are satisfied by each choice of query token position and key token position. That is, we’d like to construct a low-rank bilinear form $W_{K}^{T} W_{Q}$ such that the $(s, t)$ entry of the attention score matrix $S_{s t} = {\to a}_{s}^{T} W_{K}^{T} W_{Q} {\to a}_{t}$ contains the number of conditions in $C$ which are satisfied for the query residual stream vector in token position $s$ and the key residual stream vector in the token position $t$ . We’ll henceforth refer to the expression $W_{K}^{T} W_{Q}$ as $W_{Q K}$ , a matrix of size $d_{resid} \times d_{resid}$ that we choose freely to solve the task subject to the constraint that its rank is at most $d_{head} < d_{resid}$ . If each property is present sparsely, then most conditions are not satisfied for most positions in the attention score most of the time.

We will present a family of algorithms which allow us to perform this task for various set sizes $| B |$ . We will start with a simple case without superposition analogous to the ‘standard’ method for computing ANDs without superposition. Unlike for U-AND though, the algorithm for performing this task in superposition is a generalization of the non-superpositional case. In fact, given our presentation of the non-superpositional case, this generalization is fairly immediate, with the main additional difficulty being to keep track of errors from approximate calculations.

3.2 A superposition-free algorithm

Let’s make the assumption that $m$ is at most $d_{resid}$ . For the simplest possible algorithm, let’s make the further (definitely invalid) assumption that the feature basis is the neuron basis. This means that ${\to a}_{s}$ is a vector in ${0, 1}^{d_{resid}}$ . In the absence of superposition, we do not require that these features are sparse in the dataset.

To start, consider the case where $B$ contains only one feature bigram $({\to e}_{i}, {\to e}_{j})$ . The task becomes: ensure that $S_{s t} = {\to a}_{s}^{T} W_{Q K} {\to a}_{t}$ is 1 if feature ${\to f}_{i}$ is present in ${\to a}_{s}$ and feature ${\to f}_{j}$ is present in ${\to a}_{t}$ and 0 otherwise. The solution to this task is to choose $W_{Q K}$ to be a matrix with zero everywhere except in the $i, j$ component: $(W_{Q K})_{k l} = δ_{k i} δ_{l j}$ —with this matrix, ${\to a}_{s}^{T} W_{Q K} {\to a}_{t} = 1$ iff the $i$ entry of ${\to a}_{s}$ is 1 and the $j$ entry of ${\to a}_{t}$ is 1. Note that we can write $W_{Q K} = \to k \otimes \to q$ where $\to k = {\to e}_{i}$ , $\to q = {\to e}_{j}$ , and $\otimes$ denotes the outer product/tensor product/Kronecker product. This expression makes it manifest that $W_{Q K}$ is rank 1. Whenever we can decompose a matrix into a tensor product of two vectors (this will prove useful), we will call it a _pure tensor_ in accordance with the literature. Note that this decomposition allows us to think of $W_{Q K}$ in terms of the query part and key part separately: first we project the residual stream vector in the query position onto the $i$ th feature vector which tells us if feature $i$ is present at the query position, then we do the same for the key, and then we multiply the results.

In the next simplest case, we take the set $B$ to consist of pairs $(e_{i}, e_{j})$ . To solve the task for this $B$ , we can simply perform a sum over $W_{Q K}^{P}$ for each bigram in $B$ , since there is no interference. That is, we choose

$W_{Q K}^{P} = \sum (i, j) \in B {\to e}_{i} \otimes {\to e}_{j}$

The only new subtlety that is introduced in this modification comes from the requirement that the rank of $W_{Q K}^{P}$ be at most $d_{head}$ which won’t be true in general. The rank of $W_{Q K}^{P}$ is not trivial to calculate for a given $B$ . This is because we can factorize terms in the sum:

${\to e}_{j_{1}} \otimes {\to e}_{i_{1}} + {\to e}_{j_{1}} \otimes {\to e}_{i_{2}} + {\to e}_{j_{2}} \otimes {\to e}_{i_{1}} + {\to e}_{j_{2}} \otimes {\to e}_{i_{2}} = ({\to e}_{j_{1}} + {\to e}_{j_{2}}) \otimes ({\to e}_{i_{1}} + {\to e}_{i_{2}})$

which is a pure tensor. The rank requirement is equivalent to the statement that $W_{K W}^{P}$ can contain at most $d_{head}$ terms _after maximum factorisation_ (a priori, not necessarily in terms of such pure tensors of sums of subsets of basis vectors). Visualizing the set $B$ as a bipartite graph with $m$ nodes on the left and right, we notice that pure tensors correspond to any subgraphs of $B$ that are _complete_ bipartite subgraphs (cliques). A sufficient condition for the rank of $W$ being at most $d_{head}$ is if the edges of $B$ can be partitioned into at most $d_{head}$ cliques. Thus, whether we can check for all feature bigrams in $B$ this way depends not only on the size of $B$ , but also its structure.. In general, we can’t use this construction to guarantee that we can check for more than $d_{head}$ skip feature-bigrams.

Generalizing our algorithm to deal with the case when the feature basis is not neuron-aligned (although it is still an orthogonal basis) could not be simpler. All we do is replace ${{\to e}_{i}}$ with the new feature basis, use the same expression for $W_{Q K}^{P}$ , and we are done.

3.3 Checking for a structured set of skip feature-bigrams with activation superposition

We now consider the case where the residual stream contains $m > d_{resid}$ sparsely activated features stored in superposition. We’ll assume that the feature vectors are random unit vectors, and we’ll switch notation from $e_{1}, \dots, e_{d_{resid}}$ to $f_{1}, \dots, f_{m}$ from now on to emphasize that the f-vectors are not an orthogonal basis. We’d like to generalize the superposition-free algorithm to the case when the residual stream vector stores features in superposition, but to do so, we’ll have to keep track of the interference between non-orthogonal f-vectors. We know that the root mean square dot product between two f-vectors is $1 / \sqrt{d_{resid}}$ . Every time we check for a bigram that isn’t present and pick up an interference term, the noise accumulates—for the signal to beat the noise here, we need the sum of interference terms to be less than $1$ . We’ll ignore log factors in the rest of this section.

We’ll assume that most of the interference comes from checking for bigrams $({\to f}_{i}, {\to f}_{j})$ where ${\to f}_{i}$ isn’t in ${\to a}_{s}$ and also ${\to f}_{j}$ isn’t in ${\to a}_{t}$ — that cases where one feature is present but not the other are rare enough to contribute less can be checked later. These pure tensors typically contribute an interference of $1 / d_{resid}$ . We can also consider the interference that comes for checking for a clique of bigrams: let $K$ and $Q$ be sets of features such that $B = K \times Q$ . Then, we can check for the entire clique using the pure tensor $(\sum_{j \in K} {\to f}_{j}) \otimes (\sum_{i \in Q} {\to f}_{i})$ . Checking for this clique of feature bigrams on key-query pairs which don’t contain any bigram in the clique contributes an interference term of $\sqrt{| K | | Q |} / d_{resid}$ assuming interferences are uncorrelated. Now we require that the sum over interferences for checking all cliques of bigrams—of which there are at most $d_{head}$ - is less than one. Since there are at most $d_{head}$ cliques, then assuming each clique is the same size (slightly more generally, one can also make the cliques differently-sized as long as the total number of edges in their union is at most $d_{resid}$ ) and assuming the noise is independent between cliques, we require $\sqrt{| K | | Q |} / d_{resid} < 1 / \sqrt{d_{head}}$ . Further assuming $| K | = | Q |$ , this gives that at most $| K | = | Q | = d_{resid} / \sqrt{d_{head}}$ . In this way, over all $d_{head}$ cliques, we can check for up to $d_{resid}^{2}$ bigrams, which can collectively involve up to $d_{resid} \sqrt{d_{head}}$ distinct features, in each attention head.

Note also that one can involve up to $d_{head} d_{resid}$ features if one chooses $| K | = 1$ and $| Q | = d_{resid}$ (or the other way around) for each clique. In that case, noise from situations where the small side f-vector gets hit dominates — this is what forces the large side to have size at most $d_{resid}$ .

(Note how all these numbers compare to the parameter count of $d_{resid} d_{head}$ .)

3.4 Checking for a smaller unstructured set of feature pairs in superposition

We now consider the case that we would like to check for an arbitrary set of feature pairs. This is analogous to the task of computing a subset of ANDs of inputs in superposition. In this general case, we can’t assume that they form large cliques.

The construction is a generalization of our non-superpositional construction: we take a sum of pure tensors, one for each pair in $B$ , and then take a low rank approximation at the end. We will now work through the details to figure out just how much computation we can fit in before the noise overwhelms the signal.

To be precise, the construction is that we let ${^W}_{Q K} := {^W}_{Q K} (B) = \sum_{(i, j) \in B} {\to f}_{i} \otimes {\to f}_{j}$ with $| B | > d_{head}$ . We’ll continue the assumption that ${{\to f}_{i}}$ are random vectors. To ensure that the matrix is rank $d_{head}$ we will need to project it down somehow: we pick $d_{head}$ random gaussian vectors, and write a projection matrix $R$ which projects to the subspace spanned by these random vectors. In fact we will choose $R$ to be this projection matrix scaled up by an amount $\frac{d_{resid}}{d_{head}}$ so that $(R {\to f}_{i}) \cdot {\to f}_{i} = 1$ . Then we write $W_{Q K} = {^W}_{Q K} R$ .^[29]

We’ll give a heuristic argument now that this construction works — in particular, that it lets one make a QK circuit which checks for a generic set of up to $d_{resid} d_{head}$ bigrams (up to log factors), without assuming any structure to those bigrams.

We’d like to understand the size of noise in our QK-circuit, i.e. to understand ${\to n}_{1}^{T} W_{Q K} {\to n}_{2} = {\to n}_{1}^{T}^W R {\to n}_{2} = {\to n}_{1}^{T} ⎛ ⎝ \sum (i, j) \in B {\to f}_{j} \otimes (R {\to f}_{i}) ⎞ ⎠ {\to n}_{2} = \sum (i, j) \in B ({\to n}_{1} \cdot {\to f}_{j}) ({^{'}}_{i} \cdot {\to n}_{2})$ in the case that ${\to n}_{1}, {\to n}_{2}$ are random unit vectors. Each term in the sum is of size $\frac{1}{\sqrt{d_{resid} d_{head}}}$ , so the total noise is $\sqrt{\frac{| B |}{d_{resid} d_{head}}}$ .

To understand the size of noise in our QK-circuit, we can see what happens when the residual stream vectors are replaced with random unit vectors ${\to n}_{1}, {\to n}_{2} \notin {{\to f}_{j}}$ . This simulates what we’d pick up if the two token positions of interest each had a single feature active, neither of which were in our set of bigrams. In this case we have

${\to n}_{1}^{T} W_{Q K} {\to n}_{2} = {\to n}_{1}^{T}^W R {\to n}_{2} = {\to n}_{1}^{T} ⎛ ⎝ \sum (i, j) \in B {\to f}_{i} \otimes (R {\to f}_{j}) ⎞ ⎠ {\to n}_{2} = \sum (i, j) \in B ({\to n}_{1} \cdot {\to f}_{i}) ({\to f}_{j}^{'} \cdot {\to n}_{2})$

${\to f}_{i}^{'}$ is a vector with a typical size of $\sqrt{\frac{d_{resid}}{d_{head}}}$ due to the rescaling of $R$ . Therefore each term in the sum is typically of size $\frac{1}{\sqrt{d_{resid} d_{head}}}$ , so exploiting that each term in the sum is independent, the total noise is on the order of $\sqrt{\frac{| P |}{d_{resid} d_{head}}}$ . Now, if the key and query vector have $κ_{K}$ and $κ_{Q}$ features active respectively, with none of these features in any of our bigrams, then the total noise is $\sqrt{\frac{κ_{K} κ_{Q} | P |}{d_{resid} d_{head}}}$ .

We might wonder what the noise term is from pure tensors ${\to f}_{i} \otimes v e c f_{j}^{'}$ where ${\to f}_{i}$ is present in ${\to a}_{s}$ but ${\to f}_{j}$ is not present in ${\to a}_{t}$ (or the other way around). In this case, the size of the noise term will be $1 / \sqrt{d_{head}}$ or $1 / \sqrt{d_{resid}}$ , depending on whether the feature is present in the query or the key^[30].

As for the size of the signal, (ie the size of ${\to a}_{s}^{T} W_{Q K} {\to a}_{t}$ for residual stream vectors in positions $s, t$ which contain a bigram in $B$ ), we have

${\to a}_{s}^{T} {^W}_{Q K} {\to a}_{t} = {\to f}_{i^{'}}^{T}^W R {\to f}_{j^{'}} = {\to f}_{i^{'}}^{T} ⎛ ⎝ \sum (i, j) \in B {\to f}_{j} \otimes (R {\to f}_{i}) ⎞ ⎠ {\to f}_{j^{'}} = \sum (i, j) \in B ({\to f}_{i^{'}} \cdot {\to f}_{i}) ({\to f}_{j}^{'} \cdot {\to f}_{j^{'}})$

where $({\to f}_{i^{'}}, {\to f}_{j^{'}}) \in B$ . Since we rescaled $R$ , the term in the sum for $i = i^{'}, j = j^{'}$ is equal to $1$ . For other terms in the sum, we get interference terms on the same scale as the noise above.

This means that in order for the signal to be larger than the noise, i.e. for us to get readoffs that are always in $1 \pm ϵ$ or $\pm ϵ$ , we require $| B |$ to be no larger than $~ Θ (d_{resid} d_{head})$ , and that no one feature is present in more than $~ Θ (d_{head})$ of the skip feature-bigrams. Note that the former condition implies the latter if we are allowed to further assume that the set of pairs in $B$ is generic: if the pairs are chosen at randomly, for $m ≫ d_{resid}$ , each f-vector will be chosen roughly $d_{resid} d_{head} / m ≪ d_{head}$ times.

3.5 Copy-checker heads and structure-exploiting algorithms

Sometimes (often?) it is possible to check for a much larger set of skip feature-bigrams than any of the above algorithms suggest. This is when a large number of features are related to each other by a linear map, which may happen when there is a simple relationship between some subset of features and another subset. For example, perhaps there are a large number of female name features like {Michelle Obama, Marie Curie, Angelina Jolie...} and another large number of features corresponding to their husbands {Barack Obama, Pierre Curie, Brad Pitt...}. Then, the NN may be incentivised to arrange these features in such a way that there is a linear map that takes all female name features to their husband’s feature, because this will allow an attention head to attend from the woman to instances of her husband in the text.

To see how this works, let $F = {\to f}_{1}, \dots, {\to f}_{m}$ be an almost orthogonal overbasis of f-vectors (which can be exponentially large), and let $M$ be an arbitrary orthogonal $d \times d$ matrix such that for all $i$ , $M {\to f}_{i}$ is approximately equal to at most one f-vector, and almost orthogonal to all the others. Let $Φ \subseteq F$ be the set of f-vectors which are mapped to another vector in $F$ by $M$ and let $Ψ = M Φ = {M_{i} |_{i} \in Φ} \subseteq F$ . One such setup can be achieved as follows: choose $M$ to be a random orthogonal matrix, and let $Φ$ be an almost orthogonal set of unit vectors of size $m / 2$ . Then, with high probability, $F := Φ \cup Ψ = Φ \cup M Φ$ is also almost-orthogonal. Now let $B = {({\to f}_{i}, M {\to f}_{i}) | {\to f}_{i} \in Φ}$ .

Then, choosing $W_{Q K}$ to be is a random rank $d_{head}$ approximation of $M$ (scaled up by $\frac{d_{resid}}{d_{head}}$ ) will allow us to check for every element of $B$ at once: For any $i$ , if feature $ϕ_{i}$ is in the query, then it will be mapped to a random scaled $d_{head}$ dimensional projection of $ψ_{i}$ by $W_{Q K}$ , and contribute 1 to the dot product. Noise terms will be of size $1 / \sqrt{d_{resid}}$ .

In the husband-wife case, $Φ$ is the set of women and $Ψ$ is the set of their husbands. Then, an attention head which chooses $W_{Q K}$ to be a low rank approximation to $M$ can check for exponentially many wife-husband bigrams by exploiting that each wife feature can be mapped to the husband feature by the same linear transformation (the same rotation if we insist that $M$ is orthogonal). Of course, this working depends on the very nontrivial assumption that there is this linear relation — this is probably false for these particular pairs in real models; it’s just an illustration, though see this paper which observes a similar phenomenon for relations between sports players and their sports, and in several other examples.

A special case of this is if $ϕ_{i} = ψ_{i}$ for all $i$ . In this case, the set $B$ corresponds to a family of bigrams like “if the query has feature $i$ then the key should have feature $i$ also”, and the keys that get paid the most attention to are those that are composed of the most similar features as the query. That is, $M$ is the identity, and the attention head is performing the function of a copy-checker head.

The K-composition version of an induction head does something similar: Use the OV circuit of a previous head to copy many features from one subspace to another. Then choose $W_{Q K}$ to be $W_{O V}^{T}$ of the previous head.

So, it is possible to understand many of the functions that attention heads are previously known to perform in the lange of skip feature-bigram checking, which is good news. On the other hand, if many of the most important things done by attention heads exploit this linear structure, then it may be counterproductive to think in terms of memorized skip feature-bigrams. Certainly the skip feature-bigram description for copy-checker heads is less simple than the traditional description.

We think it is plausible there are also interesting constructions that combine the unstructured and structure-exploiting algorithms. That is, we can probably take $W_{Q K}$ to track some unstructured union of linearly related feature pairs. We leave investigating this to future work.

Generalization as a limit of memorization

So, in our picture, copy-checker heads are attention heads which exploit the linear structure of the activation space to check for many conditions of the form

if feature x is in query, then it should also be in key

at the same time. Ths is conceptually subtly different to the standard story for copy-checker heads, in which we think of them as asking the more general question

Which features are in the query? Those features should also be in the key

or even

Is the key the same vector as the query?

Even though the two descriptions describe the same behavior, we think that ours offers a story of how these general purpose attention heads can be learned:

Consider a setup without residual stream superposition. If the loss on some batch would be lower by checking for ‘if feature 16 is present in the query, then feature 16 is present in the key’, then perhaps that ‘identity’ bigram gets learned. So, $W_{Q K}$ is updated from being the zero matrix to a matrix with a 1 in the $(16, 16)$ position (when written in the feature basis on the left and right). In a sense, this is a form of memorisation: the general task of language modeling would benefit from a copy-checker head here, but the model only learned to copy a specific feature that it saw on a particular batch. Over subsequent training, more 1s are placed along the diagonal, until eventually $d_{head}$ identity bigrams have been memorized. At this point, we notice that $W_{Q K}$ has become the identity matrix (in a $d_{head}$ dimensional subspace), which is exactly the matrix that the generalizing algorithm (a copy-checker head which can copy any query vector back) requires. In this setup, enough memorization precisely led to generalization!

This also works, and looks somewhat more magical, if we allow the residual stream to contain a sparse overbasis (feature vectors are assumed to be random unit vectors again). Now, each time a specific identity bigram is learned, we have ${^W}_{Q K}$ (the bilinear form before projection to a random $d_{head}$ dimensional subspace) is replaced with ${^W}_{Q K} + {\to f}_{i} \otimes {\to f}_{i}$ for some particular $i$ . After $m$ bigrams have been learned, we have (after rescaling)

$(^WQK)kl=dresidmm∑i=1(fi)k(fi)l→{1,k=l1/√m,k≠l$

This approaches the identity as $m$ grows (this can be made precise with the usual Chernoff and union bounds), such that the projection $W_{Q K}$ approaches the low rank identity required for the generalizing copy-checker head.

4 QK: discussion

We have a few thoughts about how well this description captures the role of the QK circuit.

Where does softmax fit in?

If features are present in inputs with probability (sparsity) $s$ , then skip feature-bigrams should generically be satisfied with probability $s^{2}$ (assuming independence). For sparse enough inputs, it is very unlikely for more than one pair skip feature-bigram to be present on any pair of positions. In this case, entries in the attention score are almost always in ${0, 1}$ and the QK circuit can be thought of as computing $⋁_{(i, j) \in B} (is f_{i} present in ({\to a}_{s}) \land is f_{j} present in ({\to a}_{t}))$ . In this case, if we scale up the QK circuit so that entries in the attention score are in ${0, 100}$ , then the softmax will kill the zero entries, and each row of the attention pattern will have entries that were 100 replaced with $1 / r$ where $r$ is the number of nonzero entries in the row. This makes sense — it will correspond to taking an arithmetic mean of the value vectors in the $r$ positions that contain the first element of a feature bigram (with the second element of the pair in the query position). If, for a particular query, there is only one key that has a feature bigram in $B$ with it, then this key will be attended to entirely.

However, if the features are less sparse, our task isn’t to check whether one of a set of feature bigrams is present, but rather count the number of pairs which are present. This means that for a particular query, if we scale up the QK circuit, then the attention pattern will be nonzero only on whichever key contains the most feature bigrams with the query (or on whichever set of keys ties for first place). We aren’t sure if this is a feature or a bug.

Maybe attention layers really only want to pay attention to one or a few previous tokens. Softmax really implies that there is a limited amount of attention to go around (it has to add to 1 for each query) so maybe it should all be allocated to whichever keys have the most feature bigrams with the query.
Alternatively, we might want to allocate only somewhat more attention to keys which contain $k$ feature bigrams with the query than to keys which contain $k - 1$ . This means we can’t scale up the QK circuit much, which means that we will end up paying some attention to keys which host no bigrams with the query.

Unknown unknowns

Attention layers are hard to interpret, not least because softmax is a beast. While it is known that attention patterns are good at looking back through the sequence for information and moving it around, it is not known if that is _all_ that they do (of course this limitation is not specific to our work). We make no predictions about whether future researchers will find entirely different things that the QK circuit can do that looks nothing like checking for skip feature-bigrams.

Does our QK construction really demonstrate superposition?

Just as it was possible to tell a story of the U-AND construction that didn’t leverage superposition, it is possible to describe the construction of section 3.4 without mentioning superposition. In particular, the natural non-superpositional story would be to describe the matrix $W_{Q K} = \sum_{(i, j) \in B} {\to f}_{i} \otimes (R {\to f}_{j})$ through its SVD:

$W_{Q K} = d_{head} \sum i = 1 σ_{i}_{i} \otimes_{i}$

We know that the sum only ranges over $i = 1, \dots, d_{head}$ because $W_{Q K}$ has rank at most $d_{head}$ . So we can interpret the QK circuit as calculating precisely $d_{head}$ different projections on the right and on the left, multiplying the pairs and adding them, at each query and key token position.

The problem with this story is that each projection (each term like ${\to v}_{i} \cdot {\to a}_{t}$ ) doesn’t have a nice interpretation in terms of our boolean features: it is some linear combination of the features with no short description length in terms of boolean variables. In general, the right and left singular bases of $W_{Q K}$ have little to do with the residual stream overbasis, and if our goal is interpretability, we’d really like to understand $W_{Q K}$ in the left and right feature overbasis, which is what we have done in this post.

4 How relevant are our results to real models?

The bounds we give in this paper are asymptotic and tend to have bad constant (or logarithmic) terms that are likely quite suboptimal. In some back-of-the-envelope calculations and experiments we did, they give high interference terms for modest model widths (on the order of hundreds of neurons). However, we believe that real networks might learn algorithms of a similar type that have much better constants, and thus implement efficient computation for realistic values. We hope that our asymptotic results capture qualitative information about what processes can be learned effectively in real-world models, rather than that our bespoke mathematical algorithms are the best possible.

More generally, we think that boolean computation can explain only a piece of the computational structure of the interpretation of a neural net. Some examples that are likely to be boolean-interpretable are bigram-finding circuits and induction heads. However, it’s possible that most computations are continuous rather than boolean^[31]. Second, many computations that occur in neural nets may not be best understood as boolean-style circuits, because the bits have important mathematical structure. In this case, the best interpretation may reference a range of mathematical components instead, like the complex multiplication map in modular addition. Nevertheless, we think that understanding boolean circuits is important, and we hope to come up with analogous results for continuous variables in the future.

So, the degree to which the picture we paint captures the computation happening in real transformer models is not clear to us. There are a range of options here.

As far as we know, it’s possible that transformer activations are not best thought of as being in superposition — that all representations are compositional (see here or here for more discussion) or even best seen in some entirely different way, e.g. perhaps as having some structure that involves less linearity. There are many possibilities that have yet to be pinned down, and we don’t want to contribute towards privileging any particular hypothesis.
It could be that transformer activations are best thought of as using superposition, but that they do not implement anything like our toy constructions at all, e.g. because there are additional major structures in a transformer that our toy constructions do not make use of – an example of a possible such structure is the notion of linear relations between related subsets of features, as found in this paper and referenced in the “Structure-exploiting algorithms” section above (though this would be a refinement on top of our boolean feature picture rather than a completely different model).
Components similar to the circuits we identify show up in real transformers.

We note that if circuits like the ones we describe do turn out to be present and useful in real transformers, there are two ways in which we expect the picture to be made more sophisticated. First, it has been observed that many computations that can be done in a single layer in a transformer are instead spread out (perhaps via random optimisation processes) to be gradually done over many layers. Second, there is evidence that there is important additional structure to the arrangement of the feature vectors. We think it would be interesting and natural to try to combine such additional structure with our picture of computation in superposition, and produce a more expressive (and, hopefully, more complete) theory of computation. We gesture at the beginnings of such a picture at the bottom of the section on the QK circuit, but a more complete picture of this type is outside our scope.

5 Open directions / what we’re thinking about now

These are very rough bullet point lists. The items in each list are in no particular order, and the ordering of lists is not particular, either. Please get in touch with us if you are interested in pursuing any of these ideas, or if you want to talk through other theory/experiment ideas that aren’t on the list. If no one does so, we might publish a more fleshed-out set of ideas for future work.

The OV circuit

We think it might be interesting to understand a possible implementation of the OV circuit in terms of our formalism, to complement our study of the QK circuit above. In brief: the QK component above ‘issues a command to move information’ if one of a certain set of ordered feature pairs is present. It is canonical wisdom that the rank $d_{head}$ matrix $W_{O V}$ gets to choose which information to move (i.e., from where in the residual stream to take information) and where in the residual stream to put it. In the language of sparse boolean features, a natural thing one can ask of the OV circuit is to fulfill a list of instructions of the form ‘if $f_{i}$ is present in the residual stream at the attended-to position, modify the residual stream at this token position to change the value read off by dot product with the read-off vector $r_{j}$ ’’. By the same computation as in the QK section, a natural choice that’d work is $\sum_{i = 1}^{m} f_{i} \otimes r_{i}$ ; to make it have rank $d_{head}$ , we again pick a projection $R$ from $R^{d_{resid}}$ down to a random $d_{head}$ subspace and use $\sum_{i = 1}^{m} f_{i} \otimes (R r_{i})$ . Here, as before, $m$ can be up to $d_{head} d_{resid}$ up to a polylog factor^[32]. Or we may again also consider variants with pure tensors where both of the tensor ANDs are sums of features.

This story is preliminary and hasn’t been worked out in detail at the time of writing. One issue is that often attention heads do not attend to a single previous token position, but rather a mixture of several previous positions. Combining many value vectors in linear combination could break sparsity, and could also result in features being non-binary. We’d like to work on this story more in future.

Specifying concrete use cases

Pin down concrete tasks (with a dataset and loss function) that require each of these constructions (or some similar variant) to be implemented in order for the task to be done.
Alternatively, explain why there wouldn’t be such tasks, or why nothing practical could have this form. More generally, improve our understanding of when constructions like the ones presented here are useful.
Once a suitable task has been identified, train and see if the low loss solution can be found.

Genericity questions

We hope to run a series of experiments to check whether universal calculation is executed by random MLP’s (see the section “Is universal calculation generic” in the FAQ above). Specifically, we plan to train a readoff vector starting with a randomly initialized MLP to see whether it can accurately learn to read the output of suitable circuits.

Reverse-engineering

Suppose we can identify a task that requires some of our constructions, and we can train a model to perform well at those constructions. Which techniques allow it to be reverse engineered? Which interpretability techniques lead to a misinterpretation of what is happening^[33]?

Understanding errors

Understand how error propagates through multiple layers of such calculations
Understand how keeping errors small trades off against various other parameters
Come up with sparse error-correction components (or argue that there couldn’t be any)

Clarifying the model of computation

Write down a formal model of computation that describes what these components can compose to. Something like: a set of features starts off at each position; new features are computed from these by alternating cross-token and local sparse boolean operations.
Something about computation that involves negations (negations are in some tension with sparsity)
We can write down a universal AND in exponentially many features in the quadratic activation context, but in the ReLU context it seems that we are currently hitting some barrier around num of sparse gates = num of params. Note that without the flexibility of allowing linear readoffs, this would be a general information-theoretic bound, but with the linear readoffs, that bound, in full generality, is definitely false (otherwise the quadratic U-AND construction would be impossible). It is interesting to us whether a more efficient universal AND is possible in the ReLU context, or if this is a fundamental bound in this case. We also see that the number of bigrams that the QK circuit can check for is bounded by the number of parameters. We’d really like to understand what is going on here—is there some deeper result that explains why this limit is hit in a diverse set of places? (Relatedly, it is known that one can similarly have a neural net memorize as many data points as it has parameters, though for finite bit complexity, there is a matching information-theoretic upper bound. (Without a bit complexity bound, one can actually (do more)
Find a way to interpolate between the universal AND construction and the (slightly less efficient) targeted superpositional AND. One idea: if one is in the gate sparsity regime where there are triangles in $E$ , one might want to introduce some 3-way correlations (and so on for other correlations). E.g. whenever $E$ has a triangle $i j k$ , we’d pick a random set of neurons at which columns $i, j, k$ of $^W$ have unusually high density. Maybe there’s some universally good construction like this which has a contribution for every (maximal?) clique in the gate graph $E$ . And then maybe the universal AND is the special case where the entire gate graph is just one big clique, and the construction we provide above is the special case where the gate graph is really sparse (specifically, has basically no triangles).
Characterize the input distributions and boolean circuits for which the number of nodes which get a 1 in any layer is bounded^[34].
Maybe a more appropriate question would be to characterize the input distributions and boolean circuits such that the number of nodes which can be turned on across the entire circuit is bounded (this seems natural if we think of everything being computed into the residual stream of a transformer and never being erased, and we think of there being a uniform bound on the compositeness across layers). For instance, among all circuits with this property, which kinds should we think of as generic — if we pick a uniformly random such circuit, what’s the distribution of the number of nodes for each layer? Are the layer sizes fairly concentrated around a certain profile? Does this induce a fairly concentrated profile for the number of features that are ON in each layer? Does any of this have anything to do with residual stream vectors growing exponentially over the forward pass? (Let’s say we define ‘layer’ as constructed in the proof of Mirsky’s Theorem.)
Come up with more appropriate boolean circuit questions than the above two, and answer those

Potential reframings

It currently seems plausible to us that ~whenever we say something is a sparse linear combination of feature vectors corresponding to some properties, we could instead say that there are readoffs for these properties (that are only rarely on, or out of which only a small subset is ever on). Can this post indeed be rewritten in terms of readoffs only? Very briefly, the intuition is that model components just care about readoffs, not about the structure of activations. Especially if this program goes through, then it seems likely to us that ‘readoffs are more fundamental than activations being linear combinations of features’ and any linear-combination-of-features model should either be derived (given some auxiliary reasonable assumptions) from a readoff picture (e.g. from considerations having to do with how stuff needs to get computed) or should be dropped in favor of the dual picture.
Understand how work on hyperdimensional computing relates to this (ht to Jonathon Liu for telling us there might be a connection)

How applicable are our setups to the real world?

Using techniques for reverse-engineering circuits that compute in superposition developed while studying toy models, study models in the wild to see if similar circuits are learned.
Advance our understanding of how representative these algorithms are. Do the toy tasks capture most/any real-world behavior? For example, copy heads and their cousins exploit structure to do more powerful operations than our simple model suggests are possible, and we think it’s likely that there is lots of other structure that we are currently missing.
Neural networks may operate in part with sparse features in superposition, and in part with compositional, dense features. We’d like to understand whether this is a true dichotomy or a spectrum, and how computation in superposition can interface with compositional parts of a network.
Find constructions that can handle non-binary features. Alternatively, explain why computation in superposition is not possible in the same way with continuous features.
Understand better how anything like this would be learned. Maybe there’s some story of superpositional feature ecology involving a sequence of local steps of representing increasingly complicated things that are simple functions of existing things?
Think about how much direct sense any of this makes for other architectures

Acknowledgments

We’d like to thank Nix Goldowsky-Dill, Simon Skade, Lucius Bushnaq, Nina Rimsky, Rio Popper, Walter Laurito, Hoagy Cunningham, Euan Ong, Aryan Bhatt, Hugo Eberhard, Andis Draguns, Bilal Chughtai, Sam Eisenstat, Kirke Joamets, Jonathon Liu, Clem von Stengel, Callum McDougall, Lee Sharkey, Dan Braun, Aaron Scher, Stefan Heimersheim, Joe Benton, Robert Cooper, Asher Parker-Sartori, and probably a bunch of other people we’re unfairly forgetting now, for discussions and comments.

Attributions

In general, much happened in discussions, and many ideas of a member of the trio were built on top of previous ideas by another member. The following is a loose approximation, with many subtle and less subtle contributions omitted to keep it manageable.

The three authors would like to gratefully acknowledge Nix Goldowsky-Dill, who wrote an early version of the summary and helped with distillation (but declined to be named a coauthor). Jake and Kaarel posed the U-AND problem, providing the notions of representation involved. Dmitry came up with the first construction solving the U-AND tasks, as well as with the quadratic U-AND. Kaarel came up with the targeted superpositional AND. Jake led the write-up and editing efforts, with technical content largely based on informal notes by Kaarel; he also produced our finalized introductory sections based on Nix’s summary. The discussion and experiments comparing $ϵ$ -accuracy to loss functions are Jake’s.

Kaarel came up with the initial structured and unstructured QK circuit constructions. The structure-exploiting variant came out of a discussion between Dmitry and Kaarel, and the associated story about memorization and generalization had contributions from Dmitry, Jake, and Kaarel. Jake clarified and simplified these ideas considerably, and wrote most of the QK section. OV is from Kaarel. Dmitry and Jake came up with Universal Keys; Dmitry wrote that section. The three all contributed significantly to the section on open directions. The appendix is Kaarel’s, with some contributions by Dmitry and Jake.

Jake is a Research Scientist and Kaarel is a contractor at Apollo Research, and we would like to thank them for supporting this effort. Kaarel is a Research Scientist at Cadenza Labs. Dmitry is a post-doc at IHES.

Appendix: a note on linear readoffs, linear combinations, and almost orthogonality

This appendix is largely independent from the rest of the paper, other than that it explains a distinction between almost orthogonal overbases and the more general concept, which we will define, of linearly $ϵ$ -readable overbases, which is what we think might be what is actually learned by neural nets (and which has the same good behavior from the point of view of a neural net and linear readability). We plan to post a version of this as a separate post, as we think it is a useful distinction and a plausible source of confusion. For the point of view of the (synthetic) algorithms of the present paper, either of these concepts can be used for our basis of f-vectors (modulo some issues with controlling errors).

Here we discuss this idea and a failed attempt to find additional structure (similar to $ϵ$ -orthogonality) in linearly readable overbases. We then briefly discuss the possibility of linearly reading off features in the presence of linear relations between f-vectors, as well as a bound on the number of features that can be linearly read off in this setup.

The structure of activation vectors

Here’s the setup. We have a data set $X = {x_{1}, \dots, x_{D}}$ of inputs to a model that then produces a respective data set $A = {a_{1}, \dots, a_{D}} \subseteq R^{d}$ of activation vectors, with $a_{i} = a (x_{i})$ [To be clear: we are letting $a$ be the function that is implemented in the model to compute the activation vector in a particular activation space.]. For example, each $x_{i}$ might be a particular sentence, the model might be GPT-2, and the corresponding $a_{i}$ might be the residual stream activation vector at the last token position just after the fourth MLP. There are $m$ functions $f_{1}, \dots, f_{m} : X \to {0, 1}$ — we will think of these as the features (i.e., properties) of inputs which are represented in this particular activation space. We assume that we are in the superpositional regime: $m ≫ d$ , but for each $x \in X$ , the set of features which are on is small — in fact, that for each $x \in X$ , there are at most $ℓ ≪ d$ indices $i \in [m]$ with $f_{i} (x) = 1$ ^[35]. In fact, we assume that activation vectors are defined in terms of these properties in a particular linear way: that there are vectors ${\to f}_{1}, \dots, {\to f}_{m} \in R^{d}$ — we call these the f-vectors corresponding to the properties — such that $a (x) \approx \sum_{i = 1}^{m} f_{i} (x) {\to f}_{i}$ . Actually, let’s make this a precise equality just to make our job a bit easier; we assume that each activation vector is $a = {\to f}_{i_{1}} + {\to f}_{i_{2}} + \dots + {\to f}_{i_{ℓ^{'}}}$ for some $ℓ^{'} < ℓ$ and indices $i_{1}, i_{2}, \dots, i_{ℓ^{'}}$ . We’ll think of the compositeness $ℓ$ as a constant and $d$ as large (and $m$ larger still). In fact, we’ll primarily consider what happens asymptotically in $d$ . For a concrete example, one can take= $ℓ = 10$ , $d = 1000$ , $m = 100000$ , for example.

Linear readability and its consequences

To be able to directly compute other properties out our basic feature vectors, it would be good for each of these properties to be linearly readable, by which we mean that for each $i$ , there’s a vector ${\to r}_{i} \in R^{d}$ ^[36] such that ${\to r}_{i}^{T} \to a (x) \approx f_{i} (x)$ for all $x$ . Let’s say this again:

> Definition. Let $X$ be a set of inputs, let $\to a : X \to R^{n}$ give the corresponding activation vectors (in a particular position/layer in a given model). We say that $f_{1}, \dots, f_{m}$ are linearly readable up to error $ϵ$ from these activation vectors if there are vectors ${\to r}_{1}, \dots, {\to r}_{m} \in R^{d}$ such that for all $i \in [m]$ and $x \in X$ , we have $| {\to r}_{i}^{T} \to a (x) - f_{i} (x) | \leq ϵ$ ^[37].

Let’s think about what kinds of f-vector families ${\to f}_{1}, \dots, {\to f}_{m}$ would give rise to activation vectors from which $f_{1}, \dots, f_{m}$ are linearly readable up to error $ϵ$ . Let’s first note that if $| {\to r}_{i}^{T} {\to f}_{j} - δ_{i j} | \leq ϵ$ — let’s call this the f-vectors ${\to f}_{1}, \dots, {\to f}_{m}$ being linearly readable up to error $ϵ$ — then $f_{1}, \dots, f_{m}$ are linearly readable up to error $k ϵ$ ^[38]. Conversely, at least assuming the data set is rich enough to have a minimal pair for each feature $f_{i}$ , i.e. a pair of inputs $x_{1}, x_{2} \in X$ such that $f_{i^{'}} (x_{2}) - f_{i^{'}} (x_{1}) = δ_{i i^{'}}$ (think of this as a condition that the features should be sort of independent of each other — in particular, if there’s a feature whose value is uniquely determined by the values of other features, this would be false), the features being linearly readable up to error $ϵ$ from activation vectors implies that the f-vectors ${\to f}_{1}, \dots, {\to f}_{m}$ are linearly readable up to error $2 ϵ$ , too. So, at least for constant $k$ , features being linearly readable from activations is roughly the same as the underlying f-vectors being linearly readable. A precise statement we could make here is that if we fix some function $g (d)$ , then a sequence as $d \to \infty$ of such setups having features be linearly readable up to error $O (g)$ from activations is equivalent to the sequence of corresponding f-vector sets being linearly readable up to error $O (g)$ . So, while it is perhaps prima facie better-justified to ask for features being linearly readable up to error $ϵ$ from activation vectors, it’s (more or less) equivalent to ask for f-vectors being linearly readable up to error $ϵ$ , and this is mathematically nicer, so let’s proceed to think about that instead. If you are worried about this switch not being entirely rigorous, don’t be: the only thing we really logically need for what we’re about to say is that f-vectors being linearly readable up to error $ϵ$ implies that features are linearly readable from activations up to error $O (ϵ)$ . The reason this is sufficient for our express purpose of understanding whether linear readability of features implies that the f-vectors have some other interesting structure (perhaps structure that could help us identify f-vectors in practice^[39]) is that this implies that constructing a set of f-vectors ${\to f}_{1}, \dots, {\to f}_{m}$ which are linearly readable up to error $ϵ / k$ but that do not have some certain property also gives a construction where the corresponding features are linearly readable up to error $ϵ$ from activation vectors but the underlying ${\to f}_{1}, \dots, {\to f}_{m}$ do not have that property — just take the data set of activation vectors to consist of all sums of up to $k$ of the ${\to f}_{1}, \dots, {\to f}_{m}$ .

Let’s think about what kinds of collections ${\to f}_{1}, \dots, {\to f}_{m}$ are linearly readable up to error $ϵ$ . A choice of ${\to r}_{i}$ that might immediately suggest itself is ${\to r}_{i} = {\to f}_{i}$ ; the features being linearly readable up to error $ϵ$ with these ${\to r}_{i}$ is just the condition that the ${\to f}_{i}$ have squared norm within $ϵ$ of $1$ and are pairwise almost orthogonal: more precisely, with $\cdot$ denoting the standard inner product, for all $i \neq j$ , we have $| f_{i} \cdot f_{j} | \leq ϵ$ . Supposing the f-vectors have (about) unit norm, is something like being almost orthogonal also necessary given some reasonable assumptions? Well, we could have the f-vectors be almost orthogonal w.r.t. the standard inner product in some other basis, and we could then clearly linearly read stuff after writing the vectors in this basis, but we could also compose the basis change and the readoff into just a linear readoff, so being almost orthogonal in any basis suffices for ${\to f}_{1}, \dots, {\to f}_{m}$ to be linearly readable. And being almost orthogonal in some other basis doesn’t imply being almost orthogonal in the usual basis; e.g., consider the case where all the basis vectors are almost equal in the usual basis. Is being almost orthogonal in some basis required though? Also no! Let ${\to f}_{1}, \dots, {\to f}_{m} \in R^{d}$ be sampled in bundles: by taking $m / ℓ = e^{d^{0.99}}$ independent uniformly random unit vectors ${\to g}_{1}, \dots, {\to g}_{m / ℓ} \in R^{d}$ and then generating a batch of $ℓ = e^{n^{0.99}}$ f-vectors ${\to f}_{j}$ from each ${\to g}_{i}$ (namely, those with $j = ℓ i + 1, \dots, ℓ (i + 1)$ ) by adding another independent uniformly random vector ${\to v}_{i j}$ of length (let’s say) $\frac{1}{log n}$ to it: ${\to f}_{j} = {\to g}_{i} + {\to v}_{i j}$ . One can (with very high probability) read off every resulting ${\to f}_{j}$ just fine using ${\to r}_{j} = log n \cdot {\to v}_{i j}$ up to error $ϵ = o (1)$ . But with very high probability, there’s no basis in which these ${\to f}_{j}$ are almost orthogonal almost unit vectors up to error $ϵ^{'} = 1 / 10$ — see the appendix to this appendix for a sketch of a proof.

Let’s finish this section by mentioning a few variations on the above. What if we require readoff vectors to have norm bounded by a constant? (For instance, maybe (explicit or implicit) weight regularization would make this requirement reasonable.) The construction above but with ${\to v}_{i j}$ of length $1 / 100$ , scaled back down by $\sqrt{\frac{10000}{10001}}$, still provides a counterexample. (If we require ${\to r}_{i}$ to have norm very close to $1$ , then we’re forced to pick ${\to r}_{i} \approx {\to f}_{i}$ , and then ${\to f}_{i}$ indeed have to be almost orthogonal according to the canonical inner product, but that’s sort of silly.) What if we replace the requirement that features are almost unit vectors in the new basis with the weaker one that the features have norm between some two particular nonzero constants? One can still use the proof in the appendix-appendix to show that there’s no such basis. What if we get rid of any norm requirement (other than that the vectors are nonzero — but this is implied by a change of basis anyway), just requiring almost orthogonality in the sense that for any $j \neq j^{'}$ , we have ${\to f}_{j} \cdot {\to f}_{j}^{'} \leq ϵ | | {\to f}_{j} | | | | {\to f}_{j}^{'} | |$ in the new basis? Note that this is actually a less natural requirement in our context than it might first seem — this is because it doesn’t imply that the properties are linearly readable. But anyway, (1) we’re quite certain that the above is still a counterexample, (2) we haven’t thought very much about how to adapt the proof in the appendix-appendix to show it is, (3) the rest of the argument would work as in the appendix-appendix if one could show that it’s unlikely there’s a $B^{- 1}$ with $σ_{1} / σ_{n} > n^{100}$ .

Linear readability and linear relations

If the values of features $f_{1}, \dots, f_{m}$ vary independently, then any linear relation between their feature vectors with coefficients that are not too uneven will render reading them off from activations impossible. More precisely, suppose that ${\to f}_{i} = \sum_{j} a_{j} {\to f}_{j}$ . Then if there were a corresponding readoff vector ${\to r}_{i}^{T}$ , we’d have ${\to r}_{i}^{T} {\to f}_{i} = \sum_{j} a_{j} {\to r}_{i}^{T} {\to f}_{j}$ , so $1 = O (ϵ (1 + \sum_{j} a_{j}))$ . Unless $\sum_{j} a_{j} = Ω (1 / ϵ)$ — the sum of coefficients is big — we have a contradiction. If we put a bound on the norm of ${\to r}_{i}$ and the norms of ${\to f}_{j}$ , then an approximate linear relation $f_{i} \approx \sum_{j} a_{j} {\to f}_{j}$ also provides a similar contradiction. Similarly, a linear relation on ${\to r}_{j} = \sum_{j} a_{j} {\to r}_{j}$ with small coefficients (or an approximate version, given bounds on the vectors $| | {\to r}_{i} | |$ and $| | {\to f}_{i} | |$ also yields a contradiction.

However, if the values of properties do not vary independently, then linear relations between readoffs are totally fine. For example, if we have atomic properties $f_{1} : X \to {0, 1}$ , $f_{2} : X \to {0, 1}$ , and the following two properties derived from them: $f_{3} = f_{1} \land f_{2}$ and $f_{4} = f_{1} \lor f_{2}$ , and the activation vector in the standard basis is $\to a (x) = (f_{1} (x), f_{2} (x), f_{1} (x) \land f_{2} (x))$ , then we can read off the four properties with $0$ error with ${\to r}_{1} = (1, 0, 0), {\to r}_{2} = (0, 1, 0), {\to r}_{3} = (0, 0, 1), {\to r}_{4} = (1, 1, - 1)$ even though there is a linear relation between these readoff vectors, because there is a corresponding linear relation between the properties. Though there’s some arbitrary-feeling choice here, and in fact the choice we make is perhaps not the most natural, we may also see it as a linear combination of $4$ corresponding features between which there is a linear relation — we may expand $(f_{1} (x), f_{2} (x), f_{1} (x) \land f_{2} (x)) = f_{1} (x) (2, 1, 0) + f_{2} (x) (1, 2, 0) + f_{3} (x) (- 1, - 1, 1) + f_{4} (x) (- 1, - 1, 0)$ . This merits more thought.

A bound on the number of linearly readable features

A simple restatement of the features being linearly $ϵ$ -readable is that, letting $F$ denote the $m \times n$ matrix whose rows are ${\to f}_{1}, \dots, {\to f}_{m}$ , there’s an $n \times m$ matrix $R$ such that $F R$ has $L^{\infty}$ distance at most $ϵ$ from the identity matrix. Given this translation, Theorem 1.1 here tells us that if ${\to f}_{1}, \dots, {\to f}_{m} \in R^{d}$ are linearly readable up to error $ϵ$ , then $m \leq e^{C ϵ^{2} log (\frac{1}{ϵ}) d}$ . Or see here for a neat proof of the same upper bound in the subcase where we force ${\to r}_{i} = {\to f}_{i}$ . And both bounds are tight up to the $log (1 / ϵ)$ factor in the exponent since a set of $e^{C ϵ^{2} d}$ random unit vectors is almost orthogonal with high probability — this provides some very weak sense in which linear readability doesn’t give more flexibility than almost-orthogonality.

Appendix to the appendix

Here’s a sketch of a proof that there is no basis in which construction provided above is almost-orthogonal (if you have a neater proof, let us know). We’re dropping arrows on vectors here. (Here, $f_{i}$ always denotes the vector.)

Let us consider what needs to be the case if there is a basis which makes the $f_{j}$ almost unit and almost orthogonal with parameter $ϵ^{'}$ . Let a linear map that takes a vector to its representation in such a basis be $B^{- 1}$ . We have ${max}_{v \in S^{n - 1}} | | B^{- 1} v | | = σ_{1}$ , the top singular value of $B^{- 1}$ , in fact with $B^{- 1} v_{1} = σ_{1} u_{1}$ in terms of the top respectively right and left singular vectors of $B^{- 1}$ . Up to replacing $ϵ^{'} \leftarrow 2 ϵ^{'}$ , we can always assume that the smallest singular value $σ_{n}$ is at least $ϵ^{'} / 100$ — this is because one can replace $B^{- 1}$ with a matrix with the same SVD but with singular values shifted up by $ϵ^{'} / 100$ — one can check that this does not affect dot products by more than $ϵ^{'}$ . Additionally, note that the max of the three numbers $\frac{| | B^{- 1} g_{i} | |}{| | g_{i} | |}$ and $\frac{| | B^{- 1} v_{i j} | |}{| | v_{i j} | |}$ and $\frac{| | B^{- 1} v_{i j^{'}} | |}{| | v_{i j^{'}} | |}$ (for some $j \neq j^{'}$ ) was ever within a factor of $\sqrt{log n}$ of the min of these three numbers, then $B^{- 1} f_{j}$ having almost unit norm would imply that $B^{- 1} g_{i}$ also has almost unit norm, and then one could derive a contradiction from the requirement that $(B^{- 1} f_{j}) \cdot (B^{- 1} f_{j^{'}}) = O (ϵ^{'})$ . It follows that be that for any $i$ , $\frac{| | B^{- 1} g_{i} | |}{| | g_{i} | |}$ is at least $\sqrt{log n}$ times larger than $\frac{| | B^{- 1} v_{i j} | |}{| | v_{i j} | |}$ for all but at most one index $j$ from its bundle. For this index, we still have that $\frac{| | B^{- 1} g_{i} | |}{| | g_{i} | |} \geq \frac{| | B^{- 1} v_{i j} | |}{| | v_{i j} | |} \frac{σ_{n}}{σ_{1}}$ It then follows that

$\frac{| | B^{- 1} g_{i} | |}{| | g_{i} | |} \geq {(\frac{σ_{n}}{σ_{1}} (log n)^{e^{n^{0.99}} / 2 - 1})}^{1 / e^{n^{0.99}}} {(\prod j \frac{| | B^{- 1} v_{i j} | |}{| | v_{i j} | |})}^{1 / e^{n^{0.99}}}$

Intuitively, this is saying that $B^{- 1}$ applies a systematically larger scaling to $v_{i j}$ than to $g_{i}$ .

However, one can use a pair of arguments using nets that with high probability, there is no matrix $B^{- 1}$ satisfying all these properties.

First, with high probability, there is no such matrix with $σ_{1} \geq n$ . This is because we can show that with high probability, for every such matrix, there is some $f_{j}$ with $| | B^{- 1} f_{j} | | \geq 2$ . Indeed, one can show that with high probability, for every unit vector $v$ at once, there is some $f_{j} = f_{j} (v)$ so that $f_{j} \cdot v \geq \frac{1}{\sqrt{n}}$ ; in particular, such a $f_{j}$ thus exists for the top right singular vector $v_{1}$ , and then expanding $f_{j}$ in the basis of right singular vectors easily gives $| | B^{- 1} f_{j} | | = Ω (\sqrt{n})$ .

A sketch of a proof that with high probability, for every vector on the sphere at once, there is an $f_{j}$ which is near it in this sense: before we sample the $f_{j}$ , we pick an appropriate net — for us, this will be a set on the sphere such that for each point on the sphere, some point on the net is closer than (let’s say) $ε = \frac{1}{n}$ to it. To construct such a net, keep adding points on the sphere arbitrarily, making sure that each point added has distance at least $ε$ to all previously added points, until we get stuck. In fact, we must get stuck after at most ${(\frac{2}{ϵ / 2})}^{n} = (4 n)^{n} \leq O (e^{2 n log n})$ points because balls of radius $\frac{ε}{2}$ around added points must be disjoint and contained in a ball around the origin of radius $2$ . When we get stuck, every point on the sphere has distance at most $ε = \frac{1}{n}$ to some chosen point, so we have a desired net with $O (e^{2 n log n})$ points. For a point in this net, the probability that no $f_{j}$ has dot product at least $\frac{2}{\sqrt{n}}$ with it is at most $c^{- e^{n^{0.99}}}$ for some $c < 1$ . As the size of the net is only singly exponential, so we can easily union-bound over the net to say that with high probability, for every point of the net, there is some corresponding $f_{j}$ with dot product at least $\frac{2}{\sqrt{n}}$ with that point of the net. If that happens, for any point $u$ on the sphere, we get that there is a $f_{j}$ with dot product at least $\frac{1}{\sqrt{n}}$ with it as well, because there’s a point of the net closer to $u$ than $\frac{1}{n}$ , let’s call this point $s$ , and there is a $f_{j}$ with $s \cdot f_{j} \geq \frac{2}{\sqrt{n}}$ , so $u \cdot f_{j} = s \cdot f_{j} + (u - s) \cdot f_{j} \geq \frac{2}{\sqrt{n}} - \frac{1}{n} \geq \frac{1}{\sqrt{n}}$

Secondly, with high probability, there is also no such matrix $B^{- 1}$ with $σ_{1} \leq n$ . In this case, we use a Frobenius norm $ϵ^{'} / 10000$ net in the set of all matrices with $σ_{1} \leq n$ and $σ_{n} \geq ϵ^{'} / 100$ . Since the entries of any such matrix are bounded by some polynomial in $n$ , a similar volume argument as the one in the previous paragraph applied to balls in a cube in $R^{n^{2}}$ shows that there exists such a net of size $exp (p o l y (n))$ . Since the Frobenius norm is an upper bound of the operator norm, this net also serves as an $ϵ^{'} / 10000$ net w.r.t. the operator norm. This guarantees that for every such matrix $M$ and any nonzero vector $v \in R^{n}$ , there is a net element $N$ with $\frac{| | M v | |}{| | v | |}$ differing from $\frac{| | N v | |}{| | v | |}$ by at most $1 %$ . We now consider $log ⎛ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ \frac{{(\prod_{i} \frac{| | N g_{i} | |}{| | g_{i} | |})}_{i}^{1 / e^{n^{0.99}}}}{{(\prod_{i} \prod_{j} \frac{| | N v_{i j} | |}{| | v_{i j} | |})}_{i}^{1 / e^{2 n^{0.99}}}} ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ = \frac{\sum_{i} log (\frac{| | N g_{i} | |}{| | g_{i} | |})}{e^{n^{0.99}}} - \frac{\sum_{i j} log (\frac{| | N v_{i j} | |}{| | v_{i j} | |})}{e^{2 n^{0.99}}}$ Each of these summands is between $log ϵ^{'} / 100 = log 1 / 500$ and $log n$ , so we can apply https://en.wikipedia.org/wiki/Hoeffding%27s_inequality to conclude that the probability of a deviation of $log ((log n)^{1 / 3})$ from the expected value of $0$ is less than $e^{- e^{n^{0.99}} / (100 {log}^{2} n)}$ . So this never happens for any matrix $N$ in the net by a union bound over the merely $exp (p o l y (n))$ matrices in the net. Since any matrix with $σ_{1} \leq n$ and $σ_{n} \geq ϵ^{'} / 100$ has a matrix $N$ in the net for which their respective expressions differ by at most $0.01$ , it follows that there’s no such matrix $B^{- 1}$ with $σ_{1} \leq n$ with

$\prod i \frac{| | B^{- 1} g_{i} | |}{| | g_{i} | |} \geq \prod i ⎛ ⎜ ⎜ ⎝ {(\frac{σ_{n}}{σ_{1}} (log n)^{e^{n^{0.99}} / 2 - 1})}^{1 / e^{n^{0.99}}} {(\prod j \frac{| | B^{- 1} v_{i j} | |}{| | v_{i j} | |})}^{1 / e^{n^{0.99}}} ⎞ ⎟ ⎟ ⎠$

Since there being a basis in which this set of vectors is almost orthogonal implies that one of the two things we’ve considered above happens, and each happens with probability $o (1)$ , one of them happening also has probability $o (1)$ . So w.h.p., neither happens — and so w.h.p., there’s no basis in which this set of vectors is almost orthogonal.

^
Kaarel and Jake would also be interested in distributing microgrants to such people if someone would like to fund this please get in touch
^
Up to a log factor in the number of neurons
^
In practice for the specific case of wheels and doors, the sum of these features would work similarly well. However, this is just an illustrative example of a boolean function. As we discuss in the body of the text, being able to compute any boolean function is much more expressive than only computing linear functions. Perhaps a better example specific to a transformer is the feature “will_smith” = “will@previous_token” AND “smith@this_token”.
^
In the sense of taking only a small number of inputs
^
These are called feature representation vectors, feature embedding vectors, and feature directions in Toy Models of Superposition, and feature embedding vectors in Polysemanticity and Capacity. We like the term feature vectors but this is used already to mean the input vector which stores features.
^
The precise formula is that for some constant $C$ , up to $exp (C ϵ^{2} d)$ random vectors will be $ϵ$ -almost-orthogonal with probability approaching $1$ .
^
Really, $ϵ$ and $ℓ$ could be functions of other parameters, but let’s ignore that.
^
In fact, we will abuse notation a bit in this paragraph by using $x$ to denote both a binary string and its input embedding, only distinguishing them with the use of an overarrow.
^
Although there are some subtleties here, and it’s not obvious that small $p$ always improves the worst-case interference, even though it does minimise the expected interference.
^
One might be able to get a better bound here, perhaps by using something sharper than a Chernoff bound, more appropriate for far tails of the binomial distribution with very small $p$ — we haven’t thought carefully about optimizing this error term.
^
assuming that one allows a fixed input of 1, which one can implement as an offset
^
See section 1.2 for a way to efficiently compute ANDs of multiple inputs in a single layer, which may dramatically improve the efficiency of the computation of suitable circuits]
^
Maybe it’s fine if some neurons misfire as long as the total signal on the $| S_{i} \cap S_{j} |$ neurons in a pairwise intersection beats the total noise? We think maybe this lets one do up to about $r d_{0}$ inputs per neuron, and one might get up to about $m = \sqrt{r} d_{0} \sqrt{d} \leq d_{0}^{3 / 2} \sqrt{d}$ input features this way. So this might get one a little further.
^
While this appears worse than U-AND in the regime in which U-AND works, it is actually not because the construction below also solves the U-AND task in that regime. There might be a way to interpolate between U-AND and this construction — we speculate on this in the open directions.
^
To see this, for example note that monomial decomposition in boolean algebra implies that any circuit can be written as a large XOR of multi-input ANDs; now a multi-input XOR can be written as a linear combination of AND circuits using a modified inclusion-exclusion. For a more geometrical picture, consider that a boolean circuit can be thought of as a complicated Venn Diagram with $k$ overlapping regions, with a 1 or a 0 assigned to each of the $2^{k}$ regions including the outside. To recreate a particular boolean function out of ANDs, start by choosing the fan-in-0 AND (a constant) to have a coefficient equal to the value of the function outside all circles. Then add in each fan-in-1 AND (just the variables) with coefficients that ensure that all the regions in just 1 circle have the correct value. Then add in the fan-in-2 ANDs with coefficients that fix the function value on pairwise intersections. Then fan-in-3 for the triple intersections, and so on, with the coefficients of the $2^{k}$ ANDs of fan-in up to $k$ each being constrained by exactly one region of the diagram
^
We haven’t carefully thought about which method is better in some more meaningful sense though. Both of these constructions work for choices of $k$ up to around $polylog (d_{0})$ , at which point the noise starts to become an issue.
^
The factor $\frac{1}{2 ℓ}$ can be replaced by any value $< \frac{1}{ℓ}$
^
Suppose that $w_{μ, ν}$ is a vector on $R^{r^{2}}$ such that the dot product $w_{μ, ν} \cdot Q (v) = q_{μ, ν} (v),$ for $q_{μ, ν} (v) = (v \cdot e_{μ}) (v \cdot e_{ν})$ the quadratic function. Note that we can choose $w_{μ, ν} = w_{ν, μ} .$ Then the linear readoff function $Φ_{i, j}$ is given by taking dot product with the readoff vector $w_{i, j} := \frac{1}{2} \sum (v_{i} \cdot e_{μ}) (v_{j} \cdot e_{ν}) w_{μ ν} .$
^
By distributivity, this expression has $s^{2}$ terms of the form $ϕ_{i, j} (v_{i^{'}}, v_{j^{'}})$ , all of which except possibly $ϕ_{i, j} (v_{i}, v_{j}) = 1$ are bounded by $2 ϵ$ , giving the result. But in fact, one can get a better bound by noting that $| ϕ_{i, j} (v_{i^{'}}, v_{j^{'}}) | < ϵ$ when $(i, j)$ and $(i^{'}, j^{'})$ do not share an index.
^
In fact, $O ((\frac{m + e}{e}))$ is sufficient
^
Note that the efficiency gain from universal keys is bounded by the size of the context window: for example, one can convert a transformer to an MLP at the cost of making the layers much wider, thus neutralizing the information asymmetry. However, in the asymptotic where the size of the context window goes to infinity, these methods do seem to asymptotically improve the expressivity of boolean circuits one can execute in a superpolynomial way compared to previously known methods
^
This paper provides a more careful analysis of the same topic. $V$ -information might also be relevant. But we’ve only skimmed each paper.
^
Or we can see it as precisely computing a family of functions which record the number inputs in a particular subset are present on the input, minus one.
^
Of course, there is really structure in this family of subsets — they come from intersections of larger subsets, meaning they can be specified more succinctly than this — the point we are making is precisely that it is natural to forget that structure in the superposition picture.
^
Note that if we insist that the output is normalised, then the maximum L2 distance of a unit vector from our target 1-hot vector, with individual entries differing by at most $e p s i l o n$ , is of order $e p s i l o n$ . In this case the two notions of successful reconstruction are aligned. One might think that the presence of layernorm in real models precisely normalises vectors in this way, but this is neglecting to remember that our target $(1, 0, 0, \dots)$ is only tacked onto the end of the architecture to demonstrate that all the AND features are linearly represented immediately after the ReLU. The part of our toy model that corresponds to the part of a neural network with layernorm would be the activation vector immediately after the ReLUs, which contains a sparse feature basis. Layernorm applied to this vector would not do much, and would not correspond to the final large vector being normalised.
^
Related.
^
Much like it’s not a very novel idea that a ReLU layer might compute boolean functions of features, we do not claim that the idea that the QK part of an attention head could check for one of some set of pairs of features is very novel, though we don’t know of this task having been made precise in the way we do before.
^
Nevertheless, we think that morally, the first notion is what’s needed — that there could be a version of this section which only uses a slightly stricter version of the first notion.
^
This method is slightly unsatisfactory because it doesn’t treat the row space and the column space equivalently. This can be solved by writing ${^W}_{Q K}$ as a sum of pure tensors using the SVD and including only the $d_{head}$ pure tensors with the highest singular values, which also has the advantage of being the best approximation to ${^W}_{Q K}$ (in the sense of Frobenius norm distance or operator norm distance), and therefore which will give us the best signal to noise ratio. The reason why we don’t do this here is because it is hard to reason about the distribution of singular values, and it doesn’t seem trivial to argue that the singular vectors are ‘independent’ of the f-vectors. We think that the details do work out even though we can’t prove it and that in practice, the optimal algorithm involves taking this best low-rank approximation of ${^W}_{Q K}$ instead of a random one. However, we expect that this only improves the signal to noise ratio (and hence the number of bigrams we can check for) by a constant factor, because all the singular values of a random gaussian matrix live at the same scale (see here). In more detail:
We take its SVD ${^W}_{Q K} = \sum_{j = 1}^{d_{resid}} σ_{j} {\to u}_{j} {\to v}_{j}^{T}$ , and we let the bilinear form be the best rank $d_{head}$ approximation of ${^W}_{Q K}$ , i.e., $W_{Q K} = \sum_{j = 1}^{d_{head}} σ_{j} {\to u}_{j} {\to v}_{j}^{T}$ .
Entries of $^W$ are a sum over $| P |$ products of two i.i.d. gaussian random variables. We don’t know how to say this rigorously (although we think this is the kind of thing which is easy to check experimentally), but we think that in the relevant range of $| P |$ (maybe let’s say $| P | = d_{resid} d_{head} / {log}^{2} d_{resid}$ ), the matrix $^W$ is pretty much distributed as a random matrix with i.i.d. gaussian entries. We’re probably not in the range where this becomes a trivial consequence of the multivariate CLT, because $| B |$ , the number of terms, will not be big compared to $d_{resid}^{2}$ , the number of entries. The singular values of gaussian matrices are understood well (e.g. see the article on the Pastur Distribution); the basic thing we’ll assume now (that we’re 98% sure is true) is that basically all the singular values of such a matrix live at the same scale, i.e. there is a size $s$ (that depends on $| B |$ and $d_{resid}$ ) such that all but the smallest $1 %$ of singular values are between $s / 1000$ and $s$ .
If we assume this, it becomes easy to understand the size of noise in our QK-circuit, i.e. to understand ${\to n}_{1}^{T} W_{Q K} {\to n}_{2} = {\to n}_{1}^{T} (\sum_{j = 1}^{d_{head}} σ_{j} {\to u}_{j} {\to v}_{j}^{T}) {\to n}_{2}$ in the case that ${\to n}_{1}, {\to n}_{2}$ are random unit vectors. This is a linear combination of a bunch of things (i.e., $σ_{j}$ ) of size roughly $s$ with coefficients (i.e., $({\to n}_{1} \cdot {\to u}_{j}) \cdot ({\to v}_{j} \cdot {\to n}_{2})$ ) which are roughly independent and have distributions which are symmetric around $0$ and which have size roughly $1 / d_{resid}$ . In particular, it has size on the order of $\frac{s \sqrt{d_{head}}}{d_{resid}}$ .
To find $s$ : Since the noise term ${\to n}_{1}^{T} {^W}_{Q K} {\to n}_{2} = {\to n}_{1}^{T} (\sum_{j = 1}^{d_{resid}} σ_{j} {\to u}_{j} {\to v}_{j}^{T}) {\to n}_{2} = {\to n}_{1}^{T} (\sum_{(i, j) \in P} {\to f}_{j} \otimes {\to f}_{i}) {\to n}_{2}$ has size on the order of $\frac{s \sqrt{d_{resid}}}{d_{resid}}$ but also on the order of $\frac{\sqrt{| P |}}{d_{resid}}$ , we have that $s$ is about $\frac{\sqrt{| P |}}{\sqrt{d_{resid}}}$ , and the noise is of order $\sqrt{\frac{d_{head} | P |}{d_{resid}^{3}}}$ . (There are also other ways to compute the scale of $s$ or the scale of the noise.)
As for the size of the signal: as in the main text we have ${\to a}_{t}^{T} {^W}_{Q K} {\to a}_{s} \approx 1$ . Assuming this signal ‘distributes nicely over the SVD’ (sketchiest step by far, but probably right for $m ≫ d_{resid}$ and another thing which would be easy to check with an experiment), i.e. given $1 \approx {\to a}_{s}^{T} W_{Q K} {\to a}_{t} = \sum_{j = 1}^{d_{resid}} σ_{j} {\to a}_{s}^{T} {\to u}_{j} {\to v}_{j}^{T} {\to a}_{t}$ , we can conclude ${\to a}_{s}^{T} W_{Q K} {\to a}_{t} \approx \frac{\sum_{j = 1}^{d_{head}} σ_{j}}{\sum_{j = 1}^{d_{resid}} σ_{j}}$ ; this is on the order of $\frac{d_{head}}{d_{resid}}$ given the fixed scale assumption from the previous paragraph. Also importantly, it is $\frac{d_{head}}{d_{resid}}$ times some constant independent of the pair (that can be computed by integrating the Pastur Distribution) — this means that the improvement the SVD gives over a random projection is only a constant amount. (We also wrote a bit of code before we understood how to figure this SVD thing out conceptually — it seems to work empirically as well.)]
^
Again, this asymmetry would not be present if we used the SVD instead.
^
Though this can be salvaged, e.g. with the language of arithmetic circuits from Appendix D.1 in Christiano et al.
^
Again, using a low-rank approximation given by the SVD is more natural, though again, it doesn’t look like it gives an improvement of more than a constant factor here.
^
More generally, we want our interpretability techniques not to fail silently, and to tell us how they are failing. We expect that if someone is able to get a good example of a task which involves computation that is truly in superposition throughout, this will be a good testbed for studying which interpretability techniques can be misleading. Can SAEs recover the correct AND features? Do analyses based on the neuron basis or SVD lead to spurious results?
^
For example, if layer $L$ has $f (L)$ pairwise AND nodes, (except for the first layer, which has input nodes) then if $l$ nodes are on in layer $L$ , (assuming the inputs to each AND are chosen independently uniformly at random) the expected number of nodes which are on in layer $L + 1$ is $f (L + 1) \cdot \frac{l}{f (L)} \frac{l - 1}{f (L)}$ . So we’d get steady-state behavior of the number of nodes which are on in expectation (this is a priori distinct from some actual convergence guarantee though; we’re just making it a martingale) iff $f (L + 1) \cdot \frac{k}{f (L)} \frac{k - 1}{f (L)} = k$ , so $f (L + 1) = \frac{f (L)^{2}}{k - 1}$
^
Assuming each feature is on roughly equally often, a double counting argument says that this is roughly the same as each feature only being active on at most about a particularly small fraction of all inputs: $\frac{| p_{j}^{- 1} (1) |}{D} \approx \frac{ℓ}{m} ≪ \frac{d}{m}$ .
^
Well, more precisely, you should maybe think of this $R^{d}$ as the dual space of the activation space $R^{d}$ , i.e., of each ${\to r}_{i}$ as a linear function on activation space, ${\to r}_{i} : R^{d} \to R$ .
^
We could also weaken this so that maybe we’re fine with some very small number of errors — of probe outputs outside this range. The story to follow a fortiori also holds with this weaker definition.
^
This is a worst-case bound; in nice cases, the typical error should be more like $\sqrt{k} ϵ$ .
^
Well, being linearly readable up to error $ϵ$ is already directly structure that might be helping us find f-vectors in practice — it seems plausible that this is related to sparse autoencoders with linearly computed coefficients making sense (compared to e.g. more canonical sparse coding methods) — though unclear if this can be squared with the ReLU in their hidden layer (or if that ReLU can be squared with this).

What links here?

Dmitry Vaintrob, jake_mendel and Kaarel

Jan 18, 2024, 9:06 PM

LW: 205 AF: 90

18 comments63 min readLW link

Interpretability (ML & AI)Superposition AI

Lucius Bushnaq Jan 18, 2024, 11:30 PM
58 points
9
Well. Damn.

As a vocal critic of the whole concept of superposition, this post has changed my mind a lot. An actual mathematical definition that doesn’t depend on any fuzzy notions of what is ‘human interpretable’, and a start on actual algorithms for performing general, useful computation on overcomplete bases of variables.

Everything I’ve read on superposition before this was pretty much only outlining how you could store and access lots of variables from a linear space with sparse encoding, which isn’t exactly a revelation. Every direction is a float, so of course the space can store about float precision to the $n$ -th power different states, which you can describe as superposed sparse features if you like. But I didn’t need to use that lens to talk about the compression. I could just talk about good old non-overcomplete linear algebra bases instead. The $\leq n$ basis vectors in that linear algebra description being the compositional summary variables the sparse inputs got compressed into. If basically all we can do with the ‘superposed variables’ is make lookup tables of them, there didn’t seem to me to be much need for the concept at all to reverse engineer neural networks. Just stick with the summary variables, summarising is what intelligence is all about.
If we can do actual, general computation with the sparse variables? Computations with internal structure that we can’t trivially describe just as well using $\leq n$ floats forming the non-overcomplete linear basis of a vector space? Well, that would change things.

As you note, there’s certainly work left to do here on the error propagation and checking for such algorithms in real networks. But even with this being an early proof of concept, I do now tentatively expect that better-performing implementations of this probably exist. And if such algorithms are possible, they sure do sound potentially extremely useful for an LLM’s job.

On my previous superposition-skeptical models, frameworks like the one described in this post are predicted to be basically impossible. Certainly way more cumbersome than this looks. So unless these ideas fall flat when more research is done on the error tolerance, I guess I was wrong. Oops.
Thane Ruthenis Jan 19, 2024, 7:40 AM
LW: 19 AF: 9
8
AF
Haven’t read everything yet, but that seems like excellent work. In particular, I think this general research avenue is extremely well-motivated.
Figuring out how to efficiently implement computations on the substrate of NNs had always seemed like a neglected interpretability approach to me. Intuitively, there are likely some methods of encoding programs into matrix multiplication which are strictly ground-truth better than any other encoding methods. Hence, inasmuch as what the SGD is doing is writing efficient programs on the NN substrate, it is likely doing so by making use of those better methods. And so nailing down the “principles of good programming” on the NN substrate should yield major insights regarding how the naturally-grown NN circuits are shaped as well.
This post seems to be a solid step in that direction!
LawrenceC Jan 18, 2024, 11:27 PM
LW: 15 AF: 7
1
AF
(I haven’t had the chance to read part 3 in detail, and I also haven’t checked the proofs except insofar as they seem reasonable on first viewing. Will probably have a lot more thoughts after I’ve had more time to digest.)
This is very cool work! I like the choice of U-AND task, which seems way more amenable to theoretical study (and is also a much more interesting task) than the absolute value task studied in Anthropic’s Toy Model of Superposition (hereafter TMS). It’s also nice to study this toy task with asymptotic theoretical analysis as opposed to the standard empirical analysis, thereby allowing you to use a different set of tools than usual.
The most interesting part of the results was the discussion on the universality of universal calculation—it reminds me of the interpretations of the lottery ticket hypothesis that claim some parts of the network happen to be randomly initialized to have useful features at the start.
Some examples that are likely to be boolean-interpretable are bigram-finding circuits and induction heads. However, it’s possible that most computations are continuous rather than boolean^[31].
My guess is that most computations are indeed closer to continuous than to boolean. While it’s possible to construct boolean interpretations of bigram circuits or induction heads, my impression (having not looked at either in detail on real models) is that neither of these cleanly occur inside LMs. For example, induction heads demonstrate a wide variety of other behavior, and even on induction-like tasks, often seem to be implementing induction heuristics that involve some degree of semantic content.
Consequently, I’d be especially interested in exploring either the universality of universal calculation, or the extension to arithmetic circuits (or other continuous/more continuous models of computation in superposition).
Some nitpicks:
The post would probably be a lot more readable if it were chunked into 4. The 88 minute read time is pretty scary, and I’d like to comment only on the parts I’ve read.
Section 2:
Two reasons why this loss function might be principled are
1. If there is reason to think of the model as a Gaussian probability model
2. If we would like our loss function to be basis independent
A big reason to use MSE as opposed to eps-accuracy in the Anthropic model is for optimization purposes (you can’t gradient descent cleanly through eps-accuracy).
Section 5:
4 How relevant are our results to real models?
This should be labeled as section 5.
Appendix to the Appendix:
Here, $f_i$ always denotes the vector.
[..]
with \[\sigma_1\leq n\) with
(TeX compilation failure)
- LawrenceC Jan 19, 2024, 1:04 AM
  15 points
  3
  Parent
  Also, here’s a summary I posted in my lab notes:
  A few researchers (at Apollo, Cadenza, and IHES) posted this document today (22k words, LW says ~88 minutes).
  
  They propose two toy models of computation in superposition.
  
  First, they posit a MLP setting where a single layer MLP is used to compute the pairwise ANDs of m boolean input variables up to epsilon-accuracy, where the input is sparse (in the sense that l < m are active at once). Notably, in this set up, instead of using O(m^2) neurons to represent each pair of inputs, you can instead use O(polylog(m)) neurons with random inputs, and “read off” the ANDs by adding together all neurons that contain the pair of inputs. They also show that you can extend this to cases where the inputs themselves are in superposition, though you need O(sqrt(m)) neurons. (Also, insofar as real neural networks implement tricks like this, this probably incidentally answers the Sam Mark’s XOR puzzle.)
  
  They then consider a setting involving the QK matrix of an attention head, where the task is to attend to a pair of activations in a transformer, where the first activation contains feature i and the second feature j. While the naive construction can only check for d_head bigrams, they provide a construction involving superposition that allows the QK matrix to approximately check for Theta(d_head * d_residual) bigrams (that is, up to ~parameter count; this involves placing the input features in superposition).
  
  If I’m understanding it correctly, these seem like pretty cool constructions, and certainly a massive step up from what the toy models of superposition looked like in the past. In particular, these constructions do not depend on human notions of what a natural “feature” is. In fact, here the dimensions in the MLP are just sums of random subsets of the input; no additional structure needed. Basically, what it shows is that for circuit size reasons, we’re going to get superposition just to get more computation out of the network.
- jake_mendel Jan 19, 2024, 6:35 PM
  LW: 6 AF: 3
  0
  AF Parent
  Thanks for the kind feedback!
  I’d be especially interested in exploring either the universality of universal calculation
  Do you mean the thing we call genericity in the further work section? If so, we have some preliminary theoretical and experimental evidence that genericity of U-AND is true. We trained networks on the U-AND task and the analogous U-XOR task, with a narrow 1-layer MLP and looked at the size of the interference terms after training with a suitable loss function. Then, we reinitialised and froze the first layer of weights and biases, allowing the network only to learn the linear readoff directions, and found that the error terms were comparably small in both cases.
  This figure is the size of the errors for $d = d_{0} = 100$ (which is pretty small) for readoffs which should be zero in blue and one in yellow (we want all these errors to be close to zero).
  This suggests that the AND/XOR directions were $ϵ$ -linearly readoffable at initialisation, but the evidence at this stage is weak because we don’t have a good sense yet of what a reasonable value of $ϵ$ is for considering the task to have been learned correctly: to answer this we want to fiddle around with loss functions and training for longer. For context, an affine readoff (linear + bias) directly on the inputs can read off $f_{1} \land f_{2}$ with $({\to f}_{1} + {\to f}_{2}) / 2 - 1 / 4$ , which has an error of $ϵ = 1 / 4$ . This is larger than all but the largest errors here, and you can’t do anything like this for XOR with affine readoff.
  
  After we did this, Kaarel came up with an argument that networks randomly initialised with weights from a standard Gaussian and zero bias solve U-AND with inputs not in superposition (although it probably can be generalised to the superposition case) for suitable readoffs. To sketch the idea:
  
  Let $W_{i}$ be the vector of weights from the $i$ th input to the neurons. Then consider the linear readoff vector with $k$ th component given by:
  $({- -- \to AND}_{i, j})_{k} = α (1_{W_{i k} > 0} \land 1_{W_{j k} > 0}) + β 1_{W_{i k} > 0} + γ 1_{W_{j k} > 0} + δ$
  where $1$ is the indicator function. There are 4 free parameters here, which are set by 4 constraints given by requiring that the expectation of this vector dotted with the activation vector has the correct value in the 4 cases $f_{i}, f_{j} \in {0, 1}$ . In the limit of large $d$ the value of the dot product will be very close to its expectation and we are done. There are a bunch of details to work out here and, as with the experiments, we aren’t 100% sure the details all work out, but we wanted to share these new results since you asked.
  A big reason to use MSE as opposed to eps-accuracy in the Anthropic model is for optimization purposes (you can’t gradient descent cleanly through eps-accuracy).
  We’ve suggested that perhaps it would be more principled to use something like $L^{p}$ loss for larger $p$ than 2, as this is closer to $ϵ$ -accuracy. It’s worth mentioning that we are currently finding that the best loss function for the task seems to be something like $L^{p}$ with extra weighting on the target values that should be $1$ . We do this to avoid the problem that if the inputs are sparse, then the ANDs are sparse too, and the model can get good loss on $L^{p}$ (for low $p$ ) by sending all inputs to the zero vector. Once we weight the ones appropriately, we find that lower values of $p$ may be better for training dynamics.
  or the extension to arithmetic circuits (or other continuous/more continuous models of computation in superposition)
  We agree and are keen to look into that!
  (TeX compilation failure)
  Thanks—fixed.
  What links here?
  - Computational Superposition in a Toy Model of the U-AND Problem by Adam Newgas (Mar 27, 2025, 4:56 PM; 17 points)
  - LawrenceC Jan 19, 2024, 7:55 PM
    LW: 2 AF: 2
    0
    AF Parent
    Fascinating, thanks for the update!
  - Hoagy Jan 22, 2024, 6:25 AM
    LW: 1 AF: 1
    0
    AF Parent
    How are you setting $p$ when $d_{0} = 100$ ? I might be totally misunderstanding something but ${log}^{2} (d_{0}) / \sqrt{d} \approx 2.12$ at $d_{0} = d = 100$ - feels like you need to push $d$ up towards like 2k to get something reasonable? (and the argument in 1.4 for using $\frac{1}{{log}^{2} d_{0}}$ clearly doesn’t hold here because it’s not greater than $\frac{{log}^{2} d_{0}}{d^{1 / k}}$ for this range of values).
    - jake_mendel Jan 23, 2024, 5:32 PM
      3 points
      1
      Parent
      So, all our algorithms in the post are hand constructed with their asymptotic efficiency in mind, but without any guarantees that they will perform well at finite $d$ . They haven’t even really been optimised hard for asymptotic efficiency—we think the important point is in demonstrating that there are algorithms which work in the large $d$ limit at all, rather than in finding the best algorithms at any particular $d$ or in the limit. Also, all the quantities we talk about are at best up to constant factors which would be important to track for finite $d$ . We certainly don’t expect that real neural networks implement our constructions with weights that are exactly 0 or 1. Rather, neural networks probably do a messier thing which is (potentially substantially) more efficient, and we are not making predictions about the quantitative sizes of errors at a fixed $d$ .
      In the experiment in my comment, we randomly initialised a weight matrix with each entry drawn from $N (0, 1)$ , and set the bias to zero, and then tried to learn the readoff matrix $R$ , in order to test whether U-AND is generic. This is a different setup to the U-AND construction in the post, and I offered a suggestion of readoff vectors for this setup in the comment, although that construction is also asymptotic: for finite $d$ and a particular random seed, there are almost definitely choices of readoff vectors that achieve lower error.
      FWIW, the average error in this random construction (for fixed compositeness; a different construction would be required for inputs with varying compositeness) is (we think) $Θ (1 / \sqrt{d})$ with a constant that can be found by solving some ugly gaussian integrals but I would guess is less than 10, and the max error is $Θ (log d / \sqrt{d})$ whp, with a constant that involves some even uglier gaussian integrals.
Neel Nanda Jan 22, 2024, 12:08 PM
LW: 12 AF: 9
7
AF
Interesting post, thanks for writing it!

I think that the QK section somewhat under-emphasises the importance of the softmax. My intuition is that models rarely care about as precise a task as counting the number of pairs of matching query-key features at each pair of token positions, and that instead softmax is more of an “argmax-like” function that finds a handful of important token positions (though I have not empirically tested this, and would love to be proven wrong!). This enables much cheaper and more efficient solutions, since you just need the correct answer to be the argmax-ish.

For example, ignoring floating point precision, you can implement a duplicate token head with $d_{h e a d} = 2$ and arbitrarily high $d_{v o c a b}$ . If there are $n$ vocab elements, map the $m$ th query and key to the point $\frac{m}{n}$ of the way round the unit circle. The dot product is maximised when they are equal.

If you further want the head to look at a resting position unless the duplicate token is there, you can increase $d_{h} e a d = 3$ , and have a dedicated BOS dimension with a score of $1 - ϵ$ , so you only get a higher score for a perfect match. And then make the softmax temperature super low so it’s an argmax.
- jake_mendel Jan 24, 2024, 12:54 AM
  LW: 9 AF: 6
  5
  AF Parent
  Thanks for the comment!
  In more detail:
  In our discussion of softmax (buried in part 1 of section 4), we argue that our story makes the most sense precisely when the temperature is very low, in which case we only attend to the key(s) that satisfy the most skip feature-bigrams. Also, when features are very sparse, the number of skip feature bigrams present in one query-key pair is almost always 0 or 1, and we aren’t trying to super precisely track whether its, say, 34 or 35.
  I agree that if softmax is just being an argmax, then one implication is that we don’t need error terms to be $o (1)$ , instead, they can just be somewhat less than 1. However, at least in our general framework, this doesn’t help us beyond changing the log factor in the tilde inside $~ Θ (d_{head} d_{resid}$ ). There still will be some log factor because we require the average error to be $o (1)$ to prevent the worst-case error being greater than 1. Also, we may want to be able to accept ‘ties’ in which a small number $> 1$ of token positions are attended to together. To achieve this (assuming that at most one SFB is present for each QK pair for simplicity) we’d want the variation in the values which should be 1 to be much smaller than the gap between the smallest value which should be 1 and the largest value which should be 0.
  A few comments about your toy example:
  To tell a general story, I’d like to replace the word ‘token’ with ‘feature’ in your construction. In particular, I might want to express what the attention head does using the same features as the MLP. The choice of using tokens in your example is special, because the set of features {this is token 1, this is token 2, …} are mutually exclusive, but once I allow for the possibility that multiple features can be present (for example if I want to talk in terms of features involved in MLP computation), your construction breaks. To avoid this problem, I want the maximum dot product between f-vectors to be at most 1/(the maximum number of features that can be present at once). If I allow several features to be present at once, this starts to look like an $ϵ$ -orthogonal basis again. I guess you could imagine a case where the residual stream is divided into subspaces, and inside each subspace is a set of mutually exclusive features (à la tegum products of TMS). In your picture, there would need to be a 2d subspace allocated to the ‘which token’ features anyway. This tegum geometry would have to be specifically learned — these orthogonal subspaces do not happen generically, and we don’t see a good reason to think that they are likely to be learned by default for reasons not to do with the attention head that uses them, even in the case that there are these sets of mutually exclusive features.
  It takes us more than 2 dimensions, but in our framework, it is possible to do a similar construction to yours in $O (log (m))$ dimensions assuming $m$ random token vectors (ie without the need for any specific learned structure in the embeddings for this task): simply replace the rescaled projection matrix $R = \sqrt{\frac{d_{resid}}{d_{head}}} P_{d_{head}}$ with $R^{'} = \sqrt{\frac{d_{resid}}{n}} P_{n}$ where $n$ is $O (log (m))$ and $P_{n}$ is a projection matrix to a $n$ -dimensional subspace. Now, with high probability, each vector has a larger dot product with its own projection than another vector’s projection (we need $n$ to be this large to ensure that projected vectors all have a similar length). Then use the same construction as in our post, and turn the softmax temperature down to zero.
  - jake_mendel Jan 29, 2024, 8:42 AM
    LW: 8 AF: 5
    1
    AF Parent
    Someone suggested this comment was inscrutable so here’s a summary:
    
    I don’t think that how argmax-y softmax is being is a crux between us—we think our picture makes the most sense when softmax acts like argmax or top-k so we hope you’re right that softmax is argmax-ish. Instead, I think the property that enables your efficient solution is that the set of features ‘this token is token (i)’ is mutually exclusive, ie. only one of these features can activate on an input at once. That means that in your example you don’t have to worry about how to recover feature values when multiple features are present at once. For more general tasks implemented by an attention head, we do need to worry about what happens when multiple features are present at the same time, and then we need the f-vectors to form a nearly orthogonal basis and your construction becomes a special case of ours I think.
Lucius Bushnaq Feb 6, 2024, 12:54 PM
4 points
0
Having digested this a bit more, I’ve got a question regarding the noise terms, particularly for section 1.3 that deals with constructing general programs over sparse superposed variables.
Unfortunately, since the ${\to f}_{1}, \dots, {\to f}_{m}$ are random vectors, their inner product will have a typical size of $1 / \sqrt{d_{0}}$ . So, on an input which has no features connected to neuron $i$ , the preactivation for that neuron will not be zero: it will be a sum of these interference terms, one for each feature that is connected to the neuron. Since the interference terms are uncorrelated and mean zero, they start to cause neurons to fire incorrectly when $Θ (d_{0})$ neurons are connected to each neuron. Since each feature is connected to each neuron with probability $p = \frac{{log}^{2} d_{0}}{\sqrt{d}})$ this means neurons start to misfire when $m = ~ Θ (d_{0} \sqrt{d})$ ^[13].
It seems to me that the assumption of uncorrelated errors here is rather load-bearing. If you don’t get uncorrelated errors over the inputs you actually care about, you are forced to scale back to connecting only $\sqrt{d_{0}}$ features to every neuron, correct? And the same holds for the construction right after this one, and probably most of the other constructions shown here?
And if you only get $\sqrt{d_{0}}$ connected features per neuron, you scale back to only being able to compute $| E | = ~ Θ (\sqrt{d_{0}} d)$ arbitrary AND gates per layer, correct?
Now, the reason these errors are ‘uncorrelated’ is that the features were embedded as random vectors in our layer space. In other words, the distributions over which they are uncorrelated is the distribution of feature embeddings and sets of neurons chosen to connect to particular features. So for any given network, we draw from this distribution only once, when the weights of the network are set, and then we are locked into it.
So this noise will affect particular sets of inputs strongly, systematically, in the same direction every time. If I divide the set of features into two sets, where features in each half are embedded along directions that have a positive inner product with each other^[1], I can’t connect more than $\sqrt{d_{0}}$ from the same half to the same neuron without making it misfire, right? So if I want to implement a layer that performs $| E | = ~ Θ (d_{0} d)$ ANDs on exactly those features that happen to be embedded within the same set, I can’t really do that. Now, for any given embedding, that’s maybe only some particular sets of features which might not have much significance to each other. But then the embedding directions of features in later layers depend on what was computed and how in the earlier layers, and the limitations on what I can wire together apply every time.
I am a bit worried that this and similar assumptions about stochasticity here might turn out to prevent you from wiring together the features you need to construct arbitrary programs in superposition, with ‘noise’ from multiple layers turning out to systematically interact in exactly such a way as to prevent you from computing too much general stuff. Not because I see a gears-level way this could happen right now, but because I think rounding off things to ‘noise’ that are actually systematic is one of these ways an exciting new theory can often go wrong and see a structure that isn’t there, because you are not tracking the parts of the system that you have labeled noise and seeing how the systematics of their interactions constrain the rest of the system.
Like making what seems like a blueprint for perpetual motion machine because you’re neglecting to model some small interactions with the environment that seem like they ought not to affect the energy balance on average, missing how the energy losses/gains in these interactions are correlated with each other such that a gain at one step immediately implies a loss in another.
Aside from looking at error propagation more, maybe a way to resolve this might be to switch over to thinking about one particular set of weights instead of reasoning about the distribution the weights are drawn from?
1. ^
  E.g. pick some hyperplanes and declare everything on one side of all of them to be the first set.
- Lucius Bushnaq Feb 9, 2024, 2:16 PM
  3 points
  0
  Parent
  Thinking the example through a bit further: In a ReLU layer, features are all confined to the positive quadrant. So superposed features computed in a ReLU layer all have positive inner product. So if I send the output of one ReLU layer implementing $n^{2}$ AND gates in superposition directly to another ReLU layer implementing another $n^{2}$ ANDs on a subset of the outputs of that previous layer^[1], the assumption that input directions are equally likely to have positive and negative inner products is not satisfied.
  Maybe you can fix this with bias setoffs somehow? Not sure at the moment. But as currently written, it doesn’t seem like I can use the outputs of one layer performing a subset of ANDs as the inputs of another layer performing another subset of ANDs.
  
  EDIT: Talked it through with Jake. Bias setoff can help, but it currently looks to us like you still end up with AND gates that share a variable systematically having positive sign in their inner product. Which might make it difficult to implement a valid general recipe for multi-step computation if you try to work out the details.
  1. ^
    A very central use case for a superposed boolean general computer. Otherwise you don’t actually get to implement any serial computation.
Max Hennick Jan 19, 2024, 11:21 AM
3 points
1
Really like this post. I had actually come to a very basic version of some of thus but not nearly as in-depth. Essentially it was realizing the Relu-Activated MLPs could be modeled as c-semirings and thus an information algebra. Some of the very basic concepts of superposition seemed to fall out of that. Would love to relate the work you guys have done to that, as your framework seems to answer some very deep questions.
keith_wynroe Jan 19, 2024, 7:39 PM
2 points
0
This looks really cool! Haven’t digested it all yet but I’m especially interested in the QK superposition as I’m working on something similar. I’m wondering what your thoughts are on the number of bigrams being represented by a QK circuit not being bounded by interference but by its interaction with the OV circuit. IIUC it looks like a head can store a surprising number of d_resid bigrams, but since the OV circuit is only a function of the key, then having the same key feature be in a clique with a large number of different query features means the OV-circuit will be unable to differentially copy information based on which bigram is present. I don’t think this has been explored outside of toy models from Anthropic though
Jerry Ding Nov 4, 2024, 7:02 AM
1 point
0
It’s really interesting that (so far it seems) the quadratic activation can achieve the universal AND almost exponentially more efficiently than the ReLU function.
It seems plausible to me that the ReLU activation can achieve the same effect to approximate a quadratic function in a piecewise way. From the construction it seems that each component of the $r^{2}$ space is the sum of at most 2 random variables, and it seems like when you add a sparse combination of a large number of nearly orthogonal vectors, each component of the output would be approximately normally distributed. So each component could be pretty easily bounded at high probability, and a quadratic function can be approximated within the bounds as a piecewise linear function. I’m not sure how the math on the error bounds works for this, but it sounds plausible to me that the error using the piecewise approximation is low enough to accurately calculate the boolean AND function.
Also I wonder if it’s possible to find experimental evidence from trained neural networks that it is using ReLU to implement the $x^{2}$ function in a piecewise way like this? Basically, after training a ReLU network to compute the AND of all pairs of inputs, we check whether the network contains repeated implementations of a circuit that flip the sign of negative inputs and increases the output at a higher slope when the input is of large enough magnitude. Though the duplication of inputs before the ReLU and the recombination would be linear operations that get mixed into the linear embedding and linear readout matrices, making it confusing to find which entries correspond to which… Maybe it can be done by testing individual boolean variables at a time. If a trained network is doing this, then it would be evidence that this is how ReLU networks achieve exponential size circuits.
Daniel Tan Apr 27, 2024, 3:00 PM
1 point
0
This work is very exciting to me, and I’m curious to hear the authors’ thoughts on whether we could verify specific predictions made by this model in real models.
- For example, the proposed U-AND operator—do we expect this to occur in real LLMs, and could we try to find evidence of this by applying mech interp to carefully-chosen toy models?
I have a more detailed write-up on model organisms of superposition here: https://docs.google.com/document/d/1hwI30HNNB2MkOrtEzo7hppG9X7Cn7Xm9a-1LBqcttWc/edit?usp=sharing
Would love to discuss this more!
Review Bot Mar 1, 2024, 12:48 AM
1 point
0
The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.
Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

Toward A Mathematical Framework for Computation in Superposition

What kind of document is this?

1 minute summary

10 minute summary

Structure of the Post

Notation and Conventions

1 The Universal AND

1.1 The U-AND task

1.2 The U-AND construction

U-AND allows for arbitrary XORs to be efficiently calculated

1.3 Handling inputs in superposition: sparse boolean computers

Construction details for inputs in superposition

A construction which computes a subset of ANDs of inputs in superposition

1.4 ANDs with many inputs: computation of small boolean circuits in a single layer

The U-AND construction for higher fan-in

A clever choice of density can give us all the fan-ins at once

Extending the targeted superpositional AND to other fan-ins

1.5 Improved Efficiency with a Quadratic Nonlinearity

Construction Details

Scaling and comparison with ReLU activations

Generalizations

1.6 Universal Keys: an application of parallel boolean computation

2 U-AND: discussion

Aren’t the ANDs already kinda linearly represented in the U-AND input?

Aren’t the ANDs already kinda nonlinearly represented in the U-AND input?

Does our U-AND construction really demonstrate MLP superposition?

Is universal calculation generic?

What are the implications of using ϵ-accuracy? How does this compare to behavior found by minimizing some loss function?

3 The QK part of an attention head can check for many skip feature-bigrams, in superposition

3.1 The skip feature-bigram checking task

3.2 A superposition-free algorithm

3.3 Checking for a structured set of skip feature-bigrams with activation superposition

3.4 Checking for a smaller unstructured set of feature pairs in superposition

3.5 Copy-checker heads and structure-exploiting algorithms

Generalization as a limit of memorization

4 QK: discussion

Where does softmax fit in?

Unknown unknowns

Does our QK construction really demonstrate superposition?

4 How relevant are our results to real models?

5 Open directions /​ what we’re thinking about now

The OV circuit

Specifying concrete use cases

Genericity questions

Reverse-engineering

Understanding errors

Clarifying the model of computation

Potential reframings

How applicable are our setups to the real world?

Acknowledgments

Attributions

Appendix: a note on linear readoffs, linear combinations, and almost orthogonality

The structure of activation vectors

Linear readability and its consequences

Linear readability and linear relations

A bound on the number of linearly readable features

Appendix to the appendix

4 How relevant are our results to real models?

What are the implications of using $ϵ$ -accuracy? How does this compare to behavior found by minimizing some loss function?

5 Open directions / what we’re thinking about now