Circuits in Superposition 2: Now with Less Wrong Math
Summary & Motivation
This post is a continuation and clarification of Circuits in Superposition: Compressing many small neural networks into one. That post presented a sketch of a general mathematical framework for compressing different circuits into a network in superposition. On closer inspection, some of it turned out to be wrong, though: the error propagation calculations for networks with multiple layers were incorrect. With the framework used in that post, the errors blow up too much over multiple layers.
This post presents a slightly changed construction that fixes those problems, and improves on the original in some other ways as well.[1]
By computation in superposition we mean that a network represents features in superposition and performs more computations with them than it has neurons, across multiple layers. Having better models of this is important for understanding how and even if networks use superposition, which in turn is important for mech-interp in general.
Performing computation in superposition over multiple layers introduces additional noise compared to just storing features in superposition[2]. This restricts the amount and type of computation that can be implemented in a network of a given size, because the noise needs to be reduced or suppressed to stay smaller than the signal.
Takeaways
Our setup in this post (see the Construction section for details) is as follows:
We have T small circuits, each of which can be described as a d-dimensional multilayer perceptron (MLP) with L layers.
We have one large D-dimensional MLP with L layers, where D>d , but D<Td. So we can’t just dedicate d neurons in the large MLP to each circuit.
We embed all T circuits into the large network, such that the network approximately implements the computations of every circuit, conditional on no more than z≪T circuits being used on any given forward pass.
The number of circuits we can fit in scales linearly with the number of network parameters
Similar to the previous post, we end up concluding that the total number of parameters in the circuits must be smaller than the number of parameters in the large network. This result makes a lot of intuitive sense, since the parameters determine the maximum amount of information a network can possibly store.[3]
More specifically, we find that the term √(zTd²/D²) needs to be smaller than 1 for the errors on the computations of individual circuits to stay smaller than the signal.
Here T is the total number of circuits, d is the width of each small circuit, D is the layer width of the large network and z is the number of circuits active on a given forward pass.
This gives us an approximate upper bound on the maximum number of d-dimensional circuits we can fit into a network with D neurons per layer:
Tmax=O((1/z)·(D²/d²))(0.1)
If you only remember one formula from this post, let it be that one.
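For concreteness, with made-up numbers: a network with D=1000 neurons per layer, hosting d=10-dimensional circuits with z=10 of them active at a time, could fit on the order of Tmax≈D²/(zd²)=10⁶/(10·100)=1000 circuits, roughly ten times more than the D/d=100 circuits it could hold by dedicating d neurons to each.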
This is much smaller than the number of d-dimensional features we can store in superposition in a layer of width D, if we don’t intend to use them for calculations within the network. That number is[4]
Tmax storage=O((1/d)·e^(D/(8zd)))(0.2)
So, while storage capacity scales exponentially with D, capacity for computation only scales quadratically.
Each circuit will only use a small subset of neurons in the larger network
For this construction to work, each circuit can only use a small fraction of the large network’s neurons.
I, Linda, expect this to be true more generally. I think basically any construction that achieves computation in superposition, across multiple layers, in the sense we mean here will (approximately) have this property. My reasons for thinking this are pretty entangled with details of the error propagation math, so I’ve relegated them to the Discussion section.
Implications for experiments on computation in superposition
The leading source of error in this construction is signals from the active circuits (used on the forward pass) bleeding into currently inactive circuits that shouldn’t be doing anything. This bleed-over then enters back into the active circuits as noise in the next layer.
This means that the biggest errors don’t appear until layer 2[5]. This is important to keep in mind for experimental investigations of computation in superposition, e.g. when training toy models. If your network only has one computational layer, then it doesn’t have to implement a way to reduce this noise.[6]
Reality really does have a surprising amount of detail
To make sure the math was actually correct this time around, Linda coded up a little model implementing some circuits in superposition by hand.
Naturally, while doing this, she found that there were still a bunch of fiddly details left to figure out before circuits in superposition actually work in real life, even in a pretty simple example. The math makes a bunch of vague assumptions about the circuits that turn out to be important once you actually get down to making things work in practice.
The math presented in this post won’t deal with those fiddly details. It is intended to be a relatively simple, high-level description of a general framework. E.g., we assume that individual circuits have some level of noise robustness around the values representing ‘inactive’ for that circuit, without worrying about how that robustness is achieved.
So, in actual practice, the details of this construction may need some adjustment depending on how exactly individual circuits implement their noise robustness, and whether any of them are doing similar things.[7]
A post with the hand coded model and the fiddly details should be coming out “SoonTM”.
Construction
The construction in this post has some significant differences from the previous one.
To simplify things, the networks here don’t have a residual stream, they’re just deep MLPs.[8] We have one large MLP with L layers, neuron dimension D, activation vectors Al∈RD, and weight matrices Wl∈RD×D.
Al=ReLUD(WlAl−1)forl≥1(1.1)
We also have T circuits, indexed by 0,…,T−1, each described by a small MLP with L layers, neuron dimension d, activation vectors alt∈Rd, and weight matrices wlt∈Rd×d.
alt=ReLUd(wltal−1t)forl≥1,t=0,…,T−1(1.2)
Our goal is to figure out a construction for the weight matrices Wl, which embeds the circuits into the network, such that the outputs of each circuit can be read-out from the final output of the network AL with linear projections, up to some small error.[9]
Assumptions
For this construction to work as intended, we need to assume that:
Only z≪D/d circuits can be active on any given forward pass.
Small circuits are robust to noise when inactive. I.e. a small deviation to the activation value of an inactive circuit applied in layer l will not change the activation value of that circuit in layer l+1.[10]
If a circuit is inactive, all of its neurons have activation value zero. I.e. alt=0 if circuit t is inactive.
The entries of the weight matrices wlt for different circuits in the same layer are uncorrelated with each other.
Assumption 1 is just the standard sparsity condition for superposition.
Assumption 2 is necessary, but if it is not true for some of the circuits we want to implement, we can make it true by modifying them slightly, in a way that doesn’t change their functionality.[11] How this works will not be covered in this post though.
Assumptions 3 and 4 are not actually necessary for something similar to this construction to work, but without them the construction becomes more complicated. The details of this are also beyond the scope of this post.
Embedding the circuits into the network
The important takeaways from this section are Equations (1.11) and (1.13)-(1.14), which we will make use of in the Error Calculation section. If you’re happy with these and don’t think they require further motivation, then you don’t need to read the rest of this section.
Remember that the circuit weights wlt and their activation vectors alt are handed to us and we can’t change them, but we are free to choose the weights Wl of the network to be whatever we like. We also assume that we get to choose how to linearly embed the input vectors of the circuits a0t into the input vector of the network A0 at the start.
To help with the embedding we will introduce:
Embedding matrices Elt∈RD×d for each circuit t in each layer l≥1
Unembedding matrices Ult∈Rd×D for each circuit t in each layer l≥1.
Our goal is to calculate Equation (1.2) using the network, which (due to our choice of Ult and Elt, see next section) can be re-expressed as
alt=ReLUd(wltal−1t)=UltReLUD(Eltwltal−1t)forl≥1(1.3)
We approximate this as
alt≈UltAlforl≥1(1.4)
and
Al≈ReLUD(∑tEltwltal−1t)forl≥1(1.5)
If we combine Equations (1.4) and (1.5) while pretending[12] they are exact relations, we get
Al=ReLUD(∑tEltwltUl−1tAl−1)forl≥2(1.6)
If we combine that with Equation (1.1) the network weights Wl for l≥2 are now defined via the embedding and unembedding matrices as
Wl=∑tEltwltUl−1tforl≥2(1.7)
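For concreteness, here is a hypothetical numpy sketch of how the network weights from Equation (1.7) could be assembled for a single layer. The random disjoint-row embeddings here are only stand-ins for the actual Elt and Ult construction described in the next subsection, and all sizes are made up:
import numpy as np

rng = np.random.default_rng(0)
T, d, D, S = 50, 4, 40, 2

def random_embedding():
    # Stand-in for a real E_t^l: a D x d matrix with S ones per column,
    # columns non-overlapping within the circuit (requirement 1 below).
    rows = rng.choice(D, size=S * d, replace=False).reshape(d, S)
    E = np.zeros((D, d))
    for i in range(d):
        E[rows[i], i] = 1.0
    return E

w = rng.normal(size=(T, d, d))                           # small-circuit weights w_t^l for one layer
E_this = [random_embedding() for _ in range(T)]          # E_t^l
U_prev = [random_embedding().T / S for _ in range(T)]    # U_t^{l-1} = (E_t^{l-1})^T / S, Equation (1.10)
W = sum(E_this[t] @ w[t] @ U_prev[t] for t in range(T))  # Equation (1.7)
print(W.shape)                                           # (D, D)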
You might notice that this leaves W1 undefined, and that there are no embedding and unembedding matrices for l=0. That’s because layer zero is a bit special.
Layer 0
There are no embedding and unembedding matrices for layer 0, because we can just skip ahead and use our free choice of how to linearly embed a0t into A0 to implement the first matrix multiplications in each circuit, w1t, without any interference errors.
We choose
W1=I and A0=∑tE1tw1ta0t(1.8)
which gives us:
A1=ReLUD(∑tE1tw1ta0t)(1.9)
I.e, (1.4) is exactly true in the first layer. Down in the Error Calculation section, this will have consequences for which layer each error term first shows up in.
Maybe you think this is sort of cheating. Perhaps it is, but a model can train to cheat like this as well. That’s part of the point we want to make in this post: Having more than one layer makes a difference. From layer 2 onward, this kind of thing doesn’t work anymore. We no longer get to assume that every feature always starts embedded exactly where we want it. We have to actually compute on the features in a way that leaves every intermediate result embedded such that later layers can effectively compute on them in turn.
Constructing the Embedding and Unembedding matrices
In this subsection, we first lay out some requirements for the embedding matrices Elt and unembedding matrices Ult, then describe an algorithm for constructing them to fulfil those requirements.
The main takeaway here is that we can construct Elt and Ult such that they have the properties described in the final subsection. If you are happy to just trust us on this, then you don’t need to read the rest of this section.
Remember that each neuron in the large network will be used in the embeddings of multiple small circuit neurons. This is inevitable because the total number of neurons across all small circuits is larger than the number of neurons in the large network, Td>D, as per our setup.
Requirements
Neurons from the same circuit should be embedded into non-overlapping sets of neurons in the large network. We want this because neurons in the same circuit will tend to coactivate a lot, so reducing interference between them is especially important.
Neurons from different small circuits should be embedded into sets of neurons in the large network that have at most overlap one. So, no pair of circuit neurons shares more than one network neuron. This ensures that there is a hard bound on how bad errors from interference between different circuits can get.
Elt and Ult should only contain non-negative values. This ensures that the embedding does not change the sign of any circuit neuron pre-activations, so that the ReLUs still work correctly.
The embedding should distribute circuit neurons approximately evenly over the network neurons. Otherwise we’re just wasting space.
Our construction will satisfy these requirements and redundantly embed each circuit neuron into a different subset of S>1 network neurons. We call S the embedding redundancy.
Generally speaking, larger values of S are good, because they make each circuit more robust to worst-case errors from interference with other circuits. However, our requirement 2 creates an upper bound on S, because the more network neurons we assign to each circuit neuron, the harder it becomes to ensure that no two circuit neurons share more than one network neuron.[13]
Our allocation scheme is based on prime number factorisation. We promise it’s not as complicated as that description may make it sound. The details of how it works are not important for understanding the rest of this post though, so feel free to skip it if you’re not interested.
Step 1
First, we split the network neurons (for each layer) into d sets of D/d neurons[14]. The first neuron of each circuit will be embedded into the first of these sets, the second neuron of each circuit will be embedded into the second set, and so on, with the d-th neuron of each circuit embedded into the d-th set of network neurons.
So next, we need to describe how to embed each set of T circuit neurons into a set of D/d network neurons.
Step 2
The following algorithm allocates S>1 out of the D/d large neurons to each small circuit, while ensuring that any two such allocations overlap in at most one neuron, and that the allocations are fairly evenly spread out over the D/d neurons.
The allocation algorithm:
Each small circuit neuron is assigned to S large network neurons that are spaced step neurons apart. The set of possible_steps is chosen such that no two different neuron_allocations will overlap more than once. The set of possible_steps is based on prime factors.
First, we define a set of possible_steps[15] and a function that draws a new instance from that set:
PS:={p∈N | smallest prime factor of p is larger than S}(a.1)
possible_steps:={step=pSn | p∈PS; n∈N0; pSn(S−1)≤D/d}(a.2)
Here, N denotes the natural numbers, N={1,2,3…}, and N0 denotes the natural numbers including zero, N0={0,1,2,3…}.
Next we choose one step from possible_steps. We use this to generate approximately D/(Sd) non-overlapping neuron_allocations, where each neuron_allocation consists of S neurons, each of which is step neurons away from the previous one.
When we can’t generate any more non-overlapping neuron_allocations from our current step, we draw a new step from possible_steps. Repeat until we have T neuron allocations, i.e., one for each small circuit.
Pseudo code:
def allocate_neurons(T, S, n_neurons, next_step):
    # n_neurons = D/d, the size of one set of network neurons from Step 1.
    # next_step: a function that returns a new element of possible_steps each time it is called.
    neuron_allocation = {}
    step = next_step()
    shift = 0
    start = shift
    current_small_circuit = 0
    while current_small_circuit < T:
        # S neurons spaced `step` apart
        neuron_allocation[current_small_circuit] = [start + i * step for i in range(S)]
        start += S * step
        if start + (S - 1) * step >= n_neurons:  # the next allocation would run off the end
            shift += 1
            if shift == step:  # all shifts for this step are used up
                step = next_step()
                shift = 0
            start = shift
        current_small_circuit += 1
    return neuron_allocation
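As a quick sanity check of requirement 2, here is a hypothetical usage of the sketch above. The values, and the listed elements of possible_steps (valid for S=3, D/d=50 with n=0 in (a.2)), are purely illustrative:
import itertools

S, T, n_neurons = 3, 25, 50                # n_neurons = D/d; illustrative values only
steps = iter([5, 7, 11, 13])               # a few elements of possible_steps for S=3, D/d=50
alloc = allocate_neurons(T, S, n_neurons, lambda: next(steps))
for a, b in itertools.combinations(alloc.values(), 2):
    assert len(set(a) & set(b)) <= 1       # no pair of allocations shares more than one neuron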
Why it works:
This will make sure no neuron_allocations share more than one neuron, because[17]
i×stepx≠j×stepy for i,j∈{1,2,…,(S−1)}; stepx,stepy∈possible_steps; stepx≠stepy(a.3)
which means that if two different neuron_allocations with different steps share one neuron, all their other neurons are guaranteed to be different.
Step 3
We construct the embedding matrix Elt from d column vectors, each of which has S non-zero values, based on the allocation from Step 2.
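A minimal sketch of this step, assuming the neuron_allocation dictionary from Step 2 (entries of Elt are 0 or 1, in line with requirement 3):
import numpy as np

def embedding_matrix(t, neuron_allocation, d, D):
    # Column i of E_t places circuit neuron i (with weight 1) into its S allocated
    # network neurons, inside the i-th block of D/d network neurons from Step 1.
    block = D // d
    E = np.zeros((D, d))
    for i in range(d):
        for k in neuron_allocation[t]:      # the S positions allocated to circuit t in Step 2
            E[i * block + k, i] = 1.0
    return E

# e.g. embedding_matrix(0, alloc, d=4, D=200), with the allocation from the usage example above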
Step 4
To reduce noise, it’s important that circuit neurons don’t keep having the same “neighbours” from layer to layer, where “neighbours” are neurons from other circuits that share a network neuron.[18] To ensure this, we just re-shuffle the circuit index t at every layer. I.e. for each layer, we perform a random permutation on what set of S network neurons is allocated to each circuit neuron.
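A minimal sketch of this re-shuffle, with a made-up stand-in for the Step 2 allocations:
import numpy as np

rng = np.random.default_rng(0)
T, L = 25, 4                                         # illustrative values
base_alloc = [[2 * t, 2 * t + 1] for t in range(T)]  # stands in for the Step 2 allocations
alloc_per_layer = [
    [base_alloc[p] for p in rng.permutation(T)]      # circuit t uses alloc_per_layer[l][t] in layer l
    for _ in range(L)
]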
Step 5
The unembedding matrix is now just the transpose of the embedding matrix normalised by the embedding redundancy S:
Ult=1S(Elt)⊤(1.10)
Real python code
If you prefer real code over words + pseudo code, then here’s a Colab Notebook with the full code for creating the embedding and unembedding matrices.
Properties of E and U
From I, II, and (1.10) we get that:
ReLUd(wltal−1t)=UltReLUD(Eltwltal−1t)(1.11)
From III and (1.10) we can derive that, for any vector v∈Rd:
Et≠u[UltEluv]=(dS/D)v(1.12)
Et≠u[|UltEluv|2]=(d/D)|v|2(1.13)
maxt≠u[|UltEluv|]=|v|/S(1.14)
As a reminder, S here is the number of network neurons each circuit neuron is redundantly embedded into.
The derivation for this assumes that all network neurons are used by the same number of small circuits. In general this will not be strictly true, but it will be close enough to the truth.
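As a sanity check, here is a small numpy sketch (with arbitrary toy sizes, and a random allocation rather than the prime-factor one) verifying property (1.11) for a single circuit:
import numpy as np

rng = np.random.default_rng(0)
d, D, S = 4, 60, 3
rows = rng.choice(D, size=S * d, replace=False).reshape(d, S)  # disjoint rows per column (requirement 1)
E = np.zeros((D, d))
for i in range(d):
    E[rows[i], i] = 1.0
U = E.T / S                                    # Equation (1.10)
relu = lambda x: np.maximum(x, 0)
v = rng.normal(size=d)                         # stands in for w_t^l a_t^{l-1}
assert np.allclose(U @ relu(E @ v), relu(v))   # property (1.11)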
Error calculation
In this section we’ll calculate approximately how much error is produced and propagated through each layer. We start by giving a mathematical definition of the total error we want to calculate. Then we split this error term into three parts, depending on the origin of the error. Then we estimate each error term separately; finally, we add them all together.
The main results can be found in the summary subsection at the end.
Defining the error terms
In order to define the errors, we introduce blt, which is the linear read-off of alt that the large network can access as input for the next layer:
b0t:=a0t(2.1)
blt:=UltAlforl≥1(2.2)
Inserting this into Equations (1.6) and (1.9), we get:
blt=UltReLUD(∑uEluwlubl−1u)forl≥1(2.3)
The error term ϵlt is then defined as the discrepancy between the activations of the small networks alt and their linear read-off in the big network blt:
ϵlt:=blt−alt(2.4)
Inserting the definitions for both alt and blt, we find that the discrepancy is 0 at layer 0.
ϵ0t=0(2.5)
For later layers, the error is:
ϵlt=UltReLUD(∑uEluwlu(al−1u+ϵl−1u))−ReLUd(wltal−1t)forl≥1(2.6)
We can use Equation (1.11) to make the second term more similar to the first term.
ϵlt=UltReLUD(∑uEluwlu(al−1u+ϵl−1u))−UltReLUD(Eltwltal−1t)forl≥1(2.7)
In order to simplify this further, we will break up the expression inside the first ReLU.
To do this, we first notice that for a fixed pre-activation, a ReLU acts as multiplication by a diagonal matrix, with diagonal values equal to 1 or 0, depending on the sign of the pre-activation. We use this fact to replace the first ReLUD with the diagonal matrix Rl∈RD×D and the second one with Rlt∈RD×D:
ϵlt=UltRl∑uEluwlu(al−1u+ϵl−1u)−UltRltEltwltal−1tforl≥1(2.8)
(Rl)i,j:=1 if i=j and (∑uEluwlu(al−1u+ϵl−1u))i>0, and 0 otherwise(2.9)
(Rlt)i,j:=1 if i=j and (Eltwltal−1t)i>0, and 0 otherwise(2.10)
Note that Rl depends on the input; all other matrices here are the same for all inputs.
Now, we split the expression for the error in Equation (2.8) into three parts:
1) The embedding overlap error ˚ϵlt is the part of the error that is due to correct activations in active circuits spilling over into other circuits, because we are using an overcomplete basis. I.e., it’s because the embeddings are not completely orthogonal.
˚ϵlt:=UltRl∑u≠tEluwlual−1u(2.11)
This will turn out to be the leading error term for inactive circuits.
2) The propagation error~ϵlt is the part of the error that is caused by propagating errors from the previous layer.
~ϵlt:=UltRl∑uEluwluϵl−1u(2.12)
This will turn out to be the leading error term in active circuits from layer 2 onward, and the largest error overall.
3) The ReLU activation error ¨ϵlt occurs when the ReLUs used by a circuit activate differently than they would if there were no interference, i.e. no noise and no other circuits active:
¨ϵlt:=Ult(Rl−Rlt)Eltwltal−1t(2.13)
This error term will turn out to be basically irrelevant in our construction.
The total error in Equation (2.8) is the sum of these three error components, Equations (2.11)-(2.13):
ϵlt=˚ϵlt+~ϵlt+¨ϵltforl≥1(2.14)
˚ϵlt – The embedding overlap error
The embedding overlap error, defined by Equation (2.11), is the error we get from storing the circuit neurons into the network neurons in superposition.
Calculations:
Looking at Equation (2.11), we first note that we only have to sum over active circuits, since we assumed that al−1u=0 for inactive circuits. Remember that there are T circuits in total, but only z≪T are active at a time:
˚ϵlt:=UltRl∑u≠tEluwlual−1u=UltRl∑u is activeu≠tEluwlual−1u(3.1)
So now we only have to care about the network neurons that are used by active circuits. In general we can’t make any assumptions about whether these neurons are active (i.e. (Rl)i,i=1) or inactive (i.e. (Rl)i,i=0). We’ll therefore go with the most conservative estimate that increases the error the most, which is Rl≈I:
∣∣˚ϵlt∣∣2=∣∣UltRl∑u is activeu≠tEluwlual−1u∣∣2≲∣∣Ult∑u is activeu≠tEluwlual−1u∣∣2(3.2)
To calculate the mean square of this error, we assume that wlual−1u for different circuits u are uncorrelated, and then use Equation (1.13):
E[∣∣˚ϵlt∣∣2]≲E[∣∣Ult∑u is activeu≠tEluwlual−1u∣∣2]=dD∑u is activeu≠tE[∣∣wlual−1u∣∣2](3.3)
The sum over u then has (z−1) or z terms depending on whether circuit t is active or inactive:
E[∣∣˚ϵlt∣∣2]={O((z−1)d/D) if t is active; O(zd/D) if t is inactive}(3.4)
This gives us the typical size of the embedding overlap error:
˚ϵactive=O(√((z−1)d/D))(3.5)
˚ϵinactive=O(√(zd/D))(3.6)
~ϵlt – The propagation error
The propagation error, defined by Equation (2.12), is the largest and most important error overall. This error occurs when we perform computation in superposition over multiple layers, instead of just storing variables in superposition, or performing other types of single-layer superposition operations.
The existence of this error term is related to the fact that we need to embed not just the neurons, but also the weights of the circuits into the network. As opposed to the neuron activations of a circuit, the weights of a circuit don’t go away just because the circuit is turned off. This is why we get a factor √T in this term, where T is the total number of circuits, not just the number of active circuits z≪T. And this is why this error ends up being so large.
Since there are no errors at layer 0, we get ~ϵ1t=0, i.e. the propagation error does not show up until layer 2.
Calculation:
If we were to just conservatively assume that Rl=I for the purpose of this error estimate, as we did with the embedding error, we’d end up with an estimate:
~ϵlt=O(√(Td/D))×previous error(not true!)
Since Td>D, such an error term would quickly overtake the signal. Fortunately, the propagation error is actually much smaller than this, because of how our construction influences Rl.
As a reminder, Rl is a diagonal matrix of ReLU activations, defined as:
(Rl)i,j:=1 if i=j and (∑uEluwlu(al−1u+ϵl−1u))i>0, and 0 otherwise(4.1)
We will estimate the propagation error in Equation (2.12) by breaking it up into two cases: The error on neurons that are only used by inactive circuits, and the error on neurons that are used by at least one active circuit.
Case 1: Neurons used only by inactive circuits. For neurons i that are only used by inactive circuits:
(∑uEluwlu(al−1u+ϵl−1u))i=(∑u is inactiveEluwlu(al−1u+ϵl−1u))i(4.2)
Our assumption 3 at the start of the Construction section was that alu=0 for inactive circuits. Combining this with Equation (1.2), we have:
ReLU(wlual−1u)=0⇒wlual−1u≤0(4.3)
This is where our crucial assumption 2 from the start of the Construction section comes into play. We required the circuits to be noise robust when inactive, meaning that:
ReLU(wlu(al−1u+small noise))=0⇒wlu(al−1u+small noise)≤0(4.4)
So, assuming that the previous error ϵl−1u is sufficiently small compared to the noise tolerance of the circuits, we get (Rl)i,i=0, provided that neuron i is only used by inactive circuits.
Case 2: Neurons that are used by at least one active circuit. Here, we can assume the same conservative estimate on Rl we made use of when calculating the embedding overlap error, i.e. (Rl)i,i=1 for these neurons.
This means that the propagation error can flow from active to active, inactive to active, and active to inactive circuits.
There is also a small amount of error propagation from inactive to inactive circuits, whenever the embedding overlap between two inactive circuits also overlaps with an active circuit. But this flow is very suppressed.
To model this, we start with the approximation of Rl we derived in our two cases above:
(Rl)i,i≈min[1,∑v is active(ElvUlv)i,i](4.5)
The minimum in this expression is very annoying to deal with. So we overestimate the error a tiny bit more by using the approximation:
UltRlElu≈UltElu if either circuit t or u is active, and UltRlElu≈Ult∑v is activeElvUlvElu otherwise(4.6)
As a reminder, the definition of the propagation error ~ϵlt was:
~ϵlt:=UltRl∑uEluwluϵl−1u(2.12)
Inserting our approximation of UltRlElu into this yields:
~ϵlt is active≲wltϵl−1t+Ult∑u≠tEluwluϵl−1u(4.7)
~ϵlt is inactive≲Ult∑u is activeEluwluϵl−1u+Ult∑v is activeElvUlv∑u is inactiveEluwluϵl−1u(4.8)
Using similar calculations as those for the embedding overlap error, we can then estimate the size of these propagation error terms; the resulting leading-order sizes are collected in the summary at the end of this section.
¨ϵlt – The ReLU activation error
For this error term, defined in Equation (2.13), let Δt denote the total interference on circuit t’s pre-activations, with components Δi:=(∑u≠tEluwlual−1u+∑uEluwluϵl−1u)i, i.e. everything in the pre-activation except circuit t’s own contribution (Eltwltal−1t)i.
There are two combinations of (Rl)i,i and (Rlt)i,i that can contribute to ¨ϵlt. These are (Rl)i,i=1, (Rlt)i,i=0, and (Rl)i,i=0, (Rlt)i,i=1.
Case 1: (Rl)i,i=1,(Rlt)i,i=0
This will happen if and only if
Δi>−(Eltwltal−1t)i≥0(5.4)
and in this case the ¨ϵlt contribution term will be:
(Rl−Rlt)i,i(Eltwltal−1t)i=(Eltwltal−1t)i(5.5)
Δt is the source of the error calculated in the previous sections. Notice that in this case the ReLU activation error contribution (Eltwltal−1t)i is smaller and has opposite sign compared to Δt. We can therefore safely assume that it will not increase the overall error.
Case 2: (Rl)i,i=0,(Rlt)i,i=1
This will happen if and only if:
−Δi≥(Eltwltal−1t)i>0(5.6)
In this case, the ¨ϵlt contribution term will be:
(Rl−Rlt)i,i(Eltwltal−1t)i=−(Eltwltal−1t)i(5.7)
So, the ReLU activation error contribution −(Eltwltal−1t)i is still smaller in magnitude than Δt but does have the same sign as Δt.
However, since (Rl)i,i=0, Δt does not contribute to the total error at all. But in our calculations for the other two error terms, we didn’t know the value of (Rl)i,i, so we included this error term anyway.
So in this case, the error term coming from the ReLU activation error is also already more than accounted for.
ϵlt – Adding up all the errors
Let’s see how the three error terms add up, layer by layer:
Layer 0
There are no errors here because nothing has happened yet. See equation (2.5):
ϵ0active=0(6.1)
ϵ0inactive=0(6.2)
Layer 1
Since there was no error in the previous layer, we only get the embedding overlap error. From Equations (3.5) and (3.6), this gives ϵ1active≈˚ϵactive=O(√((z−1)d/D)) and ϵ1inactive≈˚ϵinactive=O(√(zd/D)).
For the same reason as last layer, the inactive error is
ϵ3inactive≈˚ϵinactive=O(√zdD)(6.11)
From here on, it just keeps going like this.
Worst-case errors vs mean square errors
So far, our error calculations have only dealt with mean square errors. However, we also need to briefly address worst-case errors. Those are why we need to have an embedding redundancy S>1 in the construction.
For this, we’re mainly concerned with the embedding overlap error, because that error term comes from just a few sources (the z active circuits), which means its variance may be high. In contrast, the propagation error comes from adding up many smaller contributions (the approximately zS2TdD[19] circuits that have non-zero embedding overlap error in the previous layer), so we expect it to be well behaved, i.e. well described by the previous mean square error calculations.
The worst-case embedding overlap error happens if some circuit is unlucky enough to be an embedding neighbour to all z active circuits. For active circuits, the maximum number of active neighbours is (z-1) since a circuit can’t be its own neighbour.
By Equation (1.14), each active neighbour can contribute at most on the order of 1/S times a typical circuit activation to the read-off, so this worst case is roughly z/S times a typical activation. So, as long as the embedding redundancy S is sufficiently large compared to the number of active circuits z, we should be fine.
Summary:
The main source of error is signal from the active circuits bleeding over into the inactive ones, which then enters back into the active circuits as noise in the next layer.
The noise in the active circuits accumulates from layer to layer. The noise in the inactive circuits does not accumulate.
At layer 0, there are no errors, because nothing has happened yet.
ϵ0active=0(6.14)
ϵ0inactive=0(6.15)
At layer 1 and onward, the leading term for the error on inactive circuits is:
ϵlinactive=O(√(zd/D)) for l≥1(6.16)
At layer 1, the leading term for the error on active circuits is:
ϵ1active=O(√((z−1)d/D))(6.17)
But from layer 2 onward, the leading term for the error on active circuits is:
ϵlactive=O(√((l−1)zTd²/D²)) for l≥2(6.18)
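To connect this to the takeaway formula (0.1): requiring the final-layer error to stay smaller than the O(1) signal means √((L−1)zTd²/D²)≲1, i.e. T≲D²/((L−1)zd²). Up to the depth-dependent factor (L−1), which the big-O in (0.1) absorbs for a fixed number of layers, this is the bound Tmax=O((1/z)·(D²/d²)) stated in the Takeaways.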
Discussion
Noise correction/suppression is necessary
Without any type of noise correction or error suppression, the error on the circuit activations would grow by O(√(Td/D)) per layer.
Td is the total number of neurons per layer for all small networks combined, and D is the number of neurons per layer in the large network. If Td≤D then we might as well encode one feature per neuron, and not bother with superposition. So, we assume Td>D, ideally even Td≫D.
In our construction, the way we suppress errors is to use the flat part of the ReLU, both to clear away noise in inactive circuits, and to prevent noise from moving between inactive circuits. Specifically, we assumed in assumption 2 of our construction that each small circuit is somewhat noise robust, such that any network neuron that is not connected to a currently active circuit will be inactive (i.e. the ReLU pre-activation is ≤0), provided that the error on the circuit activations in the preceding layer is small enough. This means that for the error to propagate to the next layer, it has to pass through a very small fraction of ‘open’ neurons, which is what keeps the error down to a more manageable O(√(zTd²/D²)).
However, we do not in general predict sparse ReLU activations for networks implementing computation in superposition
The above might then seem to predict a very low activation rate for neurons in LLMs and other neural networks, if they are indeed implementing computation in superposition. That’s not what we see in real large networks: e.g. MLP neurons in gpt2 have an activation rate of about 20%, much higher than in our construction.
But this low activation rate is actually just an artefact of the specific setup for computation in superposition we present here. Instead of suppressing noise with the flat part of a single ReLU function, we can also create a flat slope using combinations of multiple active neurons. E.g. a network could combine two ReLU neurons to create the ‘meta activation function’ f(x)=ReLU(x)−ReLU(x−1). This combined function is flat for both x<0 (both neurons are ‘off’) and x>1 (both neurons are ‘on’). We can then embed circuit neurons into different ‘meta neurons’ f instead of embedding them into the raw network neurons.
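A minimal numpy sketch of this ‘meta activation function’, purely for illustration:
import numpy as np

relu = lambda x: np.maximum(x, 0)
f = lambda x: relu(x) - relu(x - 1)            # flat for x < 0 and for x > 1
x = np.array([-2.0, -0.5, 0.3, 0.9, 1.5, 3.0])
print(f(x))                                    # [0.  0.  0.3 0.9 1.  1. ]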
At a glance, this might seem inefficient compared to using the raw ReLUs, but we don’t think it necessarily is. If f is a more suitable activation function for implementing the computation of many of the circuits, those circuits might effectively have a smaller width d under this implementation. The network might even mix and match different ‘meta activation functions’ like this in the same layer to suit different circuits.
But we do tentatively predict that circuits only use small subsets of network neurons
So, while neurons in the network don’t have to activate sparsely, it is necessary that each circuit only uses a small fraction of network neurons for anything like this construction to work. This is because any connection that lets through signal will also let through noise, and at least the neurons[20] used by active circuits must let signal through, or else they won’t be of any use.
Getting around this would require some completely different kind of noise reduction, which seems difficult to do using MLPs alone. Perhaps operations like layer norm and softmax can help with noise reduction somehow; that’s not something we have investigated yet.
Linda: But using few neurons per circuit does just seem like a pretty easy way to reduce noise, so I expect that networks implementing computation in superposition would do this to some extent, whether or not they also use other noise suppression methods. I have some very preliminary experimental evidence that maybe supports this conclusion[21]. More on that in a future post, hopefully.
The linked blogpost has a section called “Computation in Superposition”. However, on closer inspection, this section only presents a model with one layer in superposition. See the section “Implications for experiments on computation in superposition” for why this is insufficient.
This result also seems to hold if the circuits don’t have a uniform width d across the L layers. However, it might not straightforwardly hold if different circuits interact with each other, e.g. if some circuits take the outputs of previous circuits as inputs. We think this makes intuitive sense from an information-theoretic perspective as well. If circuits interact, we have to take those interactions into account in the description length.
By ‘layers’ we mean A1,A2,…AL, as defined down in equation (1.1). So, ‘layer 2’ refers to A2 or any linear readouts of A2. We don’t count A0 in the indexing because no computation has happened yet at that point, and when training toy models of computation in superposition A0 will often not even be explicitly represented.
If we want to study this noise, but don’t want to construct a complicated multi-layer toy model, we can add some noise to the inputs, to simulate the situation we’d be in at a later layer.
If circuits are doing similar things, this lets us reuse some computational machinery between circuits, but it can also make worst-case errors worse if we’re not careful, because errors in different circuits can add systematically instead of randomly.
The results here can be generalised to networks with residual streams, although for this, the embedding has to be done differently from the post linked above, otherwise the error propagation will not be contained.
You might notice that for both assumption 2 and assumption 3 to hold simultaneously, each small network needs a negative bias. We did not include a separate bias term in the construction, and the bias can’t be baked into w (as is often done), because that would require one of the neurons to be a non-zero constant, which contradicts assumption 2.
This is one of several ways that reality turned out to be more complicated than our theory. The good news is that this can be dealt with, in a way that preserves the general result, but that is beyond the scope of this post.
There exists a theoretical upper bound on S: S(S−1)≤(D/d)((D/d)−1)/T. However, the proof[22][23] of this upper bound is not constructive, so there is no guarantee that it can be reached.
Our allocation algorithm falls short of this upper bound. If you think you can devise a construction that gets closer to it, we’d love to hear from you.
which would give us more possible allocations (i.e. larger T for a given S and Dd, or larger possible S for a given T and Dd). But that would result in a more uneven distribution of how much each neuron is used.
The later calculations assume that the allocation is evenly distributed over neurons. Dropping that assumption would make both the calculation harder and the errors larger. Getting to increase S is probably not worth it.
Case: i=j
We know that (a.3) must be true in this case because stepx≠stepy
Case: i≠j
From (a.2) we get that
stepx=pSn;stepy=qSmforp,q∈PS;n,m∈N0(a.4)
We have proved (a.3) in this case if we can prove that
ipSn≠jqSm(a.5)
We also know that (a.5) must be true in this case because iSn will differ from jSm in at least one prime factor that is smaller than S. p and q can’t make up for that difference since, by definition (a.1), they don’t contain prime factors smaller than S.
We initially did not think of this, and only noticed the importance of re-shuffling from layer to layer when implementing this construction in code.
In the error calculation, when calculating how much noise is transported from inactive circuits to active circuits, we assume no correlation between the amount of noise in the inactive circuits and how much they share neurons with active circuits. But the noise depends on how much neuron overlap they had with active circuits in the previous layer. Therefore this assumption will be false if we don’t re-shuffle the neuron allocation from layer to layer.
Not only will our calculation be wrong (this could be solved with more calculations), but the errors will also be much larger, which simply makes for a worse construction.
The probability of two circuits sharing one large network neuron (per small circuit neuron) is S2dD. Given that there are T circuits in total, this gives us S2TdD “neighbour” circuits for each small circuit. Since there are z active circuits, there are approximately zS2TdD active circuit neighbours.
Linda: I trained some toy models of superposition with only one computational layer. This did not result in circuits connecting sparsely to the network’s neurons. Then I trained again with some noise added to the network inputs (in order to simulate the situation in a 2+ layer network doing computation in superposition), to see how the network would learn to filter it. This did result in circuits connecting sparsely to the network’s neurons.
This suggests to me that there is no completely different way to filter superposition noise in an MLP that we haven’t thought of yet. So networks doing computation in superposition would basically be forced to connect circuits to neurons sparsely to deal with the noise, as the math in this post suggests.
However, this experimental investigation is still a work in progress.[24]
There are (D/d)((D/d)−1) pairs of neurons in a set of D/d neurons. Each small circuit is allocated S neurons out of that set, accounting for S(S−1) pairs. No two small circuits can share a pair, which gives us the bound T≤(D/d)((D/d)−1)/(S(S−1)).
I (Linda) first got this bound and proof from ChatGPT (free version). According to ChatGPT it is a “known upper bound (due to the Erdős–Ko–Rado-type results)”.
My general experience is that ChatGPT is very good at finding known theorems (i.e. known to someone else, but not to me) that apply to any math problem I give it.
I also gave this problem to Claude as an experiment (some level of paid version offered by a friend). Claude tried to figure it out itself, but kept getting confused and just produced a lot of nonsense.
These networks were trained on L2 loss, which is probably the wrong loss function for incentivising superposition. When using L2 loss, the network doesn’t care much about separating different circuits. It’s happy to just embed two circuits right on top of each other into the same set of network neurons. I don’t really consider this to be computation in superposition. However, this should not affect the need for the network to prevent noise amplification, which is why I think these results are already some weak evidence for the prediction.
I’ll make a better toy setup and hopefully present the results of these experiments in a future post.
Circuits in Superposition 2: Now with Less Wrong Math
Summary & Motivation
This post is a continuation and clarification of Circuits in Superposition: Compressing many small neural networks into one. That post presented a sketch of a general mathematical framework for compressing different circuits into a network in superposition. On closer inspection, some of it turned out to be wrong, though. The error propagation calculations for networks with multiple layers were incorrect. With the framework used in that post, the errors blow up too much over multiple layers.
This post presents a slightly changed construction that fixes those problems, and improves on the original in some other ways as well.[1]
By computation in superposition we mean that a network represents features in superposition and performs more computations with them than it has neurons, across multiple layers. Having better models of this is important for understanding how and even if networks use superposition, which in turn is important for mech-interp in general.
Performing computation in superposition over multiple layers introduces additional noise compared to just storing features in superposition[2]. This restricts the amount and type of computation that can be implemented in a network of a given size, because the noise needs to be reduced or suppressed to stay smaller than the signal.
Takeaways
Our setup in this post (see the Section Construction for details) is as follows:
We have T small circuits, each of which can be described as a d-dimensional multilayer perceptron (MLP) with L layers.
We have one large D-dimensional MLP with L layers, where D>d , but D<Td. So we can’t just dedicate d neurons in the large MLP to each circuit.
We embed all T circuits into the large network, such that the network approximately implements the computations of every circuit, conditional on no more than z<<T circuits being used on any given forward pass.
The number of circuits we can fit in scales linearly with the number of network parameters
Similar to the previous post, we end up concluding that the total number of parameters in the circuits must be smaller than the number of parameters in the large network. This result makes a lot of intuitive sense, since the parameters determine the maximum amount of information a network can possibly store.[3]
More specifically, we find that the term √zTd2D2 needs to be smaller than 1 for the errors on the computations of individual circuits to stay smaller than the signal.
Here T is the total number of circuits, d is the width of each small circuit, D is the layer width of the large network and z is the number of circuits active on a given forward pass.
This gives us an approximate upper bound on the maximum number of d-dimensional circuits we can fit into a network with D neurons per layer:
Tmax=O(1zD2d2)(0.1)If you only remember one formula from this post, let it be that one.
This is much smaller than the number of d-dimensional features we can store in superposition in a layer of width D, if we don’t intend to use them for calculations within the network. That number is[4]
Tmax storage=O(1deD8zd)(0.2)So, while storage capacity scales exponentially with D, capacity for computation only scales quadratically.
Each circuit will only use a small subset of neurons in the larger network
For this construction to work, each circuit can only be using a small fraction of the large network’s neurons.
I, Linda, expect this to be true more generally. I think basically any construction that achieves computation in superposition, across multiple layers, in the sense we mean here will (approximately) have this property. My reasons for thinking this are pretty entangled with details of the error propagation math, so I’ve relegated them to the Discussion section.
Implications for experiments on computation in superposition
The leading source of error in this construction is signals from the active circuits (used on the forward pass) bleeding into currently inactive circuits that shouldn’t be doing anything. This bleed-over then enters back into the active circuits as noise in the next layer.
This means that the biggest errors don’t appear until layer 2[5]. This is important to keep in mind for experimental investigations of computation in superposition, e.g. when training toy models. If your network only has one computational layer, then it doesn’t have to implement a way to reduce this noise.[6]
Reality really does have a surprising amount of detail
To make sure the math was actually really correct this time around, Linda coded up a little model implementing some circuits in superposition by hand.
Naturally, while doing this, she found that there were still a bunch of fiddly details left to figure out how to make circuits in superposition actually work in real life, even on a pretty simple example, because the math makes a bunch of vague assumptions about the circuits that turn out to be important when you actually get down to making things work in practice.
The math presented in this post won’t deal with those fiddly details. It is intended to be a relatively simple high level description of a general framework. E.g, we assume that individual circuits have some level of noise robustness around values representing ‘inactive’ for that circuit, without worrying about how it’s achieved.
So, in actual practice, the details of this construction may need some adjustment depending on how exactly individual circuits implement their noise robustness, and whether any of them are doing similar things.[7]
A post with the hand coded model and the fiddly details should be coming out “SoonTM”.
Construction
The construction in this post has some significant differences from the previous one.
To simplify things, the networks here don’t have a residual stream, they’re just deep MLPs.[8] We have one large MLP with L layers, neuron dimension D, activation vectors Al∈RD, and weight matrices Wl∈RD×D.
Al=ReLUD(WlAl−1)forl≥1(1.1)We also have T circuits, indexed by 0,…,T−1, each described by a small MLP with L layers, neuron dimension d, activations vectors alt∈Rd, and weight matrices wlt∈Rd×d.
alt=ReLUd(wltal−1t)forl≥1,t=0,…,T−1(1.2)Our goal is to figure out a construction for the weight matrices Wl, which embeds the circuits into the network, such that the outputs of each circuit can be read-out from the final output of the network AL with linear projections, up to some small error.[9]
Assumptions
For this construction to work as intended, we need to assume that:
Only z≪Dd circuits can be active on any given forward pass.
Small circuits are robust to noise when inactive. I.e. a small deviation to the activation value of an inactive circuit applied in layer l will not change the activation value of that circuit in layer l+1.[10]
If a circuit is inactive, all of its neurons have activation value zero. I.e. alt=0 if circuit t is inactive.
The entries of the weight matrices wlt for different circuits in the same layer are uncorrelated with each other.
Assumption 1 is just the standard sparsity condition for superposition.
Assumption 2 is necessary, but if it is not true for some of the circuits we want to implement, we can make it true by modifying them slightly, in a way that doesn’t change their functionality.[11] How this works will not be covered in this post though.
Assumptions 3 and 4 are not actually necessary for something similar to this construction to work, but without them the construction becomes more complicated. The details of this are also beyond the the scope of this post.
Embedding the circuits into the network
The important takeaways from this section are Equations (1.11) and (1.13)-(1.14), which we will make use of in the Error Calculation section. If you’re happy with these and don’t think they require further motivation, then you don’t need to read the rest of this section.
Remember that the circuit weights wlt and their activation vectors alt are handed to us and we can’t change them, but we are free to choose the weights Wl of the network to be whatever we like. We also assume that we get to choose how to linearly embed the input vectors of the circuits a0t into the input vector of the network A0 at the start.
To help with the embedding we will introduce:
Embedding matrices Elt∈RD×d for each circuit t in each layer l≥1
Unembedding matrices Ult∈Rd×D for each circuit t in each layer l≥1.
Our goal is to calculate Equation (1.2) using the network, which (due to our choice of Ult and Elt, see next section) can be re-expressed as
alt=ReLUd(wltal−1t)=UltReLUD(Eltwltal−1t)forl≥1(1.3)We approximate this as
alt≈UltAlforl≥1(1.4)and
Al≈ReLUD(∑tEltwltal−1t)forl≥1(1.5)If we combine Equations (1.4) and (1.5) while pretending[12] they are exact relations, we get
Al=ReLUD(∑tEltwltUl−1tAl−1)forl≥2(1.6)If we combine that with Equation (1.1) the network weights Wl for l≥2 are now defined via the embedding and unembedding matrices as
Wl=∑tEltwltUl−1tforl≥2(1.7)You might notice that this leaves W1 undefined, and that there are no embedding and unembedding matrices for l=0. That’s because layer zero is a bit special.
Layer 0
There are no embedding and unembedding matrices for layer 0, because we can just skip ahead and use our free choice of how to linearly embed a0t into A0t to implement the first matrix multiplications in each circuit w1t without any interference errors.
We choose
W1=IandA0=∑tE1tw1ta0t(1.8)which gives us:
A1=ReLUD(∑tE1tw1ta0t)(1.9)I.e, (1.4) is exactly true in the first layer. Down in the Error Calculation section, this will have consequences for which layer each error term first shows up in.
Maybe you think this is sort of cheating. Perhaps it is, but a model can train to cheat like this as well. That’s part of the point we want to make in this post: Having more than one layer makes a difference. From layer 2 onward, this kind of thing doesn’t work anymore. We no longer gets to assume that every feature always starts embedded exactly where we want it. We have to actually compute on the features in a way that leaves every intermediary result embedded such that later layers can effectively compute on them in turn.
Constructing the Embedding and Unembedding matrices
In this subsection, we first lay out some requirements for the embedding matrices Elt and unembedding matrices Ult, then describe an algorithm for constructing them to fulfil those requirements.
The main takeaway here is that we can construct Elt and Ult such that they have the properties described in the final subsection. If you are happy to just trust us on this, then you don’t need to read the rest of this section.
Remember that each neuron in the large network will be used in the embeddings of multiple small circuit neurons. This is inevitable because the total numbers of neurons across all small circuits is larger than the number of neurons in the large network, dT>D, as per assumption 1.
Requirements
Neurons from the same circuit should be embedded into non-overlapping sets of neurons in the large network. We want this because neurons in the same circuit will tend to coactivate a lot, so reducing interference between them is especially important.
Neurons from different small circuits should be embedded into sets of neurons in the large network that have at most overlap one. So, no pair of circuit neurons shares more than one network neuron. This ensures that there is a hard bound on how bad errors from interference between different circuits can get.
The embedding should distribute circuit neurons approximately evenly over the network neurons. Otherwise we’re just wasting space.
Our construction will satisfy these requirements and redundantly embed each circuit neuron into a different subset of S>1 network neurons. We call S the embedding redundancy.
Generally speaking, larger values of S are good, because they make each circuit more robust to worst-case errors from interference with other circuits. However, our requirement 2 creates an upper bound on S, because the more network neurons we assign to each circuit neuron, the harder it becomes to ensure that no two circuit neurons share more than one network neuron.[13]
Our allocation scheme is based on prime number factorisation. We promise it’s not as complicated as that description may make it sound. The details of how it works are not important for understanding the rest of this post though, so feel free to skip it if you’re not interested.
Step 1
First, we split the network neurons (for each layer) into d sets of Dd neurons[14]. The first neuron of each circuit will be embedded in the first of these sets, the second neuron of each circuit will be embedded into the second set and so on, with the d-th neuron of each circuit embedded into the d-th set of network neurons.
So next, we need to describe how to embed each set of T circuit neurons into a set of Dd network neurons.
Step 2
The following algorithm allocates S>1 out of Dd large neurons to each small circuit, while ensuring that each pair of circuits shares at most one such allocation, and that the allocations are fairly evenly spread out over the Dd neurons.
The allocation algorithm:
Each small circuit neuron is assigned to S large network neurons that are spaced step neurons apart. The set of possible_steps is chosen such that no two different neuron_allocations will overlap more than once. The set of possible_steps is based on prime factors.
First, we define a set of possible_steps[15] and a function that draws a new instance from that set:
PS:={p∈N | smallest prime factor of p is larger than S}(a.1)possible_steps:={step=pSn | p∈PS ; n∈N0 ; pSn(S−1)≤Dd}(a.2)Here, N denotes the natural numbers, N={1,2,3…}, and N0 denotes the natural numbers including zero, N0={0,1,2,3…}.
Next we chose one step from possible_steps. We use this to generate approximately DSd non-overlapping neuron_allocations, where each neuron_alocation consists of S subsequent neurons where each one is step neurons away from the previous one.
When we can’t generate more non-overlapping neuron_allocations from our first step we draw a new step from possible_steps. Repeat until we have T neuron allocations, i.e, one for each small circuit.
Pseudo code:
Why it works:
This will make sure no neuron_allocations share more than one neuron because[17]
i×stepx≠j×stepyfor⎧⎪⎨⎪⎩i,j∈{1,2,…(S−1)}stepx,stepy∈possible_stepsstepx≠stepy(a.3)which means that if two different neuron_allocations with different steps, share one neuron, all their other neurons are guaranteed to be different.
Step 3
We construct the embedding matrix Elt from d column vectors, each of which has S non-zero values, based on the allocation from Step 2.
Pseudo code:
Step 4
To reduce noise, it’s important that circuit neurons don’t keep having the same “neighbours” from layer to layer, where “neighbours” are neurons from other different circuits that share a network neuron.[18] To ensure this, we just re-shuffle the circuit index t at every layer. I.e. for each layer, we perform a random permutation on what set of S network neurons is allocated to each circuit neuron.
Step 5
The unembedding matrix is now just the transpose of the embedding matrix normalised by the embedding redundancy S:
Ult=1S(Elt)⊤(1.10)Real python code
If you prefer real code over words + pseudo code, then here’s a Colab Notebook with the full code for creating the embedding and unembedding matrices.
Properties of E and U
From I, II, and (1.10) we get that:
ReLUd(wltal−1t)=UltReLUD(Eltwltal−1t)(1.11)From III and (1.10) we can derive that, for any vector v∈Rd:
Et≠u[UltEluv]=dSDv(1.12)Et≠u[|UltEluv|2]=dD|v|2(1.13)maxt≠u[|UltEluv|]=|v|S(1.14)As a reminder, S here is the number of network neurons each circuit neuron is redundantly embedded into.
The derivation for this assumes that all network neurons are used by the same number of small circuits. In general this will not be strictly true, but it will be close enough to the truth.
Error calculation
In this section we’ll calculate approximately how much error is produced and propagated though each layer. We start by giving a mathematical definition of the total error we want to calculate. Then we split this error term into three parts, depending on the origin of the error. Then we estimate each error term separately; finally, we add them all together.
The main results can be found in the summary subsection at the end.
Defining the error terms
In order to define the errors, we introduce blt, which is the linear read-off of alt that the large networks can access as input for the next layer:
b0t:=a0t(2.1)blt:=UltAlforl≥1(2.2)Inserting this into Equations (1.6) and (1.9), we get:
blt=UltReLUD(∑uEluwlubl−1u)forl≥1(2.3)The error term ϵlt is then defined as the discrepancy between the activations of the small networks alt and their linear read-off in the big network blt:
ϵlt:=blt−alt(2.4)Inserting the definitions for both alt and blt, we find that the discrepancy is 0 at the first layer.
ϵ0t=0(2.5)For later layers, the error is:
ϵlt=UltReLUD(∑uEluwlu(al−1u+ϵl−1u))−ReLUd(wltal−1t)forl≥1(2.6)We can use Equation (1.11) to make the second term more similar to the first term.
ϵlt=UltReLUD(∑uEluwlu(al−1u+ϵl−1u))−UltReLUD(Eltwltal−1t) for l≥1(2.7)In order to simplify this further, we will break up the expression inside the first ReLU.
To do this, we first notice that if we are holding pre-activation constant, then a ReLU is just a diagonal matrix, with diagonal values equal to 1 or 0, depending on the sign of the pre-activation. We use this fact to replace the first ReLUD with the diagonal matrix Rl∈RD×D and the second one with Rlt∈RD×D:
ϵlt=UltRl∑uEluwlu(al−1u+ϵl−1u)−UltRltEluwltal−1tforl≥1(2.8)(Rl)i,j:={1ifi=jand(∑uEluwlu(al−1u+ϵl−1u))i>00otherwise(2.9)(Rlt)i,j:={1ifi=jand(Eltwltal−1t)i>00otherwise(2.10)Note that Rl depends on the input; all other matrices here are the same for all inputs.
Now, we split the expression for the error in Equation (2.8) into three parts:
1) The embedding overlap error ˚ϵlt is the part of the error that is due to correct activation in active circuits is spilling over into other circuits, because we are using an overcomplete basis. I.e, it’s because the embeddings are not completely orthogonal.
˚ϵlt:=UltRl∑u≠tEluwlual−1u(2.11)This will turn out to be the leading error term for inactive circuits.
2) The propagation error ~ϵlt is the part of the error that is caused by propagating errors from the previous layer.
~ϵlt:=UltRl∑uEluwluϵl−1u(2.12)This will turn out to be the leading error term in active circuits from layer 2 onward, and the largest error overall.
3) The ReLU activation error ˚ϵlt occurs when the ReLUs used by a circuit activate differently that they would if there were no interference, i.e. no noise and no other circuits active:
¨ϵlt:=Ult(Rl−Rlt)Eltwltal−1t(2.13)This error term will turn out to be basically irrelevant in our construction.
The total error in Equation (2.8) is the sum of these three error components, Equations (2.11)-(2.13):
ϵlt=˚ϵlt+~ϵlt+¨ϵltforl≥1(2.14)˚ϵlt – The embedding overlap error
The embedding overlap error, defined by Equation (2.11), is the error we get from storing the circuit neurons into the network neurons in superposition.
Calculations:
Looking at Equation (2.11), we first note that we only have to sum over active circuits, since we assumed that al−1u=0 for inactive circuits. Remember that there are T circuits in total, but only z≪T are active at a time:
$$\mathring{\epsilon}^l_t := U^l_t R^l \sum_{u \neq t} E^l_u w^l_u a^{l-1}_u = U^l_t R^l \sum_{\substack{u\ \text{active} \\ u \neq t}} E^l_u w^l_u a^{l-1}_u \tag{3.1}$$
So now we only have to care about the network neurons that are used by active circuits. In general we can’t make any assumptions about whether these neurons are active (i.e. $(R^l)_{i,i} = 1$) or inactive (i.e. $(R^l)_{i,i} = 0$). We’ll therefore go with the most conservative estimate, the one that increases the error the most, which is $R^l \approx I$:
$$\left\|\mathring{\epsilon}^l_t\right\|^2 = \left\|U^l_t R^l \sum_{\substack{u\ \text{active} \\ u \neq t}} E^l_u w^l_u a^{l-1}_u\right\|^2 \lesssim \left\|U^l_t \sum_{\substack{u\ \text{active} \\ u \neq t}} E^l_u w^l_u a^{l-1}_u\right\|^2 \tag{3.2}$$
To calculate the mean square of this error, we assume that $w^l_u a^{l-1}_u$ for different circuits $u$ are uncorrelated, and then use Equation (1.13):
$$\mathbb{E}\!\left[\left\|\mathring{\epsilon}^l_t\right\|^2\right] \lesssim \mathbb{E}\!\left[\left\|U^l_t \sum_{\substack{u\ \text{active} \\ u \neq t}} E^l_u w^l_u a^{l-1}_u\right\|^2\right] = \frac{d}{D} \sum_{\substack{u\ \text{active} \\ u \neq t}} \mathbb{E}\!\left[\left\|w^l_u a^{l-1}_u\right\|^2\right] \tag{3.3}$$
The sum over $u$ then has $(z-1)$ or $z$ terms, depending on whether circuit $t$ is active or inactive:
$$\mathbb{E}\!\left[\left\|\mathring{\epsilon}^l_t\right\|^2\right] = \begin{cases} O\!\left(\frac{(z-1)\,d}{D}\right) & \text{if } t \text{ is active} \\ O\!\left(\frac{z\,d}{D}\right) & \text{if } t \text{ is inactive} \end{cases} \tag{3.4}$$
This gives us the typical size of the embedding overlap error:
$$\mathring{\epsilon}_{\text{active}} = O\!\left(\sqrt{\frac{(z-1)\,d}{D}}\right) \tag{3.5}$$
$$\mathring{\epsilon}_{\text{inactive}} = O\!\left(\sqrt{\frac{z\,d}{D}}\right) \tag{3.6}$$
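As a quick numerical sanity check of this scaling (our own sketch, using dense random embeddings rather than the sparse allocation of the actual construction; the $d/D$ overlap statistics behind Equation (1.13) are similar either way):

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, z, trials = 1024, 8, 16, 200

errs = []
for _ in range(trials):
    # Circuit 0 plays the role of t; circuits 1..z are the other active circuits.
    E = rng.standard_normal((z + 1, D, d)) / np.sqrt(D)   # approximately orthonormal columns
    U = E.transpose(0, 2, 1)                              # read-off U_u = E_u^T
    s = rng.standard_normal((z + 1, d))
    s /= np.linalg.norm(s, axis=1, keepdims=True)         # unit-norm signals w_u a_u
    noise = sum(E[u] @ s[u] for u in range(1, z + 1))     # interference hitting circuit 0
    errs.append(np.linalg.norm(U[0] @ noise))

print(np.mean(errs), np.sqrt(z * d / D))  # both come out around 0.35
```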
$\tilde{\epsilon}^l_t$ – The propagation error
The propagation error, defined by Equation (2.12), is the largest and most important error overall. This error occurs when we perform computation in superposition over multiple layers, instead of just storing variables in superposition or performing other types of single-layer superposition operations.
The existence of this error term is related to the fact that we need to embed not just the neurons, but also the weights of the circuits into the network. As opposed to the neuron activations of a circuit, the weights of a circuit don’t go away just because the circuit is turned off. This is why we get a factor of $\sqrt{T}$ in this term, where $T$ is the total number of circuits, not just the number of active circuits $z \ll T$. This is why this error ends up being so large.
Since there are no errors at layer 0, we get $\tilde{\epsilon}^1_t = 0$, i.e. the propagation error does not show up until layer 2.
Calculation:
If we were to just conservatively assume that $R^l = I$ for the purpose of this error estimate, as we did with the embedding error, we’d end up with an estimate:
$$\tilde{\epsilon}^l_t = O\!\left(\sqrt{\frac{Td}{D}}\right) \times \left(\text{previous error}\right) \quad \text{(not true!)}$$
Since $Td > D$, such an error term would quickly overtake the signal. Fortunately, the propagation error is actually much smaller than this, because of how our construction influences $R^l$.
As a reminder, $R^l$ is a diagonal matrix of ReLU activations, defined as:
$$\left(R^l\right)_{i,j} := \begin{cases} 1 & \text{if } i = j \text{ and } \left(\sum_u E^l_u w^l_u \left(a^{l-1}_u + \epsilon^{l-1}_u\right)\right)_i > 0 \\ 0 & \text{otherwise} \end{cases} \tag{4.1}$$
We will estimate the propagation error in Equation (2.12) by breaking it up into two cases: the error on neurons that are only used by inactive circuits, and the error on neurons that are used by at least one active circuit.
Case 1: Neurons used only by inactive circuits. For neurons $i$ that are only used by inactive circuits:
$$\left(\sum_u E^l_u w^l_u \left(a^{l-1}_u + \epsilon^{l-1}_u\right)\right)_i = \left(\sum_{u\ \text{inactive}} E^l_u w^l_u \left(a^{l-1}_u + \epsilon^{l-1}_u\right)\right)_i \tag{4.2}$$
Our assumption 3 at the start of the Construction section was that $a^l_u = 0$ for inactive circuits. Combining this with Equation (1.2), we have:
$$\mathrm{ReLU}\!\left(w^l_u a^{l-1}_u\right) = 0 \;\Rightarrow\; w^l_u a^{l-1}_u \leq 0 \tag{4.3}$$
This is where our crucial assumption 2 from the start of the Construction section comes into play. We required the circuits to be noise robust when inactive, meaning that:
$$\mathrm{ReLU}\!\left(w^l_u \left(a^{l-1}_u + \text{small noise}\right)\right) = 0 \;\Rightarrow\; w^l_u \left(a^{l-1}_u + \text{small noise}\right) \leq 0 \tag{4.4}$$
So, assuming that the previous error $\epsilon^{l-1}_u$ is sufficiently small compared to the noise tolerance of the circuits, we get $(R^l)_{i,i} = 0$, provided that neuron $i$ is only used by inactive circuits.
Case 2: Neurons that are used by at least one active circuit. Here, we make the same conservative estimate on $R^l$ that we used when calculating the embedding overlap error, i.e. $(R^l)_{i,i} = 1$ for these neurons.
This means that the propagation error can flow from active to active, inactive to active, and active to inactive circuits.
There is also a small amount of error propagation from inactive to inactive circuits, whenever the embedding overlap between two inactive circuits also overlaps with an active circuit. But this flow is very suppressed.
To model this, we start with the approximation of $R^l$ we derived in our two cases above:
$$\left(R^l\right)_{i,i} \approx \min\!\left[1,\; \sum_{v\ \text{active}} \left(E^l_v U^l_v\right)_{i,i}\right] \tag{4.5}$$
The minimum in this expression is very annoying to deal with, so we overestimate the error a tiny bit more by using the approximation:
$$U^l_t R^l E^l_u \approx \begin{cases} U^l_t E^l_u & \text{if either circuit } t \text{ or } u \text{ is active} \\ U^l_t \sum_{v\ \text{active}} E^l_v U^l_v E^l_u & \text{otherwise} \end{cases} \tag{4.6}$$
As a reminder, the definition of the propagation error $\tilde{\epsilon}^l_t$ was:
$$\tilde{\epsilon}^l_t := U^l_t R^l \sum_u E^l_u w^l_u \epsilon^{l-1}_u \tag{2.12}$$
Inserting our approximation of $U^l_t R^l E^l_u$ into this yields:
$$\tilde{\epsilon}^l_{t\ \text{active}} \lesssim w^l_t \epsilon^{l-1}_t + U^l_t \sum_{u \neq t} E^l_u w^l_u \epsilon^{l-1}_u \tag{4.7}$$
$$\tilde{\epsilon}^l_{t\ \text{inactive}} \lesssim U^l_t \sum_{u\ \text{active}} E^l_u w^l_u \epsilon^{l-1}_u + U^l_t \sum_{v\ \text{active}} E^l_v U^l_v \sum_{u\ \text{inactive}} E^l_u w^l_u \epsilon^{l-1}_u \tag{4.8}$$
Using similar calculations to those for the embedding overlap error, we get:
$$\tilde{\epsilon}^l_{\text{active}} = O(1)\,\epsilon^{l-1}_{\text{active}} + O\!\left(\sqrt{\frac{Td}{D}}\right)\epsilon^{l-1}_{\text{inactive}} \tag{4.9}$$
$$\tilde{\epsilon}^l_{\text{inactive}} = O\!\left(\sqrt{\frac{zd}{D}}\right)\epsilon^{l-1}_{\text{active}} + O\!\left(\sqrt{\frac{zTd^2}{D^2}}\right)\epsilon^{l-1}_{\text{inactive}} \tag{4.10}$$
$\ddot{\epsilon}^l_t$ – The ReLU activation error
This error term, defined in Equation (2.13), ends up being negligible. Sometimes it might even reduce the overall error a little.
Calculations:
To help us show this, we introduce $\Delta_t$:
$$\Delta_t := \sum_{u \neq t} E^l_u w^l_u a^{l-1}_u + \sum_u E^l_u w^l_u \epsilon^{l-1}_u \tag{5.1}$$
With this, the definition of $R^l$, Equation (2.9), becomes:
$$\left(R^l\right)_{i,j} := \begin{cases} 1 & \text{if } i = j \text{ and } \left(E^l_t w^l_t a^{l-1}_t + \Delta_t\right)_i > 0 \\ 0 & \text{otherwise} \end{cases} \tag{5.2}$$
Remember that the definition of $R^l_t$ is:
$$\left(R^l_t\right)_{i,j} := \begin{cases} 1 & \text{if } i = j \text{ and } \left(E^l_t w^l_t a^{l-1}_t\right)_i > 0 \\ 0 & \text{otherwise} \end{cases} \tag{2.10}$$
$\ddot{\epsilon}^l_t$, Equation (2.13), can thus be written as:
$$\left(\ddot{\epsilon}^l_t\right)_j := \sum_i \left(U^l_t\right)_{j,i} \left(R^l - R^l_t\right)_{i,i} \left(E^l_t w^l_t a^{l-1}_t\right)_i \tag{5.3}$$
There are two combinations of $(R^l)_{i,i}$ and $(R^l_t)_{i,i}$ that can contribute to $\ddot{\epsilon}^l_t$. These are $(R^l)_{i,i} = 1,\ (R^l_t)_{i,i} = 0$, and $(R^l)_{i,i} = 0,\ (R^l_t)_{i,i} = 1$.
Case 1: $(R^l)_{i,i} = 1,\ (R^l_t)_{i,i} = 0$
This will happen if and only if
$$\left(\Delta_t\right)_i > -\left(E^l_t w^l_t a^{l-1}_t\right)_i \geq 0 \tag{5.4}$$
and in this case the $\ddot{\epsilon}^l_t$ contribution term will be:
$$\left(R^l - R^l_t\right)_{i,i} \left(E^l_t w^l_t a^{l-1}_t\right)_i = \left(E^l_t w^l_t a^{l-1}_t\right)_i \tag{5.5}$$
$\Delta_t$ is the source of the error calculated in the previous sections. Notice that in this case the ReLU activation error contribution $\left(E^l_t w^l_t a^{l-1}_t\right)_i$ is smaller in magnitude and has opposite sign compared to $\left(\Delta_t\right)_i$. We can therefore safely assume that it will not increase the overall error.
Case 2: $(R^l)_{i,i} = 0,\ (R^l_t)_{i,i} = 1$
This will happen if and only if:
$$-\left(\Delta_t\right)_i \geq \left(E^l_t w^l_t a^{l-1}_t\right)_i > 0 \tag{5.6}$$
In this case, the $\ddot{\epsilon}^l_t$ contribution term will be:
$$\left(R^l - R^l_t\right)_{i,i} \left(E^l_t w^l_t a^{l-1}_t\right)_i = -\left(E^l_t w^l_t a^{l-1}_t\right)_i \tag{5.7}$$
So, the ReLU activation error contribution $-\left(E^l_t w^l_t a^{l-1}_t\right)_i$ is still smaller in magnitude than $\left(\Delta_t\right)_i$, but does have the same sign as $\left(\Delta_t\right)_i$.
However, since $(R^l)_{i,i} = 0$, $\left(\Delta_t\right)_i$ does not contribute to the total error at all. But in our calculations for the other two error terms, we didn’t know the value of $(R^l)_{i,i}$, so we included this error term anyway.
So in this case, the error term coming from the ReLU activation error is also already more than accounted for.
$\epsilon^l_t$ – Adding up all the errors
Let’s see how the three error terms add up, layer by layer:
Layer 0
There are no errors here because nothing has happened yet. See equation (2.5):
$$\epsilon^0_{\text{active}} = 0 \tag{6.1}$$
$$\epsilon^0_{\text{inactive}} = 0 \tag{6.2}$$
Layer 1
Since there was no error in the previous layer, we only get the embedding overlap error. From Equations (3.5) and (3.6):
$$\epsilon^1_{\text{active}} = \mathring{\epsilon}_{\text{active}} = O\!\left(\sqrt{\frac{(z-1)\,d}{D}}\right) \tag{6.3}$$
$$\epsilon^1_{\text{inactive}} = \mathring{\epsilon}_{\text{inactive}} = O\!\left(\sqrt{\frac{z\,d}{D}}\right) \tag{6.4}$$
Layer 2
This is the first layer where we get both the embedding overlap error (3.5)-(3.6) and the propagation error (4.9)-(4.10):
$$\epsilon^2_{\text{active}} = \mathring{\epsilon}_{\text{active}} + \tilde{\epsilon}^2_{\text{active}} = O\!\left(\sqrt{\frac{(z-1)\,d}{D}}\right) + O(1)\,\epsilon^1_{\text{active}} + O\!\left(\sqrt{\frac{Td}{D}}\right)\epsilon^1_{\text{inactive}} \tag{6.5}$$
The leading term is the last term, i.e. the propagation error flowing in from the inactive circuits in the previous layer:
$$\epsilon^2_{\text{active}} \approx O\!\left(\sqrt{\frac{Td}{D}}\right)\epsilon^1_{\text{inactive}} = O\!\left(\sqrt{\frac{zTd^2}{D^2}}\right) \tag{6.6}$$
The typical noise in any inactive circuit is:
$$\epsilon^2_{\text{inactive}} = \mathring{\epsilon}_{\text{inactive}} + \tilde{\epsilon}^2_{\text{inactive}} = O\!\left(\sqrt{\frac{zd}{D}}\right) + O\!\left(\sqrt{\frac{zd}{D}}\right)\epsilon^1_{\text{active}} + O\!\left(\sqrt{\frac{zTd^2}{D^2}}\right)\epsilon^1_{\text{inactive}} \tag{6.7}$$
Assuming that the noise in the previous layer is small, the leading term is the embedding overlap error, $\mathring{\epsilon}$:
$$\epsilon^2_{\text{inactive}} \approx \mathring{\epsilon}_{\text{inactive}} = O\!\left(\sqrt{\frac{zd}{D}}\right) \tag{6.8}$$
Layer 3
$$\epsilon^3_{\text{active}} = \mathring{\epsilon}_{\text{active}} + \tilde{\epsilon}^3_{\text{active}} = O\!\left(\sqrt{\frac{(z-1)\,d}{D}}\right) + O(1)\,\epsilon^2_{\text{active}} + O\!\left(\sqrt{\frac{Td}{D}}\right)\epsilon^2_{\text{inactive}} \tag{6.9}$$
Now the last two terms are of the same size.
$$\epsilon^3_{\text{active}} \approx O\!\left(\sqrt{\frac{Td}{D}}\right)\sqrt{\left(\epsilon^1_{\text{inactive}}\right)^2 + \left(\epsilon^2_{\text{inactive}}\right)^2} \approx O\!\left(\sqrt{\frac{2Td}{D}}\right)\mathring{\epsilon}_{\text{inactive}} = O\!\left(\sqrt{\frac{2zTd^2}{D^2}}\right) \tag{6.10}$$
For the same reason as in the previous layer, the inactive error is
$$\epsilon^3_{\text{inactive}} \approx \mathring{\epsilon}_{\text{inactive}} = O\!\left(\sqrt{\frac{zd}{D}}\right) \tag{6.11}$$
From here on, it just keeps going like this.
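To see the pattern continue, here is a small sketch (our own code, with arbitrary example parameters) that iterates the leading-order recursion from Equations (3.5)-(3.6) and (4.9)-(4.10), with all constants hidden in the $O(\cdot)$’s set to 1 and independent error components added in quadrature:

```python
import numpy as np

def leading_errors(D, d, T, z, L):
    """Iterate the leading-order error recursion per layer.
    Only tracks how the errors scale, not their exact size."""
    emb_active = np.sqrt((z - 1) * d / D)    # embedding overlap error, active circuits
    emb_inactive = np.sqrt(z * d / D)        # embedding overlap error, inactive circuits
    eps_active, eps_inactive = [0.0], [0.0]  # layer 0: no error yet
    for _ in range(L):
        prev_a, prev_i = eps_active[-1], eps_inactive[-1]
        prop_active = np.sqrt(prev_a**2 + (T * d / D) * prev_i**2)
        prop_inactive = np.sqrt((z * d / D) * prev_a**2 + (z * T * d**2 / D**2) * prev_i**2)
        eps_active.append(np.sqrt(emb_active**2 + prop_active**2))
        eps_inactive.append(np.sqrt(emb_inactive**2 + prop_inactive**2))
    return eps_active, eps_inactive

# Example parameters, chosen so that sqrt(z*T*d^2/D^2) = 0.5 < 1:
act, inact = leading_errors(D=1000, d=10, T=500, z=5, L=6)
for l, (a, i) in enumerate(zip(act, inact)):
    print(f"layer {l}: active ~ {a:.2f}, inactive ~ {i:.2f}")
# The active error grows roughly like sqrt((l-1) * z * T * d^2 / D^2) from layer 2 on,
# while the inactive error stays close to sqrt(z * d / D).
```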
Worst-case errors vs mean square errors
So far, our error calculations have only dealt with mean square errors. However, we also need to briefly address worst-case errors. These are the reason we need an embedding redundancy $S > 1$ in the construction.
For this, we’re mainly concerned with the embedding overlap error, because that error term comes from just a few sources (the $z$ active circuits), which means its variance may be high. In contrast, the propagation error comes from adding up many smaller contributions (the approximately $\frac{zS^2Td}{D}$[19] circuits that have non-zero embedding overlap error in the previous layer), so we expect it to be well behaved, i.e. well described by the previous mean square error calculations.
The worst-case embedding overlap error happens if some circuit is unlucky enough to be an embedding neighbour to all $z$ active circuits. For active circuits, the maximum number of active neighbours is $(z-1)$, since a circuit can’t be its own neighbour.
From (3.1) and (1.13), we calculate
$$\max\!\left[\left|\mathring{\epsilon}_{\text{active}}\right|\right] = O\!\left(\frac{z-1}{S}\right) \tag{6.12}$$
$$\max\!\left[\left|\mathring{\epsilon}_{\text{inactive}}\right|\right] = O\!\left(\frac{z}{S}\right) \tag{6.13}$$
So, as long as the embedding redundancy $S$ is sufficiently large compared to the number of active circuits $z$, we should be fine.
Summary:
The main source of error is signal from the active circuits bleeding over into the inactive ones, which then enters back into the active circuits as noise in the next layer.
The noise in the active circuits accumulates from layer to layer. The noise in the inactive circuits does not accumulate.
At layer 0, there are no errors, because nothing has happened yet.
$$\epsilon^0_{\text{active}} = 0 \tag{6.14}$$
$$\epsilon^0_{\text{inactive}} = 0 \tag{6.15}$$
At layer 1 and onward, the leading term for the error on inactive circuits is:
$$\epsilon^l_{\text{inactive}} = O\!\left(\sqrt{\frac{zd}{D}}\right) \quad \text{for } l \geq 1 \tag{6.16}$$
At layer 1, the leading term for the error on active circuits is:
$$\epsilon^1_{\text{active}} = O\!\left(\sqrt{\frac{(z-1)\,d}{D}}\right) \tag{6.17}$$
But from layer 2 onward, the leading term for the error on active circuits is:
$$\epsilon^l_{\text{active}} = O\!\left(\sqrt{\frac{(l-1)\,zTd^2}{D^2}}\right) \quad \text{for } l \geq 2 \tag{6.18}$$
Discussion
Noise correction/suppression is necessary
Without any type of noise correction or error suppression, the error on the circuit activations would grow by a factor of $O\!\left(\sqrt{\frac{Td}{D}}\right)$ per layer.
$Td$ is the total number of neurons per layer for all small networks combined, and $D$ is the number of neurons per layer in the large network. If $Td \leq D$ then we might as well encode one feature per neuron, and not bother with superposition. So, we assume $Td > D$, ideally even $Td \gg D$.
In our construction, the way we suppress errors is to use the flat part of the ReLU, both to clear away noise in inactive circuits, and to prevent noise from moving between inactive circuits. Specifically, we assumed in assumption 2 of our construction that each small circuit is somewhat noise robust, such that any network neuron that is not connected to a currently active circuit will be inactive (i.e. the ReLU pre-activation is $\leq 0$), provided that the error on the circuit activations in the preceding layer is small enough. This means that for the error to propagate to the next layer, it has to pass through a very small fraction of ‘open’ neurons, which is what keeps the error down to a more manageable $O\!\left(\sqrt{\frac{zTd^2}{D^2}}\right)$ per layer.
However, we do not in general predict sparse ReLU activations for networks implementing computation in superposition
The above might then seem to predict a very low activation rate for neurons in LLMs and other neural networks, if they are indeed implementing computation in superposition. That’s not what we see in real large networks: e.g. MLP neurons in gpt2 have an activation rate of about 20%, much higher than in our construction.
But this predicted low activation rate is actually just an artefact of the specific setup for computation in superposition we present here. Instead of suppressing noise with the flat part of a single ReLU function, we can also create a flat slope using combinations of multiple active neurons. E.g. a network could combine two ReLU neurons to create the ‘meta activation function’ $f(x) = \mathrm{ReLU}(x) - \mathrm{ReLU}(x-1)$. This combined function is flat for both $x < 0$ (both neurons are ‘off’) and $x > 1$ (both neurons are ‘on’). We can then embed circuit neurons into different ‘meta neurons’ $f$ instead of embedding them into the raw network neurons.
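As a minimal illustration (our own sketch, not code from any experiment mentioned in this post):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def meta_neuron(x):
    """'Meta activation function' built from two ReLU neurons.
    Flat (and hence noise-suppressing) both for x < 0 and for x > 1."""
    return relu(x) - relu(x - 1.0)

xs = np.array([-0.5, 0.0, 0.3, 0.7, 1.0, 1.5])
print(meta_neuron(xs))  # [0.  0.  0.3 0.7 1.  1. ] -- flat regions on both sides
```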
At a glance, this might seem inefficient compared to using the raw ReLUs, but we don’t think it necessarily is. If f is a more suitable activation function for implementing the computation of many of the circuits, those circuits might effectively have a smaller width d under this implementation. The network might even mix and match different ‘meta activation functions’ like this in the same layer to suit different circuits.
But we do tentatively predict that circuits only use small subsets of network neurons
So, while neurons in the network don’t have to activate sparsely, it is necessary that each circuit only uses a small fraction of network neurons for anything like this construction to work. This is because any connection that lets through signal will also let through noise, and at least the neurons[20] used by active circuits must let signal through, or else they won’t be of any use.
Getting around this would require some completely different kind of noise reduction. It seems difficult to do this using MLPs alone. Perhaps operations like layer norm and softmax can help with noise reduction somehow, but that’s not something we have investigated yet.
Linda: But using few neurons per circuit does just seem like a pretty easy way to reduce noise, so I expect that networks implementing computation in superposition would do this to some extent, whether or not they also use other noise suppression methods. I have some very preliminary experimental evidence that maybe supports this conclusion[21]. More on that in a future post, hopefully.
Acknowledgements
This work was supported by Open Philanthropy.
The previous post had annoying log factors in the formulas everywhere. Here, we get to do away with those.
The linked blogpost has a section called “Computation in Superposition”. However, on closer inspection, this section only presents a model with one layer in superposition. See the section “Implications for experiments in computation in superposition” for why this is insufficient.
This result also seems to hold if the circuits don’t have a uniform width d across the L layers. However, it might not straightforwardly hold if different circuits interact with each other, e.g. if some circuits take the outputs of previous circuits as inputs. We think this makes intuitive sense from an information-theoretic perspective as well. If circuits interact, we have to take those interactions into account in the description length.
We get this formula from requiring the noise derived in Some costs of superposition to be small.
By ‘layers’ we mean $A^1, A^2, \ldots, A^L$, as defined in equation (1.1). So, ‘layer 2’ refers to $A^2$ or any linear readouts of $A^2$. We don’t count $A^0$ in the indexing because no computation has happened yet at that point, and when training toy models of computation in superposition $A^0$ will often not even be explicitly represented.
If we want to study this noise, but don’t want to construct a complicated multi-layer toy model, we can add some noise to the inputs, to simulate the situation we’d be in at a later layer.
If circuits are doing similar things, this lets us reuse some computational machinery between circuits, but it can also make worst-case errors worse if we’re not careful, because errors in different circuits can add systematically instead of randomly.
The results here can be generalised to networks with residual streams, although for this, the embedding has to be done differently from the post linked above, otherwise the error propagation will not be contained.
In other words, we want the outputs of all $T$ circuits to be $\epsilon$-linearly represented.
You might notice that for both assumptions 2 and 3 to hold simultaneously, each small network needs a negative bias. However, we did not include a separate bias term in the construction, and the bias can’t be baked into $w$ (as is often done) because that would require one of the neurons to be a non-zero constant, which contradicts assumption 2.
This is one of several ways that reality turned out to be more complicated than our theory. The good news is that this can be dealt with, in a way that preserves the general result, but that is beyond the scope of this post.
This involves adding an extra neuron to the circuit, i.e. increasing d by 1.
We’ll deal with the consequences of this pretence in the Error calculation section.
There exists a theoretical upper bound on $S$: $S(S-1) \leq \frac{\frac{D}{d}\left(\frac{D}{d}-1\right)}{T}$. However, the proof[22][23] of this upper bound is not constructive, so there is no guarantee that it can be reached.
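For example (with made-up numbers): if $\frac{D}{d} = 100$ and $T = 1000$, the bound gives $S(S-1) \leq \frac{100 \cdot 99}{1000} = 9.9$, i.e. $S \leq 3$.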
Our allocation algorithm falls short of this upper bound. If you think you can devise a construction that gets closer to it, we’d love to hear from you.
We assume for simplicity that D is divisible by d.
Linda initially thought this set would include only primes. Thanks to Stefan Heimersheim for pointing out that more numbers could be permitted.
This row could instead be
which would give us more possible allocations (i.e. larger $T$ for a given $S$ and $\frac{D}{d}$, or larger possible $S$ for a given $T$ and $\frac{D}{d}$). But that would result in a more uneven distribution of how much each neuron is used.
The later calculations assume that the allocation is evenly distributed over neurons. Dropping that assumption would make both the calculation harder and the errors larger. Getting to increase $S$ is probably not worth it.
Proof of (a.3):
Case: $i = j$
We know that (a.3) must be true in this case because $\mathrm{step}_x \neq \mathrm{step}_y$.
Case: $i \neq j$
From (a.2) we get that
$$\mathrm{step}_x = p S^n; \quad \mathrm{step}_y = q S^m \quad \text{for } p, q \in P_S;\ n, m \in \mathbb{N}_0 \tag{a.4}$$
We have proved (a.3) in this case if we can prove that
$$i\,p\,S^n \neq j\,q\,S^m \tag{a.5}$$
We also know that (a.5) must be true in this case, because $iS^n$ will differ from $jS^m$ in at least one prime factor that is smaller than $S$. $p$ and $q$ can’t make up for that difference, since by definition (a.1) they don’t contain prime factors smaller than $S$.
We initially did not think of this, and only noticed the importance of re-shuffling from layer to layer when implementing this construction in code.
In the error calculation, when calculating how much noise is transported from inactive circuits to active circuits, we assume no correlation between the amount of noise in the inactive circuits and the extent to which they share neurons with active circuits. But the noise depends on how much neuron overlap they had with active circuits in the previous layer. Therefore this assumption will be false if we don’t re-shuffle the neuron allocation from layer to layer.
Not only will our calculation be wrong (this could be solved with more calculations), but the errors will also be much larger, which simply makes for a worse construction.
The probability of two circuits sharing one large network neuron (per small circuit neuron) is $\frac{S^2 d}{D}$. Given that there are $T$ circuits in total, this gives us $\frac{S^2 T d}{D}$ ‘neighbour’ circuits for each small circuit. Since there are $z$ active circuits, there are approximately $\frac{z S^2 T d}{D}$ active circuit neighbours.
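For example (with made-up numbers): $D = 1000$, $d = 10$, $S = 2$, $T = 2000$ and $z = 5$ give roughly $\frac{zS^2Td}{D} = \frac{5 \cdot 4 \cdot 2000 \cdot 10}{1000} = 400$ active circuit neighbours.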
Or ‘meta neurons’ like the function f we discussed above.
Linda: I trained some toy models of superposition with only one computational layer. This did not result in circuits connecting sparsely to the network’s neurons. Then I trained again with some noise added to the network inputs (in order to simulate the situation in a 2+ layer network doing computation in superposition), to see how the network would learn to filter it. This did result in circuits connecting sparsely to the network’s neurons.
This suggests to me that there is no completely different way to filter superposition noise in an MLP that we haven’t thought of yet. So networks doing computation in superposition would basically be forced to connect circuits to neurons sparsely to deal with the noise, as the math in this post suggests.
However, this experimental investigation is still a work in progress.[24]
Proof:
There are $\frac{D}{d}\left(\frac{D}{d}-1\right)$ ordered pairs of neurons in the set of $\frac{D}{d}$ neurons. Each small circuit is allocated $S$ neurons out of that set, accounting for $S(S-1)$ pairs. No two small circuits can share a pair, which gives us the bound $T \leq \frac{\frac{D}{d}\left(\frac{D}{d}-1\right)}{S(S-1)}$.
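To illustrate the counting argument (this only checks the no-shared-pair condition, not the other requirements on the allocation; the example allocation is a hypothetical toy case):

```python
from itertools import combinations

def check_allocation(allocation, n_neurons, S):
    """Verify that no two circuits share a pair of neurons, and compare the
    number of circuits T to the pair-counting bound n(n-1) / (S(S-1))."""
    seen_pairs = set()
    for circuit in allocation:
        assert len(circuit) == S
        for pair in combinations(sorted(circuit), 2):
            assert pair not in seen_pairs, "two circuits share a pair of neurons"
            seen_pairs.add(pair)
    T = len(allocation)
    bound = n_neurons * (n_neurons - 1) / (S * (S - 1))
    assert T <= bound
    return T, bound

# Toy example with D/d = 7 and S = 3: the Fano plane, which happens to attain
# the bound T = 7 * 6 / (3 * 2) = 7 for these particular parameters.
fano = [{0, 1, 2}, {0, 3, 4}, {0, 5, 6}, {1, 3, 5}, {1, 4, 6}, {2, 3, 6}, {2, 4, 5}]
print(check_allocation(fano, n_neurons=7, S=3))  # (7, 7.0)
```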
I (Linda) first got this bound and proof from ChatGPT (free version). According to ChatGPT it is a “known upper bound (due to the Erdős–Ko–Rado-type results)”.
My general experience is that ChatGPT is very good at finding known theorems (i.e. known to someone else, but not to me) that apply to any math problem I give it.
I also gave this problem to Claude as an experiment (some level of paid version offered by a friend). Claude tried to figure it out itself, but kept getting confused and just produced a lot of nonsense.
These networks were trained on L2 loss, which is probably the wrong loss function for incentivising superposition. When using the L2 loss, the network doesn’t care much about separating different circuits. It’s happy to just embed two circuits right on top of each other into the same set of network neurons. I don’t really consider this to be computation in superposition. However, this should not affect the need for the network to prevent noise amplification, which is why I think these results are already some weak evidence for the prediction.
I’ll make a better toy setup and hopefully present the results of these experiments in a future post.