Incidental polysemanticity

This is a preliminary research report; we are still building on initial work and would appreciate any feedback.

Summary

Polysemantic neurons (neurons that activate for a set of unrelated features) have been seen as a significant obstacle towards interpretability of task-optimized deep networks,^[1] with implications for AI safety.

The classic origin story of polysemanticity is that the data contains more “features” than there are neurons, such that learning to solve a task forces the network to allocate multiple unrelated features to the same neuron, threatening our ability to understand the network’s internal processing.

In this work, we present a second and non-mutually exclusive origin story of polysemanticity. We show that polysemanticity can arise incidentally, even when there are ample neurons to represent all features in the data, using a combination of theory and experiments. This second type of polysemanticity occurs because random initialization can, by chance alone, initially assign multiple features to the same neuron, and the training dynamics then strengthen such overlap. Due to its origin, we term this incidental polysemanticity.

Intuition

The reason why neural networks can learn anything despite starting out with completely random weights is that, just by random chance, some neurons will happen to be very slightly correlated^[2] with some useful feature, and this correlation gets amplified by gradient descent until the feature is accurately represented. If in addition to this there is some incentive for activations to be sparse, then the feature will tend to be represented by a single neuron as opposed to a linear combination of neurons: this is a winner-take-all dynamic.^[3] When a winner-take-all dynamic is present, then by default, the neuron that is initially most correlated with the feature will be the neuron that wins out and represents the feature when training completes.

Therefore, if at the start of training, one neuron happens to be the most correlated neuron with two unrelated features (say dogs and airplanes), then this might^[4] continue being the case throughout the learning process, and that neuron will ultimately end up taking full responsibility for representing both features. We call this phenomenon incidental polysemanticity. Here “incidental” refers to the fact that this phenomenon is contingent on the random initializations of the weights and the dynamics of training, rather than being necessary in order to achieve low loss (and in fact, in some circumstances, incidental polysemanticity might cause the neural network to get stuck in a local optimum).

How often should we expect this to happen? Suppose that we have $n$ useful features to represent and $m \geq n$ neurons to represent them with (so that it is technically possible for each feature to be represented by a different neuron). By symmetry, the probability that the $i^{th}$ and $j^{th}$ feature “collide”, in the sense of being initially most correlated with the same neuron, is exactly $1 / m$. And there are $\binom{n}{2} = n (n - 1) / 2$ pairs of features, so on average we should expect $\underbrace{\binom{n}{2}}_{\text{number of pairs } (i, j)} \times \underbrace{\frac{1}{m}}_{\text{probability of } (i, j) \text{ colliding}} = \frac{n (n - 1)}{2 m} = Θ (\frac{n^{2}}{m})$ collisions^[5] overall. In particular, this means that

if $m \leq O (n)$ (i.e. the number of neurons is at most a constant factor bigger than the number of features), then $Ω (n^{2} / n) = Ω (n)$ collisions will happen: a constant fraction of all neurons will be polysemantic;
as long as $m$ is significantly smaller than $n^{2}$ , we should expect several collisions to happen.

Our experiments in a toy model show that this is precisely what happens, and a constant fraction of these collisions result in polysemantic neurons, despite the fact that there would be enough neurons to avoid polysemanticity entirely.
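The counting argument above can be checked with a quick Monte Carlo sketch. This is only an illustration under stated assumptions: weights are drawn as i.i.d. Gaussians, the neuron with the largest-magnitude weight for a feature stands in for "the neuron initially most correlated with the feature", and the sizes `n`, `m`, `trials` and the function names are hypothetical choices, not taken from our experiments.

```python
import numpy as np


def expected_collisions(n: int, m: int) -> float:
    """Closed-form expectation n(n-1)/(2m) from the counting argument."""
    return n * (n - 1) / (2 * m)


def count_collisions(n: int, m: int, rng: np.random.Generator) -> int:
    """Count feature pairs whose largest-magnitude initial weight sits on the same neuron."""
    W = rng.normal(0.0, 1.0 / np.sqrt(m), size=(n, m))  # row i encodes feature i
    winners = np.argmax(np.abs(W), axis=1)               # assumed proxy for "most correlated neuron"
    per_neuron = np.bincount(winners, minlength=m)       # number of features claimed by each neuron
    # A neuron claimed by c features contributes c*(c-1)/2 colliding pairs.
    return int(np.sum(per_neuron * (per_neuron - 1) // 2))


rng = np.random.default_rng(0)
n, m, trials = 64, 256, 2000  # illustrative sizes only
average = float(np.mean([count_collisions(n, m, rng) for _ in range(trials)]))
predicted = expected_collisions(n, m)
print(f"simulated average: {average:.2f}, predicted: {predicted:.2f}")
```

Counting pairs (rather than neurons) matches the convention that a three-way collision counts as three collisions.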

Outline

In the rest of this post, we

set up a toy model for incidental polysemanticity;
study its winner-take-all dynamic in detail;
explore what happens over training when features collide, and confirm experimentally that we get as many polysemantic neurons as we expect;
discuss implications for mechanistic interpretability as well as the limitations of this work, and suggest interesting future work.

Setup

Model

We consider a model similar to the ReLU-output model in Toy Models of Superposition. It is an autoencoder with $n$ features (inputs/outputs) which

has weight tying between the encoder and the decoder (let $W \in R^{n \times m}$ be those weights),
uses a single hidden layer of size $m$ with $l_{1}$ regularization of parameter $λ$ on the activations,
has a ReLU on the output layer,
has no biases anywhere,
is trained with the $n$ standard basis vectors as data (so that the “features” are just individual input coordinates): that is, the input/output data pairs are $(e_{i}, e_{i})$ for $i \in [n]$ , where $e_{i} \in R^{n}$ is the $i^{t h}$ basis vector.

The output is computed as $\mathrm{ReLU} (W W^{T} x)$.
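As a minimal sketch of this forward pass (the function name `forward` and the small sizes are illustrative assumptions), the hidden activations are $W^{T} x$ and the output applies the same tied weights followed by a ReLU:

```python
import numpy as np


def forward(W: np.ndarray, x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Tied-weight autoencoder without biases: returns (output, hidden activations)."""
    hidden = W.T @ x                      # encoder: shape (m,)
    output = np.maximum(W @ hidden, 0.0)  # decoder with tied weights, then ReLU: shape (n,)
    return output, hidden


n, m = 4, 6  # illustrative sizes only
rng = np.random.default_rng(0)
W = rng.normal(0.0, 1.0 / np.sqrt(m), size=(n, m))
x = np.eye(n)[2]                          # the basis vector e_3 (index 2 when 0-indexed)
output, hidden = forward(W, x)
```

For a basis-vector input, the hidden activation is just the corresponding row of $W$, which is why the rows of $W$ can be read as the encodings of the features.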

The main difference compared to the model from Toy Models of Superposition is the $l_{1}$ regularization. The role of the $l_{1}$ regularization is to push for sparsity in the activations and therefore induce a winner-take-all dynamic. We picked this model because it makes incidental polysemanticity particularly easy to demonstrate and study, but we do think the story it tells is representative (see the “Discussion and future work” section for more on this).

We make the following assumptions on parameter values:

Assumption	Reason
the weights $W_{i k}$ are initialized to i.i.d. normals of mean $0$ and standard deviation $Θ (1 / \sqrt{m})$	so that the encodings $W_{i} \in R^{m}$ start out with constant length
$m \geq n$	just to make it clear that polysemanticity was not necessary in this case
$λ \leq 1 / \sqrt{m}$	so that the $l_{1}$ regularization doesn’t kill all weights immediately

Possible solutions

Let $W_{i} \in R^{m}$ be the $i^{th}$ row of $W$. It tells us how the $i^{th}$ feature is encoded in the hidden layer. When the input is $e_{i}$, the output of the model can then be written as $(\mathrm{ReLU} (W_{1} \cdot W_{i}), \dots, \mathrm{ReLU} (W_{n} \cdot W_{i})),$ so for this to be equal to $e_{i}$ we need $∥ W_{i} ∥^{2} = 1$^[6] and $W_{i} \cdot W_{j} \leq 0$ for $j \neq i$.

Let $f_{k} \in R^{m}$ denote the $k^{th}$ basis vector in $R^{m}$. There are both monosemantic and polysemantic solutions that satisfy these conditions:

One solution is to simply let $W_{i} := f_{i}$ : the $i^{t h}$ hidden neuron represents the $i^{t h}$ feature, and there is no polysemanticity.
But we could also have solutions where two features share the same neuron, with opposite signs. For example, for each $i \in [n / 2]$ , we could let $W_{2 i - 1} := f_{i}$ and $W_{2 i} := - f_{i}$ . This satisfies the conditions because $W_{2 i - 1} \cdot W_{2 i} = f_{i} \cdot (- f_{i}) = - 1 \leq 0$ .
In general, we can have a mixture of these where each neuron represents either $0$ , $1$ or $2$ features, in an arbitrary order.
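These candidate solutions can be checked directly against the two conditions. The sketch below is illustrative (the helper name `reconstructs_all_features` and the size `n = m = 6` are assumptions): it builds the monosemantic solution, the sign-paired polysemantic solution, and a deliberately broken variant where two features share a neuron with the same sign.

```python
import numpy as np


def reconstructs_all_features(W: np.ndarray, tol: float = 1e-12) -> bool:
    """True if ReLU(W W^T e_i) == e_i for every i, i.e. ReLU(W W^T) is the identity."""
    n = W.shape[0]
    return bool(np.allclose(np.maximum(W @ W.T, 0.0), np.eye(n), atol=tol))


n, m = 6, 6    # illustrative sizes only
F = np.eye(m)  # row k is the basis vector f_{k+1} of the hidden space

# Monosemantic solution: W_i := f_i.
W_mono = F[:n].copy()

# Polysemantic solution: W_{2i-1} := f_i and W_{2i} := -f_i (1-indexed in the text).
W_poly = np.zeros((n, m))
for i in range(n // 2):
    W_poly[2 * i] = F[i]
    W_poly[2 * i + 1] = -F[i]

# Counterexample: two features on the same neuron with the SAME sign interfere.
W_same_sign = W_mono.copy()
W_same_sign[1] = F[0]

print(reconstructs_all_features(W_mono),
      reconstructs_all_features(W_poly),
      reconstructs_all_features(W_same_sign))
```

The third case fails because $W_{1} \cdot W_{2} = 1 > 0$ survives the ReLU, which is exactly the situation the interference force penalizes.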

Loss and dynamics

Let us consider total squared error loss, which can be decomposed as $L = \sum_{i} \left( (1 - ∥ W_{i} ∥^{2})^{2} + \sum_{j \neq i} \mathrm{ReLU} (W_{i} \cdot W_{j})^{2} + λ ∥ W_{i} ∥_{1} \right).$ The training dynamics are $\frac{d W_{i}}{d t} := - \frac{\partial L}{\partial W_{i}} = \underbrace{4 (1 - ∥ W_{i} ∥^{2}) W_{i}}_{\text{feature benefit}} - \underbrace{4 \sum_{j \neq i} \mathrm{ReLU} (W_{i} \cdot W_{j}) W_{j}}_{\text{interference}} - \underbrace{λ \, \mathrm{sign} (W_{i})}_{\text{regularization}},$ where $t$ is the training time, which you can roughly think of as the number of training steps. For simplicity, we’ll ignore the constants $4$ going forward.^[7]

The right-hand side can be decomposed into three intuitive “forces” acting on the encodings $W_{i}$:

“feature benefit”: encodings want to have unit length;
“interference”: different encodings avoid pointing in similar directions;
“regularization”: encodings want to have small $l_{1}$ -norm (which pushes all nonzero weights towards zero with equal strength).
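To make the three forces concrete, here is a hedged sketch that writes the loss and the three terms in NumPy and checks by central finite differences that their sum equals $- \partial L / \partial W$ (with the constants $4$ kept). The function names, sizes and the value of `lam` are illustrative assumptions.

```python
import numpy as np


def loss(W: np.ndarray, lam: float) -> float:
    """Sum over i of (1 - |W_i|^2)^2 + sum_{j != i} ReLU(W_i . W_j)^2 + lam * |W_i|_1."""
    G = W @ W.T
    overlaps = np.maximum(G, 0.0)
    np.fill_diagonal(overlaps, 0.0)  # keep only the j != i terms
    return float(np.sum((1.0 - np.diag(G)) ** 2) + np.sum(overlaps ** 2) + lam * np.sum(np.abs(W)))


def forces(W: np.ndarray, lam: float):
    """Return (feature benefit, interference, regularization), whose sum is dW/dt."""
    G = W @ W.T
    overlaps = np.maximum(G, 0.0)
    np.fill_diagonal(overlaps, 0.0)
    feature_benefit = 4.0 * (1.0 - np.diag(G))[:, None] * W
    interference = -4.0 * overlaps @ W  # row i is -4 * sum_{j != i} ReLU(W_i . W_j) W_j
    regularization = -lam * np.sign(W)
    return feature_benefit, interference, regularization


rng = np.random.default_rng(0)
n, m, lam = 5, 8, 0.1  # illustrative values only
W = rng.normal(0.0, 1.0 / np.sqrt(m), size=(n, m))
velocity = sum(forces(W, lam))

# Central finite differences of the loss, one coordinate at a time.
eps = 1e-6
numeric_grad = np.zeros_like(W)
for i in range(n):
    for k in range(m):
        dW = np.zeros_like(W)
        dW[i, k] = eps
        numeric_grad[i, k] = (loss(W + dW, lam) - loss(W - dW, lam)) / (2 * eps)

max_error = float(np.max(np.abs(velocity + numeric_grad)))  # should be close to 0
```

The factor $4$ on the interference term comes from each unordered pair $\{i, j\}$ appearing twice in the double sum.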

The winning neuron takes it all

See our working notes (in particular, Feature benefit vs regularization) for a more formal treatment.

Sparsity force

For a moment, let’s ignore the interference force, and figure out how (and how fast) regularization will push towards sparsity in some encoding $W_{i}$ . Since we’re only looking at feature benefit and regularization, the other encodings $W_{j}$ have no influence at all on what happens in $W_{i}$ .

Assuming $∥ W_{i} ∥ < 1$ , each weight $W_{i k}$ is

pushed up with strength $(1 - ∥ W_{i} ∥^{2}) W_{i k}$ by the feature benefit force;
pushed down with strength $λ \, \mathrm{sign} (W_{i k})$ by the regularization.

Crucially, the upwards push is relative to how large $W_{i k}$ is, while the downwards push is absolute. This means that weights whose absolute value is above some threshold $θ$ will grow, while those below the threshold will shrink, creating a “rich get richer and poor get poorer” dynamic that will push for sparsity. This threshold is given by $(1 - ∥ W_{i} ∥^{2}) W_{i k} = λ \, \mathrm{sign} (W_{i k}) ⟺ | W_{i k} | = \frac{λ}{1 - ∥ W_{i} ∥^{2}} =: θ,$ so we have $\frac{d | W_{i k} |}{d t} = \underbrace{(1 - ∥ W_{i} ∥^{2}) | W_{i k} |}_{\text{feature benefit}} - \underbrace{λ \, \mathbb{1} [W_{i k} \neq 0]}_{\text{regularization}} = \begin{cases} \underbrace{(1 - ∥ W_{i} ∥^{2})}_{\text{constant in } k} \underbrace{(| W_{i k} | - θ)}_{\text{distance from threshold}} & \text{if } W_{i k} \neq 0 \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$

We call this combination of feature benefit and regularization force the sparsity force. It uniformly stretches the gaps between (the absolute values of) different nonzero weights.
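A small simulation can illustrate the sparsity force on a single encoding. This is a sketch under stated assumptions, not the code used for our simulations: it uses forward Euler with a hypothetical step size `dt`, drops the constants $4$, ignores interference, and pins a weight to zero once it crosses zero. The values `m = 64` and `lam = 0.01` are illustrative.

```python
import numpy as np


def sparsify(w0: np.ndarray, lam: float, dt: float = 0.1, max_steps: int = 30_000):
    """Euler-integrate dw/dt = (1 - |w|^2) w - lam * sign(w) until at most one weight is nonzero."""
    w = w0.copy()
    steps_taken = 0
    while steps_taken < max_steps and np.count_nonzero(w) > 1:
        sq_norm = float(w @ w)
        new_w = w + dt * ((1.0 - sq_norm) * w - lam * np.sign(w))
        # A weight that changes sign has hit zero; it stays there because sign(0) = 0.
        new_w[np.sign(new_w) != np.sign(w)] = 0.0
        w = new_w
        steps_taken += 1
    return w, steps_taken


rng = np.random.default_rng(0)
m, lam = 64, 0.01  # illustrative values with lam <= 1 / sqrt(m)
w0 = rng.normal(0.0, 0.9 / np.sqrt(m), size=m)
w_final, steps_taken = sparsify(w0, lam)

print("nonzero weights left:", np.count_nonzero(w_final))
print("winner is the initially largest weight:",
      int(np.argmax(np.abs(w_final))) == int(np.argmax(np.abs(w0))))
```

Because the force stretches the gaps between the absolute values of all nonzero weights by the same factor, the ordering of the surviving weights never changes in this interference-free setting, so the initially largest weight is the one left standing.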

Note that the threshold $θ$ is not fixed: we will see that as $W_{i}$ gets sparser, $∥ W_{i} ∥^{2}$ will get closer to $1$ , which increases the threshold and allows it to get rid of larger and larger entries, until only one is left. But how fast will this go?

How fast does it sparsify?

The next two subsections are not critical for understanding the overall message; feel free to skip directly to the section titled “Interference arbiters collisions between features” if you’re happy with just accepting the fact that $W_{i}$ will progressively sparsify over some predictable length of training time.

In order to track how fast $W_{i}$ sparsifies, we will look at its $l_{1}$ norm $∥ W_{i} ∥_{1} = \sum_{k} | W_{i k} |$ as a proxy for how many nonzero coordinates are left. Indeed, we will have $∥ W_{i} ∥ \approx 1$ throughout, so if $W_{i}$ has $m'$ nonzero values at any point in time, their typical value will be $\pm 1 / \sqrt{m'}$, which means $∥ W_{i} ∥_{1} \approx m' \cdot \frac{1}{\sqrt{m'}} = \sqrt{m'}$.

Since the sparsity force is proportional to $1 - ∥ W_{i} ∥^{2}$, we need to get a sense of what values $∥ W_{i} ∥$ will take over time. As it turns out, $∥ W_{i} ∥$ changes relatively slowly, so we can get useful information by assuming the derivative $\frac{d ∥ W_{i} ∥^{2}}{d t}$ is $0$: $0 \approx \frac{d ∥ W_{i} ∥^{2}}{d t} = 2 \frac{d W_{i}}{d t} \cdot W_{i} = 2 \left( \underbrace{(1 - ∥ W_{i} ∥^{2}) ∥ W_{i} ∥^{2}}_{\text{from feature benefit}} - \underbrace{λ ∥ W_{i} ∥_{1}}_{\text{from regularization}} \right),$ which means $1 - ∥ W_{i} ∥^{2} \approx \frac{λ ∥ W_{i} ∥_{1}}{∥ W_{i} ∥^{2}}.$ Plugging this back into $\frac{d ∥ W_{i} ∥_{1}}{d t} = \sum_{k} \frac{d | W_{i k} |}{d t}$ and using reasonable assumptions about the initial distribution of $W_{i}$ (see our working notes for details), we can prove that $∥ W_{i} ∥_{1}$ will decrease as $1 / (λ t)$ with training time $t$: $∥ W_{i} (t) ∥_{1} = \begin{cases} Θ (\sqrt{m}) & t \leq \frac{1}{λ \sqrt{m}} \\ Θ (1 / (λ t)) & \frac{1}{λ \sqrt{m}} \leq t \leq \frac{1}{λ} \\ Θ (1) & t \geq \frac{1}{λ}. \end{cases}$ Correspondingly, if we approximate the number $m'$ of nonzero coordinates as $∥ W_{i} ∥_{1}^{2}$, it will start out at $m$, decrease as $1 / (λ t)^{2}$, then reach $1$ at training time $t = Θ (1 / λ)$.
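For reference, the piecewise prediction can be written as a short function. This sketch assumes that all constants hidden in $Θ (\cdot)$ equal $1$, in which case the three regimes collapse into clipping $1 / (λ t)$ to the interval $[1, \sqrt{m}]$; the function names are hypothetical.

```python
import numpy as np


def predicted_l1_norm(t, m: float, lam: float):
    """Predicted |W_i(t)|_1 with all Theta-constants set to 1 (an assumption)."""
    t = np.asarray(t, dtype=float)
    with np.errstate(divide="ignore"):  # t = 0 gives +inf, which is then clipped to sqrt(m)
        middle_regime = 1.0 / (lam * t)
    return np.clip(middle_regime, 1.0, np.sqrt(m))


def predicted_nonzero_count(t, m: float, lam: float):
    """Predicted number m' of nonzero coordinates, approximated as |W_i(t)|_1 squared."""
    return predicted_l1_norm(t, m, lam) ** 2


m, lam = 1e5, 1e-5  # the parameter values quoted for the numerical simulations
times = np.array([0.0, 1e3, 1e6])
l1_values = predicted_l1_norm(times, m, lam)
nonzero_counts = predicted_nonzero_count(times, m, lam)
```

With these parameters the breakpoints sit at $t = 1 / (λ \sqrt{m}) \approx 316$ and $t = 1 / λ = 10^{5}$, so the three sample times fall in the first, second and third regime respectively.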

Numerical simulations

We compared our theoretical predictions for $∥ W_{i} ∥_{1}$ and $m^{'}$ (if the constants hidden in $Θ (\cdot)$ are assumed to be $1$ ) to their actual values over training time when the interference force is turned off. The specific values of parameters are $m := 10^{5}$ and $λ := 10^{- 5}$ , and the standard deviation of the $W_{i k}$ ’s was $\frac{0.9}{\sqrt{m}}$ .

Code is available here.

Interference arbiters collisions between features

What happens when you bring the interference force into this picture? In this section, we argue informally that the interference is initially weak if $m \geq n$ , and only becomes significant later on in training, in cases where two of the encodings $W_{i}$ and $W_{j}$ have a coordinate $k$ such that $W_{i k}$ and $W_{j k}$ are both large and have the same sign—when that’s the case, the larger of the two wins out.

How strong is the interference?

First, observe that in the expression for the interference force on $W_{i}$, $- \sum_{j \neq i} \mathrm{ReLU} (W_{i} \cdot W_{j}) W_{j},$ each $W_{j}$ contributes only if the angle it forms with $W_{i}$ is less than $90^{\circ}$. So the force will mostly point in the direction opposite to $W_{i}$. That means that we can get a good grasp on its strength by measuring its component in the direction of $W_{i}$, which we can do by taking an inner product with $W_{i}$.

We have $\left( \sum_{j \neq i} \mathrm{ReLU} (W_{i} \cdot W_{j}) W_{j} \right) \cdot W_{i} = \sum_{j \neq i} \mathrm{ReLU} (W_{i} \cdot W_{j}) (W_{i} \cdot W_{j}) = \sum_{j \neq i} \mathrm{ReLU} (W_{i} \cdot W_{j})^{2}.$ Initially, each encoding is a vector of $m$ i.i.d. normals of mean $0$ and standard deviation $Θ (1 / \sqrt{m})$, so the distribution of the inner products $W_{i} \cdot W_{j}$ is symmetric around $0$ and also has standard deviation $Θ (1 / \sqrt{m})$. This means that $\mathrm{ReLU} (W_{i} \cdot W_{j})^{2}$ has mean $Θ (1 / m)$, and thus the sum has mean $Θ (n / m)$. As long as $m \geq n$, this is dominated by the feature benefit force: indeed, the same computation for the feature benefit gives $((1 - ∥ W_{i} ∥^{2}) W_{i}) \cdot W_{i} = (1 - ∥ W_{i} ∥^{2}) ∥ W_{i} ∥^{2} = Θ (1)$ as long as $Ω (1) \leq ∥ W_{i} ∥^{2} \leq 1 - Ω (1)$.
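The scaling of the interference at initialization can be checked numerically. In this sketch the standard deviation is taken to be $c / \sqrt{m}$ with a hypothetical constant `c = 0.7` (so that $∥ W_{i} ∥^{2}$ stays away from both $0$ and $1$), and `n`, `m` are illustrative. Under this assumption $W_{i} \cdot W_{j}$ has variance $c^{4} / m$, and by symmetry the sum should average $(n - 1) c^{4} / (2 m)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, c = 200, 400, 0.7  # illustrative sizes; c is a hypothetical initialization scale
W = rng.normal(0.0, c / np.sqrt(m), size=(n, m))

G = W @ W.T
overlaps = np.maximum(G, 0.0)
np.fill_diagonal(overlaps, 0.0)                  # keep only the j != i terms
interference = np.sum(overlaps ** 2, axis=1)     # sum_{j != i} ReLU(W_i . W_j)^2, one value per i

sq_norms = np.sum(W ** 2, axis=1)
feature_benefit = (1.0 - sq_norms) * sq_norms    # (1 - |W_i|^2) |W_i|^2, one value per i

mean_interference = float(np.mean(interference))
predicted = (n - 1) * c ** 4 / (2 * m)           # half of Var(W_i . W_j), summed over j != i
mean_feature_benefit = float(np.mean(feature_benefit))
print(f"interference: {mean_interference:.4f} (predicted {predicted:.4f}), "
      f"feature benefit: {mean_feature_benefit:.4f}")
```

Note that the comparison is sensitive to the constants: the ratio depends on both $n / m$ and the initialization scale, so "dominated" here should be read in the $Θ (\cdot)$ sense.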

Moreover, over time, the positive inner products $W_{i} \cdot W_{j} > 0$ will tend to decrease exponentially. This is because the interference force on $W_{i}$ includes the term $- \mathrm{ReLU} (W_{i} \cdot W_{j}) W_{j}$ and the interference force on $W_{j}$ includes the term $- \mathrm{ReLU} (W_{i} \cdot W_{j}) W_{i}$. Together, they affect $W_{i} \cdot W_{j}$ as $(- \mathrm{ReLU} (W_{i} \cdot W_{j}) W_{j}) \cdot W_{j} + (- \mathrm{ReLU} (W_{i} \cdot W_{j}) W_{i}) \cdot W_{i} = - (W_{i} \cdot W_{j}) (∥ W_{i} ∥^{2} + ∥ W_{j} ∥^{2}) = - Θ (W_{i} \cdot W_{j})$ as long as $∥ W_{i} ∥^{2}, ∥ W_{j} ∥^{2} = Θ (1)$, which is definitely the case at the start and will continue to hold true throughout training.

Benign and malign collisions

On the other hand, the interference between two encodings $W_{i}$ and $W_{j}$ starts to matter significantly when it affects one coordinate much more strongly than the others (rather than affecting all coordinates proportionally, like the feature benefit force does). This is the case when $W_{i}$ and $W_{j}$ share only one nonzero coordinate: a single $k$ such that $W_{i k}, W_{j k} \neq 0$. Indeed, when that’s the case, the interference force $- \mathrm{ReLU} (W_{i} \cdot W_{j}) W_{j}$

only affects the coordinates of $W_{i}$ that are nonzero in $W_{j}$,
and will probably not be strong enough to counter the $l_{1}$-regularization and revive coordinates of $W_{i}$ that are currently zero,

so only $W_{i k}$ can be affected by this force.

When this happens, there are two cases:

If $W_{i k}$ and $W_{j k}$ have opposite signs, we have $W_{i} \cdot W_{j} = W_{i k} W_{j k} < 0$ , so nothing actually happens, since the ReLU clips this to $0$ . Let’s call this a benign collision.
If $W_{i k}$ and $W_{j k}$ have the same sign, we have $W_{i} \cdot W_{j} = W_{i k} W_{j k} > 0$ , and both weights will be under pressure to shrink, with strength $- W_{i k} W_{j k}^{2}$ and $- W_{i k}^{2} W_{j k}$ respectively. Depending on their relative size, one or both of them will quickly drop to $0$ , thus putting the $k^{t h}$ neuron out of the running in terms of representing the corresponding features. Let’s call this a malign collision.

Polysemanticity will happen when the largest^[8] coordinates in encodings $W_{i}$ and $W_{j}$ get into a benign collision. This happens with probability $\underbrace{\frac{1}{m}}_{\text{largest weight in } W_{i} \text{ is also largest in } W_{j}} \times \underbrace{\frac{1}{2}}_{\text{they have opposite signs}} = \frac{1}{2 m},$ so we should expect roughly $\binom{n}{2} \times \frac{1}{2 m} \sim \frac{n^{2}}{4 m}$ polysemantic neurons by the end.
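The $\frac{1}{2 m}$ estimate can be sanity-checked at initialization with a short Monte Carlo sketch. It only counts pairs of features whose largest-magnitude initial weights share a neuron with opposite signs; it does not train anything, so it says nothing about how many of these pairs survive training. The sizes and function name are illustrative assumptions.

```python
import numpy as np


def count_opposite_sign_collisions(n: int, m: int, rng: np.random.Generator) -> int:
    """Count feature pairs whose largest-magnitude weights share a neuron with opposite signs."""
    W = rng.normal(0.0, 1.0 / np.sqrt(m), size=(n, m))
    winners = np.argmax(np.abs(W), axis=1)                 # neuron holding each feature's largest weight
    signs = np.sign(W[np.arange(n), winners])              # sign of that largest weight
    positive = np.bincount(winners[signs > 0], minlength=m)
    negative = np.bincount(winners[signs < 0], minlength=m)
    # On each neuron, every (positive, negative) pair of features is one benign collision.
    return int(np.sum(positive * negative))


rng = np.random.default_rng(0)
n, m, trials = 64, 256, 2000  # illustrative sizes only
average = float(np.mean([count_opposite_sign_collisions(n, m, rng) for _ in range(trials)]))
predicted = n * (n - 1) / (4 * m)  # binom(n, 2) * 1 / (2m)
print(f"simulated average: {average:.2f}, predicted: {predicted:.2f}")
```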

Experiments

Training the model we described with $n := 256$ and $m$ ranging from $256$ to $4096$ shows that this trend of $Θ (\frac{n^{2}}{m})$ does hold, and the constant $\frac{1}{4}$ seems to be fairly accurate as well.
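For readers who want to try a small version of this experiment, here is a rough sketch. It is not the code behind the results above: it swaps in proximal gradient descent (a gradient step on the smooth part of the loss followed by soft-thresholding for the $l_{1}$ term) so that weights can become exactly zero, and the learning rate, step count, sizes and the argmax-based definition of a polysemantic neuron are all assumptions made for illustration.

```python
import numpy as np


def soft_threshold(A: np.ndarray, tau: float) -> np.ndarray:
    """Proximal operator of tau * |.|_1: shrink every entry towards zero by tau."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)


def train(n: int, m: int, lam: float, lr: float = 0.01, steps: int = 10_000, seed: int = 0) -> np.ndarray:
    """Proximal gradient descent on the toy loss (an assumed stand-in for plain gradient descent)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 0.9 / np.sqrt(m), size=(n, m))
    for _ in range(steps):
        G = W @ W.T
        overlaps = np.maximum(G, 0.0)
        np.fill_diagonal(overlaps, 0.0)
        # Gradient of the smooth part: feature benefit and interference terms.
        smooth_grad = -4.0 * (1.0 - np.diag(G))[:, None] * W + 4.0 * overlaps @ W
        W = soft_threshold(W - lr * smooth_grad, lr * lam)
    return W


def count_polysemantic_neurons(W: np.ndarray) -> int:
    """Count neurons that hold the largest-magnitude weight of at least two features."""
    alive = np.max(np.abs(W), axis=1) > 0.0  # skip features whose encoding collapsed to zero
    winners = np.argmax(np.abs(W[alive]), axis=1)
    return int(np.sum(np.bincount(winners, minlength=W.shape[1]) >= 2))


n, m, lam = 16, 32, 0.05  # small illustrative run, much smaller than the experiments above
W_trained = train(n, m, lam)
polysemantic = count_polysemantic_neurons(W_trained)
print("polysemantic neurons:", polysemantic, "| n^2 / (4m) =", n * n / (4 * m))
```

A single small run like this is noisy, so comparing against $\frac{n^{2}}{4 m}$ would require averaging over many seeds and larger sizes.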

Discussion and future work

Implications for mechanistic interpretability

The fact that there are two completely different ways for polysemanticity to occur could have important consequences on how to deal with it.

To our knowledge, polysemanticity has mostly been studied in settings where the encoding space has no privileged basis: the space can be arbitrarily rotated without changing the dynamics, and in particular the corresponding layer doesn’t have non-linearities or any regularization other than $l_{2}$ . In this setting, the features can be represented arbitrarily in the encoding space, and we usually observe interference (non-orthogonal encodings) only when there are more features than dimensions.

On the other hand, the incidental polysemanticity we have demonstrated here is inherently tied to the canonical basis, contingent on the random initialization and dynamics, and happens even when there are significantly more dimensions available than features.

This means that some tools that work against one type of polysemanticity might not work against the other. For example:

Tools that make assumptions about the linear structure of the encodings might not work as well when non-linearities are present.
A costly but technically feasible way to get rid of superposition when there is no privileged basis is to just increase the number of neurons so that it matches the number of features. It is much less realistic to do away with incidental polysemanticity in this way, since we saw that it can happen until the number of neurons is roughly equal to the number of features squared.
On the other hand, since incidental polysemanticity is contingent on the random initializations and the dynamics of training, it could be solved by nudging the trajectory of learning in various ways, without necessarily changing anything about the neural architecture.
- As an example, here is one possible way one might get rid of incidental polysemanticity in a neuron that currently represents two features $i$ and $j$ . Duplicate that neuron, divide its outgoing weights by $2$ (so that this doesn’t affect downstream layers), add a small amount of noise to the incoming weights of each copy, then run gradient descent for a few more steps. One might hope that this will cause the copies to diverge away from each other, with one of the copies eventually taking full ownership of feature $i$ while the other copy takes full ownership of feature $j$ .
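The duplication step of this idea can be sketched as follows. This is a hypothetical illustration for a generic (untied) layer with incoming weights `W_in` and outgoing weights `W_out`, not something we have tested: the function name, shapes and noise scale are assumptions, and the subsequent gradient descent steps are left out.

```python
import numpy as np


def split_neuron(W_in: np.ndarray, W_out: np.ndarray, k: int,
                 noise_std: float, rng: np.random.Generator):
    """Duplicate hidden neuron k, halve both copies' outgoing weights, perturb their incoming weights.

    W_in has shape (hidden, inputs) and W_out has shape (outputs, hidden).
    """
    incoming = np.vstack([W_in, W_in[k:k + 1]])      # the copy becomes the last hidden neuron
    outgoing = np.hstack([W_out, W_out[:, k:k + 1]])
    outgoing[:, [k, -1]] *= 0.5                       # downstream input is unchanged before noise
    incoming[[k, -1]] += rng.normal(0.0, noise_std, size=(2, W_in.shape[1]))
    return incoming, outgoing


def network(W_in: np.ndarray, W_out: np.ndarray, x: np.ndarray) -> np.ndarray:
    """A generic one-hidden-layer ReLU network, used only to check the split."""
    return W_out @ np.maximum(W_in @ x, 0.0)


rng = np.random.default_rng(0)
W_in = rng.normal(size=(5, 3))
W_out = rng.normal(size=(4, 5))
x = rng.normal(size=3)

# With zero noise, the split network computes exactly the same function.
W_in_split, W_out_split = split_neuron(W_in, W_out, k=2, noise_std=0.0, rng=rng)
unchanged = bool(np.allclose(network(W_in, W_out, x), network(W_in_split, W_out_split, x)))
```

Whether the two copies actually diverge towards different features after further training is exactly the open question raised above.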

In addition, it would be interesting to find ways to distinguish incidental polysemanticity from necessary polysemanticity.

Can we distinguish them based only on the final, trained state of the model, or do we need to know more about what happened during training?
Is “most” of the polysemanticity in real-world neural networks necessary or incidental? How does this depend on the architecture and the data?

A more realistic toy model

The setup we studied is simplistic in several ways. Some of these ways are without loss of (much) generality, such as the fact that encoding and decoding matrices are tied together,^[9] or the fact that the input features are basis vectors.^[10]

But there are also some choices that we made for simplicity which might be more significant, and which it would be nice to investigate. In particular:

We introduced a winner-takes-all dynamic using $l_{1}$ regularization. This made the study of sparsification quite nice, but $l_{1}$ regularization is not very commonly used, and is not the main cause of privileged bases in modern neural networks. It would be interesting to study winner-takes-all dynamics that occur for different reasons (such as nonlinear activation functions or layer normalization) and see if the dynamics of sparsification are similar.
In our setup, polysemanticity is benign because one feature can be represented with a positive activation and the other one with a negative activation, and this is disambiguated by the ReLU in the output layer. Polysemanticity can be benign for other reasons: for example, it could be that two features never occur in the same context and that the context can be used to disambiguate between them. We give a possible toy model for this scenario in a working note.

Gaps in the theory

We were able to give strong theoretical guarantees for the sparsification process by considering how the feature benefit force and regularization interact when interference is ignored, but we haven’t yet been able to make confident theoretical claims about how the three forces interact together.

In particular:

It would be nice to theoretically predict the likelihood that the largest coordinate in $W_{i}$ remains largest throughout training. This would require a better understanding of exactly how impactful malign collisions are when one of the weights involved is significantly larger than the other, and how much this affects the race between the largest weights in an encoding.
Perhaps an interesting case to study is the limit $λ \to 0$ (very slow regularization). This would make the interference work much faster than the regularization, and from the perspective of studying sparsification, this could allow us to assume that the inner products $W_{i} \cdot W_{j}$ are nonpositive throughout training.

Author contributions

Ideation: Victor initially proposed the project, and it took shape in discussions with Rylan.
Experiments: Kushal and Trevor ran the experiments which confirmed that incidental polysemanticity occurs and measured the number of polysemantic neurons. They were mentored by Victor and Rylan.
Theory: Victor drove the theory work and implemented the numerical simulations of sparsification, with significant help from Kushal.
Writing: This post was primarily written by Victor, with feedback from Kushal and Rylan.

[1] See e.g. the “Polysemantic Neurons” section in Zoom In: An Introduction to Circuits.
[2] When we say a neuron is correlated with a feature, what we more formally mean is that the neuron’s activation is correlated with whether the feature is present in the input (where the correlation is taken over the data points). But the former is easier to say.
[3] Analogous phenomena are known under other names, such as “privileged basis”.
[4] Depending mostly on the specifics of the neural architecture and the data (but also on the random initializations of the weights).
[5] Here, we define a “collision” as the event that two features $i$ and $j$ collide. So, for example, if there is a three-way collision between $i$, $j$ and $k$, that would count as three collisions: between $i$ and $j$, $i$ and $k$, and $j$ and $k$.
[6] We use $∥ \cdot ∥$ to denote Euclidean length ($l_{2}$ norm), and $∥ \cdot ∥_{1}$ to denote Manhattan length ($l_{1}$ norm).
[7] It’s equivalent to making $λ$ four times larger and making training time four times slower.
[8] This would not necessarily be the largest weight at initialization, since there might be significant collisions with other encodings, but the largest weight at initialization is still the most likely to win the race all things considered.
[9] We’re referring to the fact that the encoding matrix $W^{T}$ is forced to be the transpose of the decoding matrix $W$. This assumption makes sense because even if they were kept independent and initialized to different values, they would naturally acquire similar values over time because of the learning dynamics. Indeed, the $i^{th}$ column of the encoding matrix and the $i^{th}$ row of the decoding matrix “reinforce each other” through the feature benefit force until they have an inner product of $1$, and so, as long as they start out small or there is some weight decay, they would end up almost identical by the end of training.
[10] If the input features are not the canonical basis vectors but are still orthogonal (and the outputs are still basis vectors), then we could apply a fixed linear transformation to the encoding matrix and recover the same training dynamics. And in general it makes sense to consider orthogonal input features, because when the features themselves are not orthogonal (or at least approximately orthogonal), the question of what polysemanticity even is becomes more murky.