Superposition and Dropout

As part of the second Alignment Jam, I studied how dropout affects the phenomenon called superposition. Superposition is studied at length in the Transformer Circuits Thread, and more specifically in Toy Models of Superposition.

The question I tried to answer is whether introducing dropout has a noticeable effect in the extent to which superposition occurs in small toy models.

Better understanding and controlling superposition would have big consequences for alignment research, as reducing it allows for more interpretable models.

I find that dropout clearly has some effect on superposition. The effect is complex, but dropout generally inhibits superposition, except when features vary in importance (exponentially decaying in this case) and sparsity is low. I believe this is due to a model with dropout needing some initial redundancy in its representations, and to interference increasing disproportionately as this redundancy grows.

1 - Background

Dropout

Dropout is a simple regularisation technique that is often used (perhaps slightly less so nowadays) to reduce overfitting in neural networks. It consists of zeroing out each neuron of a given layer with a certain probability $p$. Intuitively, this encourages some level of redundancy in the network's representations.
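As a minimal sketch of what this means in practice (written in PyTorch, which is an assumption; the post does not tie itself to a framework), inverted dropout can be implemented as:

```python
import torch

def dropout(x: torch.Tensor, p: float = 0.4, training: bool = True) -> torch.Tensor:
    """Inverted dropout: zero each element with probability p and rescale the
    survivors by 1/(1-p), so the expected activation is unchanged."""
    if not training or p == 0.0:
        return x
    mask = (torch.rand_like(x) > p).float()
    return x * mask / (1.0 - p)

# torch.nn.Dropout(p) provides the same behaviour as a module.
```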

Superposition

If you’re not in a rush, read Toy Models Of Superposition, it’s great. If you are in a rush, here is a quick summary.

The phenomenon called superposition occurs when models learn to represent more features than they have dimensions by activating multiple neurons at the same time for a given feature. This happens more reliably when the features we are dealing with are sparse, that is, most of them aren’t present most of the time. This is something we can reasonably expect in real data. Take images for example: the space of possible features represented in an image is huge, but each individual image will contain a small subset of those features.

superposition image
This figure was taken from ‘Toy Models of Superposition’, Elhage et al.

Intuitively, the model learns that it can use the latent space more efficiently by packing these features almost, but not quite, orthogonally: it represents more of them at the cost of some interference between them. If the features aren’t sparse, however (i.e. they are always all present in the data), the model has less of an incentive to do this.

This figure was taken from ‘Toy Models of Superposition’, Elhage et al.

2 - Methods

We use the exact same setup as the Anthropic team, with the exception of dropout.

The models they study are simple autoencoders with a bottleneck hidden layer and ReLU activation functions. They can be described by the following equations:

$$h = \text{Dropout}_p(Wx)$$
$$x' = \text{ReLU}(W^T h + b)$$

where $x'$ is the reconstruction of the input $x$. Notice the introduction of dropout on the hidden bottleneck layer.
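A minimal sketch of this architecture (in PyTorch; the class name, initialisation, and defaults are my own, not taken from the original code):

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """W maps n_features -> n_hidden; the decoder reuses W^T plus a bias.
    Dropout is applied to the hidden bottleneck h."""
    def __init__(self, n_features: int = 5, n_hidden: int = 2, dropout_p: float = 0.0):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.dropout(x @ self.W.T)           # h = Dropout_p(W x)
        return torch.relu(h @ self.W + self.b)   # x' = ReLU(W^T h + b)
```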

The data the models are trained on is synthetic. Each $x_i$ is considered to be a feature, with an associated sparsity $S_i$ and importance $I_i$. Each $x_i$ is 0 with probability $S_i$, and otherwise sampled uniformly in $[0, 1]$. As in the original work, we consider a single value of sparsity $S$ for all the features at the same time. The importance $I_i$ scales the mean squared error loss we use to train the model for reconstruction.
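A sketch of the synthetic data and the importance-weighted reconstruction loss (again in PyTorch; the exponential decay rate used for the importances is illustrative, not taken from the experiments):

```python
import torch

def sample_batch(batch_size: int, n_features: int, sparsity: float) -> torch.Tensor:
    """Each x_i is 0 with probability S (the sparsity), otherwise uniform in [0, 1]."""
    x = torch.rand(batch_size, n_features)
    mask = (torch.rand(batch_size, n_features) > sparsity).float()
    return x * mask

def importance_weighted_mse(x_hat: torch.Tensor, x: torch.Tensor,
                            importance: torch.Tensor) -> torch.Tensor:
    """Reconstruction loss where feature i is weighted by its importance I_i."""
    return (importance * (x_hat - x) ** 2).mean()

# Exponentially decaying importances, e.g. I_i = 0.7 ** i (the rate 0.7 is an assumption).
importance = 0.7 ** torch.arange(20, dtype=torch.float32)
```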

These models are simple and small enough to be tractable and clearly interpretable, but at the same time complex enough (nonlinear) to achieve the representation task effectively.

3 - Results

We can visualise the features learned by the models with 2 latent dimensions. From yellow to green the colour represents increasing feature importance.
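These plots simply show the columns of $W$ as 2D vectors. A minimal sketch of how they can be produced (matplotlib is assumed, and the colour map is only an approximation of the one used in the figures):

```python
import matplotlib.pyplot as plt
import torch

def plot_features_2d(W: torch.Tensor, importance: torch.Tensor) -> None:
    """Plot each feature direction W_i (a column of the 2 x n_features matrix W),
    coloured from yellow (least important) to green (most important)."""
    W = W.detach()
    colours = plt.cm.summer_r((importance / importance.max()).numpy())
    for i in range(W.shape[1]):
        plt.plot([0.0, W[0, i].item()], [0.0, W[1, i].item()], color=colours[i])
    plt.gca().set_aspect("equal")
    plt.show()
```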

Feature embeddings for a toy model with 5 features in 2 dimensions, without dropout

As we would expect, in the vanilla toy model we observe that superposition increases with sparsity: in the dense case only the two most important features are represented, orthogonally, while in the sparsest case all 5 features are represented in a regular pentagon.

We can compare this to what the model learns with dropout.

Feature embeddings for a toy model with 5 features in 2 dimensions, dropout p=0.4

Here we can start to see that, while superposition still increases with sparsity, the model never goes beyond antipodal pairs and never represents all 5 features. Furthermore, the features are now basis-aligned with the neurons (this is because dropout induces a privileged basis).

If we look at larger models, we can see this effect even better. Here we train models with 20 features and a 5-dimensional bottleneck.

The bar charts show the norm of each feature’s direction vector, $\|W_i\|$, which will be ~1 if the feature is fully represented. The bars’ colour shows the extent to which each feature is in superposition (i.e. how far it is from being orthogonal to the other features), from blue (not at all) to yellow. Each row represents an increase in sparsity. Furthermore, we show the matrix $W^T W$ on the right; this is the transformation that our model applies to the data.
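A sketch of how these two per-feature quantities can be computed from the weight matrix. The superposition measure follows the one used in ‘Toy Models of Superposition’ (sum of squared overlaps with all other feature directions); treating that as the exact colouring used in the figures is an assumption.

```python
import torch

def feature_norms(W: torch.Tensor) -> torch.Tensor:
    """||W_i|| for each feature i (columns of the n_hidden x n_features matrix W)."""
    return W.norm(dim=0)

def superposition_measure(W: torch.Tensor) -> torch.Tensor:
    """Sum of squared dot products of each (normalised) feature direction with all
    the *other* feature directions: 0 when W_i is orthogonal to every other feature."""
    W_hat = W / (W.norm(dim=0, keepdim=True) + 1e-8)   # unit feature directions
    overlaps = (W_hat.T @ W) ** 2                      # (n_features, n_features)
    return overlaps.sum(dim=1) - overlaps.diagonal()   # exclude the j = i term
```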

visualisation of superposition and autoencoder weights

Once more, dropout hinders the occurrence of superposition, at least towards the higher end of sparsity. We can also note that dropout seems to encourage a ‘weak’ representation of the less important features, even in the dense case, which then leaves space for more superposition. This initial effect is only present in experiments where feature importance varies; with uniform importance, the effect of dropout is solely that of reducing superposition. In fact, despite this ‘resistance’ to superposition, the transition from one phase to the other is much more gradual, and present from the start.

We can also look at the number of hidden dimensions per feature represented in the model’s hidden space. This is computed in the same way as in Toy Models of Superposition, that is $D = \frac{m}{\|W\|_F^2}$, where $m$ is the number of hidden dimensions. The figure shows that dropout does in fact increase the number of dimensions used per feature. From this graph we can also conclude that dropout affects the geometric structures used to represent features. In future work, it will be interesting to study how dropout affects feature geometry and learning dynamics. A more peculiar aspect of this metric is that it shows a reduced number of dimensions per feature for dropout even in the case of no sparsity, which doesn’t correspond with the conclusions from the above bar charts. I suspect this has to do with how this metric responds to features that are not fully represented, i.e. with $\|W_i\|$ between 0 and 1.
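A sketch of this metric, assuming it is the ‘dimensions per feature’ quantity $m / \|W\|_F^2$ defined in ‘Toy Models of Superposition’:

```python
import torch

def dimensions_per_feature(W: torch.Tensor) -> float:
    """D = m / ||W||_F^2, where m is the number of hidden dimensions and
    ||W||_F is the Frobenius norm of the (n_hidden x n_features) weight matrix."""
    m = W.shape[0]
    return m / (W.norm() ** 2).item()
```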

Comparison of dimensions per represented feature across sparsity levels with and without dropout.

It is important to note, however, that performance in this model suffers, probably because $p = 0.4$ is quite aggressive.

performance comparison for no dropout (p=0) and p=0.4

If we reduce the amount of dropout to a more reasonable 0.15, we get something in the middle in terms of both performance and superposition. Even with more compute, it does not seem that we can get the dropout models to the same performance as the clean one: dropout is not simply ‘slowing down’ the training process. It remains to be seen whether this performance tradeoff is worth it on real-world tasks.

In fact, if we take a look at a model without dropout, trained to the same performance as the dropout model, it still exhibits much more superposition.

superposition visualisation for dropout p=0, trained to 0.05 mse loss

4 - Why?

Here I propose hypotheses for why dropout causes these effects. These are mostly speculative.

An intuitive explanation for why dropout might reduce superposition in the case of sparse features is that the more neurons a model uses to represent a given feature, the higher the chance that at least one of them gets perturbed by dropout. In fact, for a feature represented across $n$ neurons, the probability that none of them is perturbed is $(1-p)^n$. We only notice this effect in high-sparsity regimes, because only then do neurons become highly polysemantic.
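For instance, with $p = 0.4$ (the aggressive setting used above), the probability that a feature's representation survives a forward pass untouched shrinks quickly with the number of neurons it is spread over:

$$(1-p)^n = 0.6^n: \quad n=1 \Rightarrow 0.6, \qquad n=2 \Rightarrow 0.36, \qquad n=3 \Rightarrow 0.216.$$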

On the other hand, in the case of dense features, dropout seems to actually incentivise a small amount of superposition. This only occurs when we have features of varying importance, in this case exponentially decaying; in fact, the effect is noticeable even in dense linear models with dropout. The mechanism here is more complex. First of all, dropout forces the weights to share some of the representation with the biases, even for the most important features, which would otherwise be fully represented (because their hidden activations might be perturbed). In particular, for an input $x$ we have $x_i' = \text{ReLU}(\|W_i\|^2 x_i + \sum_{j \neq i} (W_i \cdot W_j)\, x_j + b_i)$. Even for the most important feature, where $\|W_i\| \approx 1$ and $b_i \approx 0$, we can always expect some perturbation caused by dropout to ‘damage’ the representation. Dropout on the hidden layer only affects the rows of $W$, so the model compensates by sharing some of the feature’s representation with $b$, which is not affected. Similarly, the interference term $\sum_{j \neq i} (W_i \cdot W_j)\, x_j$ is also initially reduced by dropout, as is the magnitude of $\|W_i\|$, so the model is less penalised for introducing superposition. Intuitively, the interference caused by a feature represented in superposition will only be large if neither of the features involved is perturbed.

5 - Future work

There are two main directions in which this should be extended.

First, it would be interesting to observe this effect in the real world. For example, can dropout applied to a vision model help us find more interpretable circuits and features? This should be relatively simple to try, given the amount of work that has already been done on vision models in the original Circuits Thread.

Second, on the more theoretical side, dropout seems to ‘slow down’ the phase transition between the full representation of a few features and superposition, to the point that it almost doesn’t look like a phase transition at all. Why is that the case, and how does it change the theoretical models of the phase transitions proposed in ‘Toy Models of Superposition’? In addition, a better formalisation of the arguments sketched in Section 4 should be sought.

Most likely, this will be the first in a series of posts.