Some costs of superposition

I don’t expect this post to contain anything novel. But from talking to others it seems like some of what I have to say in this post is not widely known, so it seemed worth writing.

In this post I’m defining superposition as: A representation with more features than neurons, achieved by encoding the features as almost orthogonal vectors in neuron space.

One reason to expect superposition in neural nets (NNs) is that for large $n$, $\mathbb{R}^n$ has many more than $n$ almost orthogonal directions. On the surface, this seems obviously useful for the NN to exploit. However, superposition is not magic. You don’t actually get to put in more information; the gain you get from having more feature directions has to be paid for some other way.
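To make the "almost orthogonal directions" claim concrete, here is a minimal sketch that draws more random unit vectors than there are dimensions and measures how much they overlap. The sizes `n` and `m` and the helper names are illustrative assumptions, and random directions are just a convenient stand-in, not an optimal packing.

```python
import numpy as np

def random_unit_vectors(m, n, seed=0):
    """Return m random unit vectors in R^n (one per row), drawn isotropically."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal((m, n))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def pairwise_overlaps(vectors):
    """Absolute cosine similarity for every distinct pair of rows."""
    gram = vectors @ vectors.T
    i, j = np.triu_indices(len(vectors), k=1)
    return np.abs(gram[i, j])

# Hypothetical sizes: four times more directions than dimensions.
n, m = 256, 1024
overlaps = pairwise_overlaps(random_unit_vectors(m, n))
print(f"typical |cos|: {overlaps.mean():.3f}, worst |cos|: {overlaps.max():.3f}")
```

The typical overlap should come out on the order of $1/\sqrt{n}$, so the directions are far from exactly orthogonal but close enough to be told apart.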

All the math in this post is very hand-wavy. I expect it to be approximately correct, to within an order of magnitude, but not precisely correct.

Sparsity

One cost of superposition is feature activation sparsity. I.e., even though you get to have many possible features, you only get to have a few of those features simultaneously active.

(I think the restriction of sparsity is widely known; I mainly include this section because I’ll need the sparsity math for the next section.)

In this section we’ll assume that each feature of interest is a boolean, i.e. it’s either turned on or off. We’ll investigate how much we can weaken this assumption in the next section.

If you have $m$ features represented by $n$ neurons, with $m > n$, then you can’t have all the features represented by orthogonal vectors. This means that an activation of one feature will cause some noise in the activation of other features.

The typical noise on feature $f_1$ caused by 1 unit of activation from feature $f_2$, for any pair of features $f_1$, $f_2$, is (derived from the Johnson–Lindenstrauss lemma)

$$\epsilon = \sqrt{\frac{8 \ln(m)}{n}}$$ [1]

Here $m$ is the number of features and $n$ is the number of neurons.

If $l$ features are active then the typical noise level on any other feature will be approximately $\epsilon \sqrt{l}$ units. This is because the individual noise terms add up like a random walk. Or see here for an alternative explanation of where the square root comes from.

For the signal to be stronger than the noise we need $\epsilon \sqrt{l} < 1$, and preferably $\epsilon \sqrt{l} \ll 1$.

This means that we can have at most $l < \frac{n}{8 \ln(m)}$ simultaneously active features, and $m < \exp\left(\frac{n}{8 l}\right)$ possible features.
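The random-walk claim can be checked numerically. The sketch below assumes features are random unit vectors (my simplification, so the absolute noise level will not match a worst-case Johnson–Lindenstrauss bound) and measures the readout on a feature that is off while `l` other features are on at 1 unit each. The function name and sizes are hypothetical.

```python
import numpy as np

def readout_noise(n, m, l, trials=1000, seed=0):
    """RMS interference on an inactive feature when l other features are active."""
    rng = np.random.default_rng(seed)
    features = rng.standard_normal((m, n))
    features /= np.linalg.norm(features, axis=1, keepdims=True)
    errors = []
    for _ in range(trials):
        # Pick l active features plus one extra feature to probe.
        idx = rng.choice(m, size=l + 1, replace=False)
        active, probe = idx[:l], idx[l]
        activation = features[active].sum(axis=0)    # superposed activation vector
        errors.append(features[probe] @ activation)  # readout of a feature that is off
    return float(np.sqrt(np.mean(np.square(errors))))

# Hypothetical sizes, chosen only for illustration.
n, m = 512, 2048
noise_1 = readout_noise(n, m, l=1)
noise_16 = readout_noise(n, m, l=16)
print(f"l=1: {noise_1:.4f}, l=16: {noise_16:.4f}, ratio: {noise_16 / noise_1:.2f}")
```

If the noise terms add like a random walk, going from 1 to 16 active features should increase the noise by roughly a factor of $\sqrt{16} = 4$, not 16.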

Boolean-ish

The other cost of superposition is that you lose expressive range for your activations, making them more like booleans than like floats.

In the previous section, we assumed boolean features, i.e. the feature is either on (1 unit of activation + noise) or off (0 units of activation + noise), where “one unit of activation” is some constant. Since the noise is proportional to the activation, it doesn’t matter how large “one unit of activation” is, as long as it’s consistent between features.

However, what if we want to allow for a range of activation values?

Let’s say we have $n$ neurons, $m$ possible features, at most $l$ simultaneous features, with at most $a$ activation amplitude. Then we need to be able to deal with noise of the level

$$a \epsilon \sqrt{l} = a \sqrt{\frac{8 l \ln(m)}{n}}$$

The number of activation levels the neural net can distinguish between is at most the max amplitude divided by the noise.

$$\frac{a}{a \epsilon \sqrt{l}} = \sqrt{\frac{n}{8 l \ln(m)}}$$

Any more fine grained distinction will be overwhelmed by the noise.
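To get a feel for the numbers, here is a small calculator. It assumes the Johnson–Lindenstrauss-style estimate $\epsilon = \sqrt{8 \ln(m) / n}$ for the pairwise interference and a noise level of $\epsilon \sqrt{l}$ per unit of amplitude. The function names and the example sizes are hypothetical, and the results are only meant to be order-of-magnitude.

```python
import math

def jl_epsilon(n, m):
    """Assumed pairwise interference: epsilon = sqrt(8 ln(m) / n)."""
    return math.sqrt(8 * math.log(m) / n)

def distinguishable_levels(n, m, l):
    """Max amplitude divided by noise; the amplitude itself cancels out."""
    return 1.0 / (jl_epsilon(n, m) * math.sqrt(l))

def max_active_features(n, m):
    """Bound on l from requiring signal > noise: l < n / (8 ln(m))."""
    return n / (8 * math.log(m))

# Hypothetical layer width and number of possible features.
n, m = 4096, 100_000
print("max simultaneously active features:", round(max_active_features(n, m), 1))
for l in (1, 4, 16, 44):
    print(f"l={l:3d}: about {distinguishable_levels(n, m, l):.2f} distinguishable levels")
```

Under these assumptions the number of distinguishable levels falls off as $1/\sqrt{l}$ and reaches roughly 1 (purely boolean) as $l$ approaches its maximum.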

As we get closer to maxing out the number of simultaneously active features $l$ and the number of possible features $m$, the signal to noise ratio gets smaller, meaning we can distinguish between fewer and fewer activation levels, making the features more and more boolean-ish.

This does not necessarily mean the network encodes values in discrete steps. Feature encodings should probably still be seen as inhabiting a continuous range but with reduced range and precision (except in the limit of maximum superposition, when all feature values are boolean). This is similar to how floats for most intents and purposes should be seen as continuous numbers, but with limited precision. Only here, the limited precision is due to noise instead of encoding precision.

My guess is that the reason you can reduce the float precision of NNs without suffering much inference loss [citation needed] is that the noise level, not the encoding precision, is the limiting factor.

Compounding noise

In a multi-layer neural network, it’s likely that the noise will grow for each layer, unless this is solved by some error correction mechanism. There is probably a way for NNs to deal with this, but I currently don’t know what this mechanism would be, or how much of the NN’s activation space and computation will have to be allocated to deal with this.

I think that figuring out this cost of having to do error correction is very relevant for whether or not we should expect superposition to be common in neural networks.

In practice I still think Superposition is a win (probably)

In theory you get to use fewer bits of information in the superposition framework. The reason is that you only get to use a ball inside neuron activation space (or the interference gets too large) instead of the full hypervolume.

However, I still think superposition lets you store more information in most practical applications. A lot of information about the world is more boolean-ish than float-ish. A lot of information about your current situation will be sparse, i.e. most things that could be present are not present.

The big unknown is the issue of compounding noise. I don’t know the answer to this, but I know others are working on it.

Acknowledgement

Thanks to Lucius Bushnaq, Steven Byrnes and Robert Cooper for helpful comments on the draft of this post.

  1. ^

    In the Johnson–Lindenstrauss lemma, $\epsilon$ is the error in the length of vectors, not the error in orthogonality; however, for small $\epsilon$, they should be similar.

    Doing the math more carefully, we find that

    where is the angle between two almost orthogonal features.

    This is a worst case scenario. I have not calculated the typical case; I expect it to be somewhat less, but still the same order of magnitude, which is why I feel OK with using just $\epsilon$ for the typical error in this blogpost.