This seems deeply connected to Modern Hopfield Networks, which achieve exponential memory capacity in the number of dimensions, compared to the linear capacity of classical Hopfield networks. The key is the softmax nonlinearity inserted between the similarity and projection steps of the memory-retrieval mechanism, which seems like an obvious extension of the original model in hindsight. There is also a close mathematical correspondence between this retrieval update and the self-attention layers used in Transformers.
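To make the similarity → softmax → projection retrieval concrete, here's a minimal sketch of a Modern-Hopfield-style update in NumPy. The stored patterns, dimensions, and inverse temperature `beta` are all illustrative choices, not anything from a specific paper or library:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 256, 50                       # dimensions, number of stored patterns
X = rng.standard_normal((D, N))
X /= np.linalg.norm(X, axis=0)       # unit-norm stored patterns as columns

def retrieve(query, beta=8.0):
    # similarity step: inner products of the query with every stored pattern
    sims = X.T @ query
    # softmax nonlinearity between similarity and projection
    p = np.exp(beta * sims - np.max(beta * sims))
    p /= p.sum()
    # projection step: convex combination of the stored patterns
    return X @ p

# query with a noisy version of stored pattern 7; one update recovers it
noisy = X[:, 7] + (0.3 / np.sqrt(D)) * rng.standard_normal(D)
noisy /= np.linalg.norm(noisy)
out = retrieve(noisy)
print(np.argmax(X.T @ out))
```

This is exactly the attention shape: `sims` plays the role of query–key scores, the softmax sharpens them, and the projection mixes the value vectors; with one stored-pattern matrix serving as both keys and values, one retrieval update is one attention step.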
What you’re looking at is also closely related to the near-orthogonality of random vectors in high-dimensional space, a key principle behind hyperdimensional computing / vector-symbolic architectures. So-called hypervectors (which may be binary, bipolar, real-valued, complex-valued, etc.) can be combined via superposition, binding, and permutation operations into interpretable data structures living in the same high-dimensional space as the elemental hypervectors. The ability to combine items into a superposition and later extract them from it is key to the performance of these models.
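A tiny sketch of binding and superposition with bipolar hypervectors, in the style of the multiply-add-permute family of VSAs (the role/filler names and dimension are illustrative, not from any particular reference implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 10_000

def hv():
    # random bipolar hypervector: each component is ±1
    return rng.choice([-1, 1], size=D)

# elemental hypervectors: roles and fillers
color, shape = hv(), hv()
red, circle = hv(), hv()

# bind each role to its filler (elementwise multiply), then superpose (sum)
record = color * red + shape * circle

# unbind: multiplying by a role vector turns its filler back into a
# recognizable vector; the other bound pair becomes pseudo-random noise
probe = record * color

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(probe, red), cos(probe, circle))
```

The probe has high cosine similarity with `red` (around 1/√2 here, since the unbound `shape * circle` term contributes noise) and near-zero similarity with everything else, which is the "extract from superposition" step in miniature.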
As the dimensionality D of your space increases, the standard deviation of the distribution of inner products between pairs of unit vectors sampled uniformly from the sphere falls off as σ = 1/√D.
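This 1/√D falloff is easy to check empirically. A quick Monte Carlo sketch (sample sizes and seed are arbitrary), using the fact that normalized Gaussian vectors are uniform on the sphere:

```python
import numpy as np

rng = np.random.default_rng(2)
for D in (100, 1_000, 10_000):
    # pairs of random unit vectors: normalized Gaussian samples
    # are uniformly distributed on the unit sphere
    a = rng.standard_normal((5_000, D))
    b = rng.standard_normal((5_000, D))
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    # rowwise inner products of the 5,000 pairs
    dots = np.einsum('ij,ij->i', a, b)
    print(D, dots.std(), 1 / np.sqrt(D))
```

Each row of output shows the empirical standard deviation of the inner products sitting right next to the predicted 1/√D.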
In other words, for any given threshold of “near-orthogonality”, the probability that two randomly sampled (hyper)vectors have an inner product smaller in absolute value than that threshold approaches certainty as the number of dimensions grows. A 1,000-dimensional space effectively behaves like a million-dimensional one in terms of the number of nearly orthogonal vectors you can combine in superposition and still tease apart.
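A small sketch of teasing items back out of a superposition: superpose K hypervectors drawn from a codebook much larger than K, then detect membership by thresholding inner products. The sizes and the 0.5 threshold are illustrative; a member scores ≈ 1 while a non-member scores ≈ 0, with crosstalk noise of order √K/√D:

```python
import numpy as np

rng = np.random.default_rng(3)
D, N, K = 10_000, 1_000, 50          # dimensions, codebook size, items superposed
# unit-norm bipolar hypervectors as codebook rows
codebook = rng.choice([-1.0, 1.0], size=(N, D)) / np.sqrt(D)

members = rng.choice(N, size=K, replace=False)
bundle = codebook[members].sum(axis=0)  # superposition of K codebook entries

# inner product with a member ≈ 1; with a non-member ≈ N(0, sqrt(K/D))
scores = codebook @ bundle
decoded = np.flatnonzero(scores > 0.5)
print(sorted(decoded) == sorted(members))
```

With D = 10,000 the crosstalk noise has standard deviation √(50/10,000) ≈ 0.07, so the 0.5 threshold separates members from non-members by about 7σ and the decoding is essentially exact.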