Thought while doing Transformers from Scratch:
Inside a transformer block, the MLP embeds its input into a higher-dimensional space, applies an activation function, and then projects back down to the original, lower-dimensional space. From a semantic-distribution perspective, here are three intuitions for why this makes sense:
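That expand–activate–project shape can be sketched in a few lines of NumPy. The 4x expansion factor and the ReLU are assumptions for illustration (real models vary in both):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32                 # hidden width 4x d_model, a common choice

W_up = rng.normal(scale=0.02, size=(d_model, d_hidden))
W_down = rng.normal(scale=0.02, size=(d_hidden, d_model))

def relu(x):
    return np.maximum(0.0, x)

def mlp(x):
    h = relu(x @ W_up)                    # embed into the higher-dimensional space, activate
    return h @ W_down                     # project back down to d_model

x = rng.normal(size=(5, d_model))         # 5 token vectors
y = mlp(x)
print(y.shape)                            # (5, 8)
```

Biases and layer norm are omitted to keep the expand–activate–project skeleton visible.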
The more dimensions you have, the more complicated the “knots” you can untangle. In 2D you can pull the centre out of a line; in 3D, out of a disk; in 4D, out of a ball; and so on.
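A small concrete version of “pulling the centre out”: a point at the origin surrounded by a ring cannot be separated from the ring by any straight line in 2D, but lifting into a third dimension makes it trivial. The radial feature z = x² + y² below is hand-picked for the demo; an MLP would have to learn such a lift:

```python
import numpy as np

# A ring of radius-1 points surrounding a single point at the origin.
theta = np.linspace(0, 2 * np.pi, 16, endpoint=False)
ring = np.stack([np.cos(theta), np.sin(theta)], axis=1)
centre = np.zeros((1, 2))

def lift(p):
    # Append z = x^2 + y^2 as a third coordinate.
    return np.concatenate([p, (p ** 2).sum(axis=1, keepdims=True)], axis=1)

ring3, centre3 = lift(ring), lift(centre)
# In 3D the flat plane z = 0.5 cleanly separates centre (z = 0) from ring (z = 1).
print(ring3[:, 2].min(), centre3[0, 2])
```

The extra dimension gives the separating surface room that simply does not exist in the original space.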
Each dimension of the activation space acts like an independent fold applied to the semantic space. The more folds you have, the more ways you can transform the semantics.
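The fold picture can be made literal with ReLU, at least in this toy reading: a pair of ReLU units with opposite weights computes |w·x|, which folds the input space across the hyperplane w·x = 0, making mirrored points indistinguishable downstream. The specific weight vector here is an arbitrary choice for illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

w = np.array([1.0, -1.0])             # fold across the line x = y (arbitrary choice)

def folded(x):
    s = x @ w
    return relu(s) + relu(-s)         # equals abs(s): a fold along w . x = 0

# Mirror images across the fold line map to the same value.
print(folded(np.array([2.0, 1.0])))   # 1.0
print(folded(np.array([1.0, 2.0])))   # 1.0
```

Each additional hidden dimension contributes one more independent fold direction the network can choose.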
The embedding space can be thought of as partitioned into independent copies of the input space, each transformed separately. The higher the dimension of the embedding space, the more copies of larger subspaces of the input (up to and including the entire input space) can be transformed independently.
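This partition view can be sketched directly: stacking several independent linear maps side by side in the up-projection places each transformed copy of the full input space in its own disjoint block of the hidden dimensions. The sizes below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, copies = 4, 3
blocks = [rng.normal(size=(d_in, d_in)) for _ in range(copies)]

# The 12-dim hidden space is partitioned into three 4-dim blocks,
# each an independently transformed copy of the whole input space.
W_up = np.concatenate(blocks, axis=1)       # shape (4, 12)

x = rng.normal(size=d_in)
h = x @ W_up
for i, B in enumerate(blocks):
    assert np.allclose(h[i * d_in:(i + 1) * d_in], x @ B)
print(h.shape)                              # (12,)
```

A learned up-projection is of course not constrained to be block-structured like this; the block form just makes the “independent copies” reading explicit.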