I did some quick calculations for what the MSE per feature should be for compressed storage, i.e. storing T features in D dimensions where T > D.
I assume every feature is on with probability p; an on feature equals 1, an off feature equals 0. MSE is the mean squared error of a linear readout of the features.
For random embeddings (superposition):
mse_r = Tp/(T+D)
If instead the D neurons are used to embed D of the features exactly, with the readout for the rest set to the constant value p:
mse_d = p(1-p)(T-D)/T
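Written out as a tiny helper (this just restates the two formulas above, under the same assumptions):

```python
def mse_random(T: int, D: int, p: float) -> float:
    # Per-feature MSE for a random (superposition) embedding with linear readout.
    return T * p / (T + D)

def mse_dedicated(T: int, D: int, p: float) -> float:
    # Per-feature MSE when D of the features are stored exactly and the
    # remaining T - D are read out as the constant p.
    return p * (1 - p) * (T - D) / T
```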
This suggests we should see a transition between these two types of embeddings where mse_r = mse_d, i.e. when T^2/D^2 = (1-p)/p.
For T=100 and D=50, this means p = 0.2.
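A rough Monte Carlo sketch of the random-embedding case, if you want to sanity-check mse_r numerically (assumptions: unit-norm random embedding vectors, i.i.d. Bernoulli(p) features, and a single fitted readout scale rather than a full learned decoder):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, p, n_samples = 100, 50, 0.2, 20_000

# Rows of W are the embedding vectors u_i (random, unit norm).
W = rng.standard_normal((T, D))
W /= np.linalg.norm(W, axis=1, keepdims=True)

x = (rng.random((n_samples, T)) < p).astype(float)  # features on with probability p
h = x @ W                                            # compressed D-dimensional activations
y = h @ W.T                                          # raw linear readout u_i . h

c = (x * y).sum() / (y * y).sum()                    # best single readout scale (least squares)
print("empirical mse per feature:", ((c * y - x) ** 2).mean())
print("mse_r = Tp/(T+D):", T * p / (T + D))
```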
The model in this post is doing a bit more than just embedding features, but I don't think it can do better than the most effective embedding of the T output features in the D neurons.
mse_r only depends on E[(u . v)^2] = 1/D, where u and v are different embedding vectors. Lots of embeddings have this property, e.g. embedding features along random basis vectors, i.e. assigning each feature to a single random neuron. This will result in some embedding vectors being exactly identical, but the MSE (L2) loss is equally happy with this as with random (almost orthogonal) feature directions.
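As a quick check of that claim (a sketch, assuming D = 50 and independently drawn pairs of vectors), both random directions and random one-hot assignments give E[(u . v)^2] close to 1/D:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_pairs = 50, 200_000

# Random directions on the sphere.
u = rng.standard_normal((n_pairs, D))
v = rng.standard_normal((n_pairs, D))
u /= np.linalg.norm(u, axis=1, keepdims=True)
v /= np.linalg.norm(v, axis=1, keepdims=True)
print(((u * v).sum(axis=1) ** 2).mean(), "vs", 1 / D)

# Random one-hot assignment: each feature gets one random neuron, so
# (u . v)^2 is 1 with probability 1/D and 0 otherwise.
i = rng.integers(0, D, n_pairs)
j = rng.integers(0, D, n_pairs)
print((i == j).mean(), "vs", 1 / D)
```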