Thanks for reading, and yeah, I was also surprised by how well it does. There does seem to be some degradation in auto-encoding from the translation objective, but I would guess it also gives the embedding space some nicer properties.
I bet if you add Gaussian noise to them they still decode fine
I did try some small tests of how sensitive the Sonar model is to noise, and it seems OK. Adding Gaussian noise, the decoding started breaking once the noise norm exceeded roughly 0.5x the original vector's norm, or once the cosine similarity with the original dropped below about 0.9, but I haven't tested this deeply, and it seemed to depend a lot on the text.
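For reference, the kind of test I mean is roughly the following, following the text pipelines from the SONAR repo's README (exact pipeline/model names may differ by version, and the noise scaling here is just one reasonable choice):

```python
import torch
from sonar.inference_pipelines.text import (
    TextToEmbeddingModelPipeline,
    EmbeddingToTextModelPipeline,
)

# Encoder (text -> 1024-dim vector) and decoder (vector -> text) pipelines.
t2vec = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder", tokenizer="text_sonar_basic_encoder"
)
emb2text = EmbeddingToTextModelPipeline(
    decoder="text_sonar_basic_decoder", tokenizer="text_sonar_basic_encoder"
)

text = ["The quick brown fox jumps over the lazy dog."]
emb = t2vec.predict(text, source_lang="eng_Latn")  # shape (1, 1024)

for scale in [0.1, 0.25, 0.5, 1.0]:
    # Gaussian noise with expected norm ~= scale * ||emb||.
    noise = torch.randn_like(emb) * emb.norm() * scale / emb.shape[-1] ** 0.5
    noisy = emb + noise
    cos = torch.nn.functional.cosine_similarity(emb, noisy).item()
    decoded = emb2text.predict(noisy, target_lang="eng_Latn", max_seq_len=512)
    print(f"noise {scale:.2f}x  cos {cos:.3f}  ->  {decoded[0]}")
```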
There also appears to be a way to attempt to use this to enhance model capabilities
In Meta's newer "Large Concept Model" paper they do seem to manage to train a model solely on Sonar vectors, though I think they also fine-tune the Sonar model to get better results (here is a draft distillation I did. EDIT: decided to post it). It seems to have some benefits (processing long contexts becomes much easier), though they don't test on many standard benchmarks, and it doesn't seem much better than LLMs on those.
The SemFormers paper linked also, I think, tries to do some kind of "explicit planning" with a text auto-encoder, but I haven't read it too deeply yet. From a brief skim, it seemed to get better at graph traversal or something like that.
There are probably other things people will try, hopefully some that help make models more interpretable.
can we extract semantic information from this 1024-dimensional embedding vector in any way substantially more efficient than actually decoding it and reading the output?
Yeah, I would like there to be a good way of doing this in the general case. So far I haven't come up with any ideas that aren't variations on "train a classifier probe". If you have a sufficiently good classifier-probe setup you might be fine, but it doesn't feel to me like something that works in the general case. I think there is a lot of room for people to try things though.
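As a concrete baseline, the "classifier probe" version is just a linear probe on the 1024-dim vectors; a minimal sketch (assuming `sentences` and `labels` already exist, and reusing the `t2vec` encoder pipeline from the noise-test sketch above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Embed the sentences and fit a linear probe for some binary attribute.
X = t2vec.predict(sentences, source_lang="eng_Latn").cpu().numpy()  # (n, 1024)
y = np.asarray(labels)  # (n,), e.g. 1 = "sentence mentions food"

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```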
I wonder how much information there is in those 1024-dimensional embedding vectors… [Is there] a natural way to encode more tokens
I don't think there is any explicit reason to limit it to 512 tokens, but I guess it depends how much "detail" needs to be stored. In the Large Concept Models paper, the text-segmentation experiments did seem to degrade beyond roughly 250 characters, but they only test n-gram BLEU scores.
I also guess that if you had an iterative correction loop set up like in the vec2text inversion paper, you could probably do a good job of getting even more accurate reconstructions from the model.
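I haven't actually tried this, but the outer loop would look something like "decode, re-encode, compare against the target embedding". A rough sketch (not the vec2text method itself, just a naive candidate-reranking stand-in, reusing the pipelines from the noise-test sketch above):

```python
import torch

def rerank_reconstruction(target_emb, n_candidates=8, jitter=0.05):
    """Decode several candidates, re-encode them, and keep the closest one."""
    best_text, best_cos = None, -1.0
    for i in range(n_candidates):
        # Candidate 0 decodes the target directly; later candidates add a
        # little jitter so the decoder produces varied hypotheses.
        cand = target_emb if i == 0 else target_emb + jitter * torch.randn_like(target_emb)
        text = emb2text.predict(cand, target_lang="eng_Latn", max_seq_len=512)[0]
        re_emb = t2vec.predict([text], source_lang="eng_Latn")
        cos = torch.nn.functional.cosine_similarity(target_emb, re_emb).item()
        if cos > best_cos:
            best_text, best_cos = text, cos
    return best_text, best_cos
```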
Exploring this embedding space seems super interesting
Yeah, I agree. While it is probably imperfect, it seems like an interesting basis to build on.