Since it was kind of a pain to get running, I'm sharing these probably minimally interesting results. I tried encoding this paragraph from my comment:
I wonder how much information there is in those 1024-dimensional embedding vectors. I know you can jam an unlimited amount of data into infinite-precision floating point numbers, but I bet if you add Gaussian noise to them they still decode fine, and the magnitude of noise you can add before performance degrades would allow you to compute how many effective bits there are. (Actually, do people use this technique on latents in general? I’m sure either they do or they have something even better; I’m not a supergenius and this is a hobby for me, not a profession.) Then you could compare to existing estimates of text entropy, and depending on exactly how the embedding vectors are computed (they say 512 tokens of context but I haven’t looked at the details enough to know if there’s a natural way to encode more tokens than that; I remember some references to mean pooling, which would seem to extend to longer text just fine?), compare these across different texts.
with SONAR, breaking it up like this:
sentences = [
    'I wonder how much information there is in those 1024-dimensional embedding vectors.',
    'I know you can jam an unlimited amount of data into infinite-precision floating point numbers, but I bet if you add Gaussian noise to them they still decode fine, and the magnitude of noise you can add before performance degrades would allow you to compute how many effective bits there are.',
    '(Actually, do people use this technique on latents in general? I\'m sure either they do or they have something even better; I\'m not a supergenius and this is a hobby for me, not a profession.)',
    'Then you could compare to existing estimates of text entropy, and depending on exactly how the embedding vectors are computed (they say 512 tokens of context but I haven\'t looked at the details enough to know if there\'s a natural way to encode more tokens than that;',
    'I remember some references to mean pooling, which would seem to extend to longer text just fine?), compare these across different texts.',
]
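For reference, the round trip looked roughly like this; a sketch following the pipeline classes and model card names in the SONAR README, so treat the exact names as assumptions rather than the precise code I ran:

from sonar.inference_pipelines.text import (
    TextToEmbeddingModelPipeline,
    EmbeddingToTextModelPipeline,
)

# Encode each sentence into a single 1024-dim SONAR embedding.
t2vec_model = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder",
    tokenizer="text_sonar_basic_encoder",
)
embeddings = t2vec_model.predict(sentences, source_lang="eng_Latn")

# Decode the embeddings back into text.
vec2text_model = EmbeddingToTextModelPipeline(
    decoder="text_sonar_basic_decoder",
    tokenizer="text_sonar_basic_encoder",
)
reconstructed = vec2text_model.predict(embeddings, target_lang="eng_Latn", max_seq_len=512)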
and after decoding, I got this:
['I wonder how much information there is in those 1024-dimensional embedding vectors.',
'I know you can encode an infinite amount of data into infinitely precise floating-point numbers, but I bet if you add Gaussian noise to them they still decode accurately, and the amount of noise you can add before the performance declines would allow you to calculate how many effective bits there are.',
"(Really, do people use this technique on latent in general? I'm sure they do or they have something even better; I'm not a supergenius and this is a hobby for me, not a profession.)",
"And then you could compare to existing estimates of text entropy, and depending on exactly how the embedding vectors are calculated (they say 512 tokens of context but I haven't looked into the details enough to know if there's a natural way to encode more tokens than that;",
'I remember some references to mean pooling, which would seem to extend to longer text just fine?), compare these across different texts.']
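As an aside, the noise probe from the quoted paragraph would be easy to try from here. A minimal sketch, reusing embeddings and vec2text_model from the round trip above; the noise scales are guesses, and this only illustrates the idea rather than giving a real bits estimate:

import torch

# Add Gaussian noise of increasing scale and watch when decoding degrades.
# The scale at which reconstructions break down would bound the "effective
# bits", in the spirit of a Gaussian channel capacity argument.
for sigma in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]:
    noisy = embeddings + sigma * torch.randn_like(embeddings)
    decoded = vec2text_model.predict(noisy, target_lang="eng_Latn", max_seq_len=512)
    print(f"sigma={sigma}: {decoded[0]!r}")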
Can we do semantic arithmetic here?
sentences = [
    'A king is a male monarch.',
    'A bachelor is an unmarried man.',
    'A queen is a female monarch.',
    'A bachelorette is an unmarried woman.',
]
...
pp(reconstructed)
['A king is a male monarch.',
'A bachelor is an unmarried man.',
'A queen is a female monarch.',
'A bachelorette is an unmarried woman.']
...
# Recall: embeddings[0] = king, [1] = bachelor, [2] = queen, [3] = bachelorette.
new_embeddings[0] = embeddings[0] + embeddings[3] - embeddings[1]  # king + bachelorette - bachelor -> queen?
new_embeddings[1] = embeddings[0] + embeddings[3] - embeddings[2]  # king + bachelorette - queen -> bachelor?
new_embeddings[2] = embeddings[1] + embeddings[2] - embeddings[0]  # bachelor + queen - king -> bachelorette?
new_embeddings[3] = embeddings[1] + embeddings[2] - embeddings[3]  # bachelor + queen - bachelorette -> king?
reconstructed = vec2text_model.predict(new_embeddings, target_lang="eng_Latn", max_seq_len=512)
pp(reconstructed)
['A kingwoman is a male monarch.',
"A bachelor's is a unmarried man.",
'A bachelorette is an unmarried woman.',
'A queen is a male monarch.']
Nope. Interesting, though. Actually, the third one (bachelor + queen - king) did work: it decoded exactly to the bachelorette sentence.
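One way to score these hits and misses would be to check which original embedding each arithmetic result lands nearest to; a quick sketch, assuming cosine similarity is a sensible metric in SONAR space:

import torch.nn.functional as F

labels = ['king', 'bachelor', 'queen', 'bachelorette']
# Pairwise cosine similarity between each arithmetic result and each original.
sims = F.cosine_similarity(new_embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)
for i, row in enumerate(sims):
    print(f"result {i}: nearest original is {labels[int(row.argmax())]}")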
OK, I’ll stop here, otherwise I’m at risk of going on forever. But this seems like a really cool playground.
Yeah, it was annoying to get working. I've now added a Google Colab in case anyone else wants to try anything.
It does seem interesting that the semantic arithmetic is hit or miss (mostly miss).