I’m pretty optimistic, based on research like this, that this is possible. My understanding is that we have trouble doing this for whales because we have very few examples, but if the aliens are helpfully providing us a huge data set, that would help a lot.
So I could imagine a few approaches:
Train one LLM on both data sets and see if the magic of generalization lets you just ask it, “What does this ‘[alien tokens here]’ mean?”
Or inspect that joint model’s embeddings and use them to translate.
Or train one LLM on each data set, then align the two embedding spaces and use that mapping to translate (a sketch of this follows the list).
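For the alignment idea, here’s a minimal sketch using orthogonal Procrustes from SciPy. Everything specific is an assumption: the anchor concepts (things like astronomical objects we’d guess both corpora mention), the equal embedding dimensions, and the random matrices standing in for real model embeddings.

```python
# Minimal sketch: align two embedding spaces with orthogonal Procrustes.
# Hypothetical throughout: the anchor pairs, the dimensions, and the
# assumption that the two spaces are related by a rotation at all.
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
d = 64          # embedding dimension (assumed equal for both models)
n_anchors = 50  # concepts we guess both corpora share (e.g. astronomy)

# Stand-in embeddings: rows are anchor concepts in each model's space.
human_emb = rng.normal(size=(n_anchors, d))
true_rotation = np.linalg.qr(rng.normal(size=(d, d)))[0]
alien_emb = human_emb @ true_rotation + 0.01 * rng.normal(size=(n_anchors, d))

# Solve for the rotation R minimizing ||alien_emb @ R - human_emb||_F.
R, _ = orthogonal_procrustes(alien_emb, human_emb)

def translate(alien_vec, human_vocab_emb):
    # Map an alien embedding into human space, then take the nearest
    # human token by cosine similarity.
    mapped = alien_vec @ R
    sims = human_vocab_emb @ mapped / (
        np.linalg.norm(human_vocab_emb, axis=1) * np.linalg.norm(mapped))
    return int(np.argmax(sims))  # index of the nearest human token

print(translate(alien_emb[0], human_emb))  # should recover anchor 0
```

In reality you’d probably want an unsupervised aligner, since we can’t assume shared anchors; that’s harder, but it has been demonstrated across human languages.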
You might get weird translations if the aliens perceive things differently. If their primary sense is smell, say, the LLM might map smell concepts onto vision or something like that, but I think it’s plausible you’d get a translation that’s at least useful.
Another weird issue would be tokenization. If they send us a raw analog waveform, we’d have to use an audio-style model, which would be harder. If it’s digital, that’s easier, but we’d probably have to guess where the token boundaries are. I imagine we could just try different numbers of bits per token until we get a model that fits well (toy version below); in theory you could run a transformer on raw bits, it would just be slow.
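Here’s a toy version of that sweep. Instead of training a transformer per candidate width, it scores each width with a Laplace-smoothed bigram model’s held-out cross-entropy per raw bit and keeps whichever compresses best. The bitstream is fabricated (4-bit one-hot symbols, so structure exists at a 4-bit grain), and it sidesteps the phase question of where the first token starts.

```python
import math
import random
from collections import Counter

# Fabricated stand-in for the received transmission: random 4-bit
# one-hot "symbols", so the stream has structure at a 4-bit grain.
# A real run would replace `bits` with the recorded data.
random.seed(0)
symbols = ["0001", "0010", "0100", "1000"]
bits = "".join(random.choice(symbols) for _ in range(5000))

def bits_per_bit(stream, width):
    """Held-out cross-entropy (bits per raw bit) of a Laplace-smoothed
    bigram model over fixed-width tokens."""
    tokens = [stream[i:i + width]
              for i in range(0, len(stream) - width + 1, width)]
    train, held = tokens[: len(tokens) // 2], tokens[len(tokens) // 2:]
    unigram = Counter(train)
    bigram = Counter(zip(train, train[1:]))
    total = 0.0
    for prev, cur in zip(held, held[1:]):
        p = (bigram[(prev, cur)] + 1) / (unigram[prev] + 2 ** width)
        total += -math.log2(p)
    return total / ((len(held) - 1) * width)

for width in (2, 3, 4, 5, 8):
    print(width, round(bits_per_bit(bits, width), 3))
# The width matching the stream's real grain (4 here) should score lowest.
```

A real transformer sweep would have the same shape: train one model per candidate width and compare held-out loss per raw bit, since per-token loss isn’t comparable across widths.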