Someone I know claims to have found a way to directly pretrain neuralese models: https://aklein.bearblog.dev/zebra/
I’ve seen their prototype, and it definitely works, in the sense of producing reasonable text outputs while making non-trivial use of >100 continuous latents, but whether it actually amounts to anything remains to be seen.
This knocks on the door of a principle I have been playing with for a while: a good continual learning and/or sequence modelling algorithm should converge to some known behavior. Architectures like attention have undefined behavior once they run past their training context length. SGD, on the other hand, can be run indefinitely, because we know it will eventually converge to an interpolation of the data.
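A toy sketch of what I mean (my own illustration, not code from the linked prototype): run plain SGD on an overparameterized linear model. The update rule is the same at step 1 and step 1,000,000, and with more parameters than examples the residual shrinks toward zero, i.e. toward an interpolation of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 32))  # 8 examples, 32 params: interpolation is possible
y = rng.normal(size=8)
w = np.zeros(32)

lr = 0.01
for step in range(50_000):
    i = rng.integers(8)        # sample one example at random
    err = X[i] @ w - y[i]      # residual on that example
    w -= lr * err * X[i]       # SGD step on squared error

# max residual over the dataset shrinks toward 0 as steps increase
print(np.max(np.abs(X @ w - y)))
```

The contrast with a fixed-context architecture is the point: there is no step count at which this procedure leaves the regime it was "trained" in.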
I’ve prototyped and written about an architecture based on this principle [1], and have so far seen positive results on length extrapolation (note that the blog is out of date and only contains results from my earliest prototypes).
1: https://aklein.bearblog.dev/ittt/