It shouldn’t be any different from a normal LLM, because only the architecture changes, not the objective.
I think you’re proposing a system where the model spits out several independently sampled, unordered tokens at a time. I agree that this would likely not converge to anything human-readable.
What I’m suggesting is that the main model’s output vectors be given to a much more lightweight decoder, which converts them into natural language (probably autoregressively), as in the two papers linked. These architectures should be able to slot in wherever normal token-based models are used today.
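To make the shape of this concrete, here is a minimal sketch of the setup I mean. Everything below (the dimensions, the greedy RNN-style readout, the function names) is my own illustrative assumption, not taken from either paper; the point is only the division of labor, with the big model emitting one latent vector per patch and a tiny decoder expanding each latent into tokens:

```python
import numpy as np

rng = np.random.default_rng(0)
D_LATENT, D_HID, VOCAB, MAX_LEN = 64, 32, 100, 8

def main_model(n_patches):
    # Stand-in for the large model: one continuous latent vector per patch.
    # (A real model would produce these from context; random here for shape only.)
    return rng.normal(size=(n_patches, D_LATENT))

# Lightweight decoder parameters (tiny compared to the main model).
W_in = rng.normal(size=(D_LATENT, D_HID)) * 0.1
W_tok = rng.normal(size=(VOCAB, D_HID)) * 0.1
W_out = rng.normal(size=(D_HID, VOCAB)) * 0.1

def decode(latent, bos=0, eos=1):
    # Greedy autoregressive readout conditioned on one latent vector.
    h = np.tanh(latent @ W_in)           # condition decoder state on the latent
    tok, out = bos, []
    for _ in range(MAX_LEN):
        h = np.tanh(h + W_tok[tok])      # fold in the previous token, RNN-style
        tok = int(np.argmax(h @ W_out))  # greedy choice of next token
        if tok == eos:
            break
        out.append(tok)
    return out

latents = main_model(n_patches=3)
texts = [decode(z) for z in latents]     # one short token sequence per latent
```

Note that only the decoder is autoregressive over tokens; the main model operates entirely in latent space, which is why it can slot in wherever a token-based model sits today.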
On a separate note: I brought up the essay-per-vector scenario to illustrate why bandwidth concerns don’t make sense to me, not because I find it plausible. My intuition is that attention is probably better than an MLP at learning symmetries, in language or otherwise, so an essay would always be best represented by more than one vector, holding total dimensionality or compute constant.
I think it’s plausible that this effect is strong enough to keep the optimal amount of text per vector quite small, especially given that Byte Latent Transformers and their ilk don’t seem like magic bullets, but in that case we wouldn’t need to worry about the bandwidth thing anyway. Of course, empirical evidence will be needed here, such as a scaling law for patch size.
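For what that last measurement would look like: the usual move is to fit a power law to loss as a function of patch size in log-log space. The numbers below are synthetic by construction (the "losses" are generated from an exact power law with made-up coefficients), so the fit just recovers the exponent we put in; real curves would be noisy and the interesting question is where the fitted curve bottoms out:

```python
import numpy as np

patch_sizes = np.array([1, 2, 4, 8, 16, 32])
a_true, b_true = 2.0, 0.35                     # made-up coefficients for the demo
loss = a_true * patch_sizes ** b_true          # synthetic per-byte loss measurements

# A power law L = a * s^b is linear in log-log space: log L = log a + b log s,
# so ordinary least squares on the logs recovers the exponent.
b_fit, log_a_fit = np.polyfit(np.log(patch_sizes), np.log(loss), 1)
a_fit = np.exp(log_a_fit)
# b_fit ≈ 0.35, a_fit ≈ 2.0 (exact recovery, since the data are noiseless)
```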