Use more text than one token to avoid neuralese
You want to relay the contents of a transformer’s output vector to the next input: next_input = encode(decode(output)).
You’re currently using next_input = embed(sample_token(output)) to do this.
This compresses the output to one row of a lookup table. That’s a pretty big squash. Too big, surely—there must be a better way.
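A toy numpy sketch of the squash (all names and sizes made up; greedy sampling stands in for `sample_token`). Two quite different output vectors can select the same token and therefore produce bit-for-bit identical next inputs:

```python
import numpy as np

# Toy sizes; real models use d_model in the thousands and vocab ~100k.
rng = np.random.default_rng(0)
d_model, vocab = 8, 16

W_unembed = rng.normal(size=(d_model, vocab))  # output projection (hypothetical)
W_embed = rng.normal(size=(vocab, d_model))    # the lookup table

def next_input(output):
    """next_input = embed(sample_token(output)), with greedy sampling."""
    logits = output @ W_unembed
    token = int(np.argmax(logits))   # the whole vector collapses to one token id
    return W_embed[token]            # one row of the lookup table

v = rng.normal(size=d_model)
# Scaling the output vector changes every coordinate, but not the argmax token,
# so the next input is identical: the entire difference is discarded.
assert np.array_equal(next_input(v), next_input(2.0 * v))
```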
Enter neuralese. If you make next_input = output (or something like that), you lose no bandwidth at all. The bad news is that these vectors no longer correspond to natural language.
But you can add more bandwidth by funneling each vector through more text. That way, encode(decode(output)) doesn’t lose too much information.
You could have each vector decode to multiple tokens. Or even a cleverly chosen patch of bytes. Perhaps whole sentences, paragraphs, or essays one day.
I don’t know which is best, but you do have options here, so it’s not obvious to me that “you need more bandwidth” implies “you must abandon natural language”—unless you’re forcing it through a text intermediate as impoverished as, well, a lookup table.
Edit: Nothing new under the sun, it seems. Please look here for prior discussion of this topic, and consider this post a bump for it.
Do you have an intuition for how hard it would be to keep the multiple-token outputs human-understandable? For sequences of individual tokens (aka sentences) we can train on human text, and the chains of thought of LLMs do look vaguely like human writing.
For sequences of groups of tokens (and eventually sequences of essays) I’m uncertain how much the results are human-understandable. An argument against it might be: sequences of groups of tokens (parallel sentences?) are a novel modality and LLMs will make up some language, but it may not be very human-like because there is no human “parallel sentence” language.
It shouldn’t be any different from a normal LLM, because only the architecture changes and not the objective.
I think you’re proposing a system where the model spits out several independently sampled/unordered tokens at a time. I agree that this would likely not converge to something human readable.
What I’m suggesting is that the main model’s output vectors be given to a much more lightweight decoder, which converts it into natural language (probably autoregressively), as in the two papers linked. These architectures should be able to slot in wherever normal token-based models are used today.
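A minimal numpy sketch of that shape of system, with the heavy model's output vector feeding a lightweight autoregressive decoder. All shapes, the greedy decoding, and the mean-pooling re-encoder are made-up simplifications; the linked papers use small trained transformers for both directions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, k = 8, 16, 4           # k = tokens emitted per output vector

W_embed = rng.normal(size=(vocab, d_model))    # shared token embedding table
W_dec = rng.normal(size=(2 * d_model, vocab))  # toy decoder head (hypothetical)

def decode(latent):
    """Autoregressively emit k token ids, each step conditioned on the latent."""
    tokens, state = [], np.zeros(d_model)
    for _ in range(k):
        logits = np.concatenate([latent, state]) @ W_dec
        t = int(np.argmax(logits))            # greedy decoding for simplicity
        tokens.append(t)
        state = 0.5 * state + W_embed[t]      # crude running context
    return tokens

def encode(tokens):
    """Fold the k tokens back into one input vector (mean pooling here)."""
    return W_embed[np.array(tokens)].mean(axis=0)

output = rng.normal(size=d_model)
tokens = decode(output)        # k inspectable token ids instead of one
next_vec = encode(tokens)      # encode(decode(output)), with more bandwidth
```

The point is only the interface: the text channel carries k tokens per vector rather than one, so the round trip discards less, while everything between decode and encode stays human-readable.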
On a separate note: I mentioned essays-per-vector to illustrate that bandwidth concerns don’t make sense to me, but I don’t find that scenario plausible. My intuition is that attention is probably better than an MLP at learning symmetries in language or otherwise, so an essay would always be best represented by more than one vector, keeping total dimensionality or compute constant.
I think it’s plausible that this effect is strong enough to keep the optimal amount of text per vector really quite small, especially given that Byte Latent Transformers and their ilk don’t seem like magic bullets, but then we wouldn’t need to worry about the bandwidth thing anyway. Of course, empirical evidence will be needed here, like a scaling law for patch size.
It’s unclear to me if CoT is an active information bottleneck. There are rather easy ways for an LLM to pack more information into its CoT, ways that should be easy for RL to find, which it nevertheless isn’t employing.
I did some very simple tests of this; the basic result was that CoT is rather low-entropy compared with other LLM outputs (with the caveat that we’re really measuring an upper bound). Interested to hear your thoughts.
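For concreteness, here is a minimal numpy sketch of that kind of measurement (hypothetical shapes; a real test would feed in logits from an actual model's sampled CoT). A near-deterministic, idiom-heavy sequence sits near zero entropy per token, far below the log(vocab) ceiling a bandwidth-hungry channel would push toward:

```python
import numpy as np

def mean_token_entropy(logits):
    """Mean per-step entropy (nats) over a sequence of next-token distributions.
    logits: array of shape (seq_len, vocab)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return float((-(p * np.log(p + 1e-12)).sum(axis=-1)).mean())

vocab = 10
flat = np.zeros((5, vocab))      # maximally uncertain next-token distributions
peaked = np.zeros((5, vocab))    # near-deterministic, CoT-like distributions
peaked[:, 0] = 50.0

# flat sequences approach the ceiling log(vocab); peaked ones sit near zero
```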
I strongly agree; the fact that filler tokens work at all is also suggestive evidence in this direction.
Though it doesn’t exclude the possibility that performance increases monotonically with both CoT length and information density, such that filler tokens work but the optimum is still bandwidth-limited. I think the entropy evidence you provided is comparatively much stronger.
[EDIT: poor punctuation on my part originally: for clarity, filler tokens working doesn’t exclude that possibility, which makes the entropy evidence stronger.]
Broadly, the CoT text I’ve seen (granted, I haven’t seen all that much) doesn’t feel bandwidth-limited. It leans far more on weird, consistent idioms and repetition than I’d expect RL to settle on if bandwidth were a pressing issue. If it were, I’d expect something more like a fluidly ultra-polyglot Finnegans Wake, with neologisms made up on the fly, flitting in and out through in-context learning.
(Only semi-related: I trained char-level transformers on the Wake a few months ago for fun; the loss was consistently far higher than for normal texts, all else being equal.)