No, for several reasons. For starters, quantization is normally applied after training and is not present during training (mainly because it introduces a lot of gradient problems). This is not comparable to the token distribution, which we incorporate during training and train on. (In other words, the model can’t take advantage of any possible benefits of quantization because it was trained in a completely different setting.)
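To make the gradient problem concrete: a rounding step is piecewise constant, so its slope is zero almost everywhere and gradient descent gets no signal through it. The sketch below checks this numerically with a central finite difference. The `quantize` and `finite_diff` helpers and the step size 0.25 are illustrative assumptions, not taken from any particular quantization library.

```python
def quantize(x: float, step: float = 0.25) -> float:
    """Snap x to the nearest multiple of `step` (toy uniform quantizer)."""
    return round(x / step) * step


def finite_diff(f, x: float, eps: float = 1e-4) -> float:
    """Central finite-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


# Away from a rounding boundary the quantizer is flat, so the slope is zero.
slope_quantized = finite_diff(quantize, 0.30)

# The identity function, for comparison, has slope one.
slope_identity = finite_diff(lambda x: x, 0.30)
```

Training setups that do include quantization have to work around this flat slope somehow, which is one reason it is usually left until after training.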
Even more importantly, the error doesn’t accumulate in the KV cache (the normal weights obviously can’t accumulate it, since they are literally fixed). If you inspect a token’s KV cache at the i-th layer, it only carries the noise from layer 0 through layer i (any minor noise from before was removed, since the input was a discrete token at the beginning). It will in turn carry this noise to other tokens through the attention mechanism, and those tokens will still have to face the noise from layer i through the final layer. But this is just one normal forward pass’s worth of noise, not something we have to worry much about, since we are now just going to tokenize the output, removing all minor noise. (In other words, my argument focuses on noise that grows and grows through autoregressive steps; this is just the noise of a normal forward pass.)
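The "tokenizing removes all minor noise" step can be sketched as nearest-neighbour snapping onto a discrete codebook. This is a toy 1-D stand-in for picking a token and re-embedding it, not how a real LLM decodes. The codebook values, the per-layer noise values, and the `snap` helper are all made up for illustration.

```python
CODEBOOK = [0.0, 1.0, 2.0, 3.0]  # hypothetical 1-D "token embeddings"


def snap(x: float) -> float:
    """Return the codebook entry nearest to x (toy tokenization)."""
    return min(CODEBOOK, key=lambda c: abs(c - x))


clean = 2.0

# Perturbations picked up layer by layer during one forward pass (made-up values).
layer_noise = [0.05, -0.02, 0.08, 0.03]
noisy = clean + sum(layer_noise)  # drifted, but still closest to 2.0
recovered = snap(noisy)           # the drift is discarded entirely

# Snapping only helps while the total drift stays under half the codebook spacing.
too_noisy = clean + 0.6
misread = snap(too_noisy)         # lands on the neighbouring entry instead
```

As long as one pass’s worth of noise stays below half the spacing between entries, the next pass starts from a clean state again.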
Specifically, as noted in “The Bandwidth Counterargument”:
Having bottleneck layers in a normal neural network is nonsensical—when the “distance” is small, you should stay in high-dimensional, continuous latent space. By the time one forward pass finishes, noise hasn’t yet grown enough to matter and tokenization can clean it up.
I think the question is whether applying quantization to hidden states in the middle of the forward pass during both training and inference would improve performance, which your argument would seem to imply.
I’m not sure which passage you are referring to when you say that my argument implies this. The sections “The Bandwidth Intuition/Counterargument” are supposed to clear up exactly this, stating roughly that I understand there is obviously still a loss of information, and that it is therefore nonsensical for a normal NN to have minuscule layers. That assessment doesn’t carry over to neuralese LLMs, though. They recursively accumulate this error, turning it into a much bigger problem. If the error is simply allowed to grow through tens or hundreds of forward passes, the extra information is no longer worth it. And if we already tokenize the outputs, quantization no longer serves any real purpose.
I hope my position has become a little clearer.
But why would this error accumulation be a problem in recurrent forward passes and not one long forward pass?
I’m not claiming I know the perfect cutoff point between not losing information and not letting errors accumulate, if something like that even exists. It could very well be that after three forward passes with neuralese everything would still be mostly fine, or it could be that even in the middle of a single forward pass it makes sense to have some kind of mitigation (I think you could build an argument around MoE being one). But what this perfect ratio is doesn’t really matter. The point is that errors compound across recurrent forward passes far more than within a single normal forward pass, so past some number of passes the extra bandwidth can’t be worth it anymore.
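A toy simulation of this trade-off, under loud assumptions: each forward pass adds independent Gaussian noise to a single scalar state, and "tokenizing" means rounding to the nearest discrete state. Real networks may amplify or damp errors in ways this ignores. `SPACING`, `NOISE_STD`, and the function names are hypothetical choices for illustration.

```python
import random

SPACING = 1.0    # distance between neighbouring discrete states (assumed)
NOISE_STD = 0.1  # noise added by one forward pass (assumed)


def snap(x: float) -> float:
    """Round to the nearest discrete state (toy tokenization)."""
    return round(x / SPACING) * SPACING


def run(passes: int, discretize: bool, seed: int) -> float:
    """Return the absolute drift from the starting state after `passes` passes."""
    rng = random.Random(seed)
    state = 0.0
    for _ in range(passes):
        state += rng.gauss(0.0, NOISE_STD)  # noise from one forward pass
        if discretize:
            state = snap(state)             # discrete bottleneck resets the drift
    return abs(state)


def mean_drift(passes: int, discretize: bool, trials: int = 200) -> float:
    """Average the drift over several seeded trials."""
    return sum(run(passes, discretize, seed) for seed in range(trials)) / trials


continuous_drift = mean_drift(passes=400, discretize=False)
discrete_drift = mean_drift(passes=400, discretize=True)
```

In this additive model the continuous drift grows roughly with the square root of the number of passes, while the discretized state stays put as long as a single pass’s noise is small relative to `SPACING`. Where the crossover lies depends entirely on that ratio, which is exactly the cutoff I’m not claiming to know.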
Making everything discrete is one extreme, just as making everything continuous is the other. I’m arguing that the sweet spot lies somewhere in the middle, recognizing the importance of both rich, continuous representations and clear, discrete ones.