bandwidth might not be better in this case; it isn’t in all cases
A several-thousand-dimensional vector can contain far more information than an integer between 1 and ~200K. The implementation is likely painful, but I can't see a world where the optimal bandwidth, given a good implementation of both, is lower.
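A quick back-of-envelope comparison makes the gap concrete. The numbers here are illustrative assumptions, not from the thread: a ~200K-token vocabulary and a 4096-dimensional float16 hidden vector.

```python
import math

# Information capacity of one sampled token vs. one hidden vector.
# Assumptions (hypothetical, for illustration): vocab of ~200K tokens,
# hidden state of 4096 dimensions at float16 precision.
vocab_size = 200_000
token_bits = math.log2(vocab_size)   # ~17.6 bits per discrete token

dims = 4096
bits_per_dim = 16                    # float16
vector_bits = dims * bits_per_dim    # 65,536 bits per vector

print(f"token: {token_bits:.1f} bits, vector: {vector_bits} bits")
print(f"ratio: {vector_bits / token_bits:.0f}x")
```

Even granting that most of those vector bits aren't usable information, the raw channel capacity differs by orders of magnitude.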
The transformer already has thousands of dimensions available through attention, no? How much does removing the tokenization buy you on top of that? I agree it buys you some, but it seems unclear how much.
A lot, because the only thing that is recurrent is the text/vector CoT. The residual stream is very rich, but the number of sequential computation steps is bounded by the number of layers; without some recurrence, intermediate information can't be sent back to the beginning.
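The depth argument can be sketched numerically. The layer and step counts below are hypothetical, chosen only to show how CoT multiplies sequential depth:

```python
# Sequential-depth sketch (hypothetical numbers): within one forward
# pass, a transformer with L layers performs at most L sequential
# steps of computation. Emitting T CoT steps and feeding each one
# back in as input multiplies the available sequential depth by T.
layers = 80        # assumed layer count
cot_steps = 1000   # assumed number of CoT tokens (or vectors)

depth_single_pass = layers               # bounded by architecture
depth_with_cot = layers * cot_steps      # recurrence through the CoT

print(depth_single_pass, depth_with_cot)
```

The point is that the CoT channel, whether text or neuralese, is the only loop that carries intermediate results back to layer 1.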
But there are systems that work better with lower bandwidth or have deliberately lower bandwidth, like autoencoders.
I understand that the bandwidth is certainly higher for one than the other, but this might not be an advantage in this circumstance, or it could be an advantage in some respects but a greater disadvantage in others.
The point of an autoencoder is to form good representations, not to perform well. I'm struggling to think of any other examples where low bandwidth is good that aren't just implementation issues (and, again, in current systems text CoT > neuralese, so evidently low bandwidth can be good).