(In agreement): Neuralese is ~equivalent to wrapping your model as a DEQ (deep equilibrium model) with the residual stream shifted by one on every pass, as far as I can tell, and it’s not obvious to me that this is the relevant One Weird Trick. The neural network already has a way to shuttle around vast amounts of cryptic high-dimensional data: the neural network part of the neural network. It seems much more likely to me that the relevant axis of scaling is something like a byte-latent transformer with larger and larger patches.
Edit: I guess in principle this isn’t that different from neuralese with the input being encode(decode(vector)); the larger point is that if a single token is too narrow a bottleneck for a vector, you can just make the vector correspond to more text.
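To make the comparison concrete, here is a minimal toy sketch of the two recurrences being contrasted: standard decoding, where the final residual-stream vector is collapsed to a single token and re-embedded (the bottleneck), versus a neuralese/DEQ-style step that feeds the full vector straight back in. All names here (`step`, `token_step`, the weight matrices) are illustrative stand-ins, not any real model's API.

```python
# Toy contrast between token-bottlenecked decoding and a "neuralese"-style
# recurrence. Everything here is a hypothetical stand-in for illustration.
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 16, 32
W_embed = rng.normal(size=(vocab, d_model))     # token id -> input vector
W_unembed = rng.normal(size=(d_model, vocab))   # vector -> logits
W_res = rng.normal(size=(d_model, d_model)) * 0.1

def step(x):
    """Stand-in for one full forward pass over the residual stream."""
    return np.tanh(x @ W_res) + x

def token_step(x):
    """Standard decoding: collapse the d_model-dim vector to one discrete
    token (~log2(vocab) bits survive), then re-embed it as the next input."""
    token = int(np.argmax(x @ W_unembed))
    return W_embed[token]

def neuralese_step(x):
    """Neuralese / DEQ-like step: pass the full continuous vector through
    to the next position unchanged -- no discretization, no bottleneck."""
    return x

x = rng.normal(size=d_model)
for _ in range(4):
    x = step(neuralese_step(x))  # swap in token_step(x) to see the bottleneck
```

The edit's point corresponds to widening `token_step`: if one token is too narrow, `encode(decode(vector))` with a multi-token patch (as in a byte-latent transformer with larger patches) lets the discrete channel carry more bits per vector.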