I don’t have watertight arguments, but to try and state it cleanly:
During inference, a forwards pass of the neural net is computed repeatedly as each token is generated. Activation vectors propagate from one layer to the next.
Activation vectors are the main flow of information from earlier layers to later layers.
The attention mechanism also allows activation vectors from previous tokens to influence the current computation. But crucially, this communication happens between activations at the same attention layer; it doesn’t skip forwards or backwards in terms of layers.
Thus, the only flow of information from later layers to earlier layers is contained in the sequence of tokens produced by the model.
This is silly. Layer 1 for the 2nd token happens after layer 100 for the 1st token. There’s no reason why we shouldn’t be able to give layer 1 for the 2nd token as much information as it wants about any of the 1st token layers.
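To make the dependency claim concrete, here is a toy sketch that lists which (token, layer) cells can reach a given cell through activations alone. It assumes a simplified model where layer l of a token reads layer l-1 of itself and all earlier tokens. Layer 0 stands for the input embedding, and the `neuralese` flag adds a hypothetical feedback edge from the last layer of the previous token into that input. The function name and the flag are made up for illustration, not taken from any real architecture.

```python
def activation_ancestors(token, layer, num_layers, neuralese=False):
    """Return the (token, layer) cells whose activations can reach (token, layer).

    Simplifying assumption: layer l of token t reads layer l-1 of tokens 0..t.
    Layer 0 is the input embedding. With neuralese=True, a hypothetical edge
    feeds the last layer of token t-1 into layer 0 of token t.
    """
    seen = set()
    stack = [(token, layer)]
    while stack:
        t, l = stack.pop()
        if l > 0:
            parents = [(s, l - 1) for s in range(t + 1)]
        elif neuralese and t > 0:
            parents = [(t - 1, num_layers)]
        else:
            parents = []  # plain token embedding: no activation parents
        for p in parents:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


# Layer 1 of the 2nd token (index 1) in a 100-layer model.
standard = activation_ancestors(1, 1, 100)
feedback = activation_ancestors(1, 1, 100, neuralese=True)
print((0, 100) in standard)  # False
print((0, 100) in feedback)  # True
```

In the standard case the only activation ancestors are the two input embeddings, so anything layer 100 of the 1st token worked out has to pass through the sampled token to get there.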
Advantages of using activations for communication:
Activations contain more information than a single sampled token, of course.
During pre-training, token logits are optimized to put high probability on the actual next token, which constrains what else they can encode a fair bit.
Activations are also continuous, so they can encode continuous values and probabilities, along with discrete values. And they can be optimized by gradient descent to be more helpful.
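For a rough sense of the gap in capacity, here is a back-of-the-envelope calculation. The vocabulary size, vector width, and number format are all assumed numbers picked for illustration, not taken from any particular model. The raw size of the vector is only a loose upper bound, since the usable information in an activation is surely much lower.

```python
import math


def token_bits(vocab_size):
    """Upper bound on the bits carried by one sampled token."""
    return math.log2(vocab_size)


def activation_raw_bits(d_model, bits_per_value):
    """Raw storage size of one activation vector (loose upper bound)."""
    return d_model * bits_per_value


VOCAB_SIZE = 50_000   # assumed vocabulary size
D_MODEL = 4096        # assumed activation width
BITS_PER_VALUE = 16   # assumed 16-bit floats

print(round(token_bits(VOCAB_SIZE), 1))               # 15.6
print(activation_raw_bits(D_MODEL, BITS_PER_VALUE))   # 65536
```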
Also:
I’m actually not certain that neuralese is technically inevitable. Yes, it’s almost certainly superior if we assume away the problem of training a neuralese model in the first place (i.e. assume an infinite compute budget). But without that assumption…
Basically, the way attention currently works makes it easy to parallelize across tokens during training (and context reading). This is why context reading is cheaper per token than producing text, and why training on such a huge amount of data is possible. Neuralese doesn’t have this property of being fast when the tokens are already supplied, because the activation data passed from one token to the next still has to be filled in sequentially.
So, neuralese models will probably have to be trained on less data, and they will be less efficient at reading context. They are probably about as efficient at generating text (at least if the non-neuralese competitor doesn’t get to use speculative decoding with a cheaper model).
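As a sketch of the parallelism point, the following counts the longest chain of sequentially dependent steps when all tokens are already supplied. It uses the same simplified dependency model (layer l of a token reads layer l-1 of itself and earlier tokens), and the `neuralese` flag is again a hypothetical feedback edge from the last layer of one token to the input of the next. It ignores the actual cost of each step and only measures how much of the work is forced to be serial.

```python
def critical_path(num_tokens, num_layers, neuralese):
    """Length of the longest dependency chain when reading supplied tokens."""
    depth = {}
    for t in range(num_tokens):
        for l in range(num_layers + 1):
            if l > 0:
                # Layer l of token t waits on layer l-1 of tokens 0..t.
                depth[(t, l)] = 1 + max(depth[(s, l - 1)] for s in range(t + 1))
            elif neuralese and t > 0:
                # Hypothetical feedback: input waits on the previous token's last layer.
                depth[(t, 0)] = 1 + depth[(t - 1, num_layers)]
            else:
                # Token embeddings are available immediately.
                depth[(t, 0)] = 0
    return max(depth.values())


print(critical_path(8, 4, neuralese=False))  # 4
print(critical_path(8, 4, neuralese=True))   # 39
```

Without feedback the chain length equals the number of layers no matter how many tokens there are. With feedback it grows linearly with the number of tokens, which is the cost being described here.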
I guess models that have neuralese “turned off” during pre-training and context reading could still be comparably efficient. But then all the optimization of the neuralese encoding beyond just “use the last layer output” has to happen during RL. Due to its low cost, this is probably how the first usage of neuralese we see in the wild will work.
The other issue, which would only be a problem during training, is that gradients have to backpropagate through the neuralese vectors. This could result in the usual gradient stability issues we see in the training of RNNs that occur because the neural net effectively becomes incredibly deep. I think the field has solutions for this, but it’s another big complication to deal with when you try to scale the models.
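A toy illustration of why the effective depth matters: for the scalar recurrence h_{k+1} = w * h_k, the gradient of the final state with respect to the first is just w multiplied by itself once per step. This is a deliberately oversimplified stand-in for a real network, and the layer and token counts are assumed numbers.

```python
def gradient_gain(weight, depth):
    """d h_depth / d h_0 for the scalar recurrence h_{k+1} = weight * h_k."""
    gain = 1.0
    for _ in range(depth):
        gain *= weight  # chain rule contributes one factor per step
    return gain


NUM_LAYERS = 4     # assumed, for illustration
NUM_TOKENS = 100   # assumed, for illustration

# Gradient path through one token's layers vs. through every token's layers.
for w in (0.9, 1.1):
    short = gradient_gain(w, NUM_LAYERS)
    long = gradient_gain(w, NUM_LAYERS * NUM_TOKENS)
    print(w, short, long)
```

Across 4 steps both weights give a gradient of order one. Across 400 steps the same weights give a gradient that has either all but vanished or blown up, which is the RNN-style instability mentioned above.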
Anyway, I think it’s probably going to happen eventually, especially if the “smaller, higher-quality training dataset” trend persists, but it might take longer than people think.
See also Karpathy’s claim that models will be split into a part that focuses on reasoning but has relatively little memorized and a part that focuses on memorization. Karpathy’s assumption is that the reasoning part could be quite small. So if that’s true, then probably the reasoning part gets neuralese but the memorization part doesn’t, and the fact that the reasoning part is small makes the extra costs of neuralese more tolerable.