“...feed forward networks for the new token don’t have access to the past feed-forward states of the other tokens...”
This isn’t correct. The attention mechanism can move information from the neural network outputs at previous times to the current time, which is then fed into the feedforward network for the current time. The basic transformer mechanism is to alternate cross-time attention computations with within-current-time neural network computations, over many layers. Without access to information from past times, performance would obviously be atrocious.
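Concretely, a minimal single-head sketch of that alternation might look like this (my own toy code, with layer norms and multiple heads omitted and made-up parameter names):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model). Each position attends to itself and to earlier
    # positions, which is how information from previous times reaches the current time.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf  # block attention to future positions
    return softmax(scores) @ V

def feedforward(X, W1, W2):
    # Applied independently at each position: the within-current-time computation.
    return np.maximum(X @ W1, 0.0) @ W2

def decoder_block(X, p):
    # One layer: cross-time attention, then per-position feedforward, each with a
    # residual connection. A full model stacks many such blocks.
    X = X + causal_self_attention(X, p["Wq"], p["Wk"], p["Wv"])
    X = X + feedforward(X, p["W1"], p["W2"])
    return X
```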
In a sense, the KV cache that retains this information from past times is “just” an optimization, because the computations are (in theory, not always in practice) deterministic, so one could simply redo them for every previous token when predicting the next token (assuming the previously-generated tokens are retained). But that doesn’t seem enough to support your argument.
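To make the “just an optimization” point concrete, here’s a toy continuation of the sketch above (my own illustrative code, reusing the softmax helper; none of these names come from any real library): the attention output for the newest token comes out the same whether the keys and values for past tokens are recomputed from their retained representations or read back from a cache, because the projections are deterministic.

```python
def attend_new_token(x_new, X_prefix, Wq, Wk, Wv, kv_cache=None):
    # x_new: (d_model,) current token's representation at this layer.
    # X_prefix: (t, d_model) representations of all previous tokens at this layer.
    q = x_new @ Wq
    if kv_cache is None:
        # No cache: redo the (deterministic) key/value projections for every past token.
        K_past, V_past = X_prefix @ Wk, X_prefix @ Wv
    else:
        # Cache: reuse the keys/values computed when those tokens were first processed.
        K_past, V_past = kv_cache
    K = np.vstack([K_past, x_new @ Wk])
    V = np.vstack([V_past, x_new @ Wv])
    w = softmax(q @ K.T / np.sqrt(K.shape[-1]))
    return w @ V, (K, V)  # return the updated cache for the next step
```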
Of course, it’s quite possible that the models don’t attend very well to the past states, and so suffer to some extent from the issues you mention, but it’s not a fundamental property of the architecture.
“The attention mechanism can move information from the neural network outputs at previous times to the current time”
Again, I could be misunderstanding, but it seems like only outputs of the neural networks are being stored and made available here, not the entire neural network state.
This was the purpose of my cancer-curing hypothetical. Any conclusions made by the feed-forward network that don’t make it into the output are lost. And the output is narrower than the widest part of the feed-forward network, so some information is “lost”, in the sense of being unavailable to subsequent tokens.
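Just to put rough numbers on that (a toy NumPy sketch of my own; the 768/3072 sizes and the 4x expansion factor are a common convention I’m assuming, not something established in this thread):

```python
import numpy as np

d_model, d_hidden = 768, 4 * 768   # 4x expansion is a common convention, not universal

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.02, (d_model, d_hidden))
W2 = rng.normal(0, 0.02, (d_hidden, d_model))

x = rng.normal(size=d_model)       # one token's input to the feedforward block
h = np.maximum(x @ W1, 0.0)        # (3072,) intermediate activations: the "conclusions"
y = x + h @ W2                     # (768,) written back to the residual stream

# Later tokens can attend to (functions of) y, but never to h itself:
# the 3072-dimensional hidden state is squeezed down to 768 dimensions on the way out.
```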
Models not attending very well to past states could be an additional factor worth considering, but I’m not sure if that is or isn’t true.
OK, I think I more clearly see what you’re saying. The hidden unit values in a feedforward block of the transformer at a previous time aren’t directly available at the current time—only the inputs of that feedforward block can be seen. But the hidden unit values are deterministic functions of the inputs, so no information is lost. If these feedforward blocks were very deep, with many layers of hidden units, then keeping those hidden unit values directly available at later times might be important. But actually these feedforward blocks are not deep (even though the full network with many such blocks is deep), so it may not be a big issue—the computations can be redundantly replicated if it helps.
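A tiny sketch of what I mean by “deterministic functions of the inputs” (again just my own toy code with made-up sizes): as long as the block’s input is recoverable at a later time, the hidden values can be reproduced exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_hidden = 768, 4 * 768
W1 = rng.normal(0, 0.02, (d_model, d_hidden))
x  = rng.normal(size=d_model)     # the block's input, visible to later positions via attention

def ffn_hidden(x, W1):
    # Shallow block: one hidden layer, a cheap deterministic function of its input.
    return np.maximum(x @ W1, 0.0)

h_then  = ffn_hidden(x, W1)   # hidden values computed when the token was processed, then discarded
h_later = ffn_hidden(x, W1)   # recomputed at a later time; identical, since the function is deterministic

assert np.allclose(h_then, h_later)
```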
I’m not really talking about true information loss, more about computation being repeated when it doesn’t need to be.
And yes, the feedforward blocks can be just 1 or 2 layers deep, so I am open to this being either a small or a big issue, depending on the exact architecture.