The attention mechanism can move information from the neural network outputs at previous times to the current time.
Again, I could be misunderstanding, but it seems like only outputs of the neural networks are being stored and made available here, not the entire neural network state.
This was the purpose of my cancer-curing hypothetical. Any conclusions made by the feed-forward network that don’t make it into the output are lost. And the output is narrower than the widest part of the feed-forward network, so some information is “lost”/unavailable to subsequent tokens.
Models not attending very well to past states could be an additional factor worth considering, but I’m not sure whether that’s actually the case.
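(A minimal sketch of the architecture being described above, assuming a standard GPT-style feed-forward block where the hidden layer is four times wider than the residual stream; the sizes and names here are illustrative, not taken from the discussion. The point it shows: the wide hidden activations are computed and then discarded, and only the narrower output is written back where later tokens can attend to it.)

```python
import torch
import torch.nn as nn

# Illustrative sizes roughly matching GPT-2-small: the hidden layer of the
# feed-forward block is 4x wider than the residual stream it reads and writes.
d_model, d_ff = 768, 3072

class FeedForwardBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)    # widen: d_model -> d_ff
        self.act = nn.GELU()
        self.down = nn.Linear(d_ff, d_model)  # narrow back down: d_ff -> d_model

    def forward(self, x):
        hidden = self.act(self.up(x))  # (1, d_ff): never cached, never attended to by later tokens
        return self.down(hidden)       # (1, d_model): the only thing written back to the residual stream

x = torch.randn(1, d_model)   # one token's residual-stream vector
y = FeedForwardBlock()(x)     # y.shape == (1, d_model); the d_ff-wide activations are discarded
```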
OK, I think I more clearly see what you’re saying. The hidden unit values in a feedforward block of the transformer at a previous time aren’t directly available at the current time—only the inputs of that feedforward block can be seen. But the hidden unit values are deterministic functions of the inputs, so no information is lost. If these feedforward blocks were very deep, with many layers of hidden units, then keeping those hidden unit values directly available at later times might be important. But actually these feedforward blocks are not deep (even though the full network with many such blocks is deep), so it may not be a big issue—the computations can be redundantly replicated if it helps.
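(For the "deterministic functions of the inputs" point, another small sketch under the same illustrative assumptions: recomputing the hidden activations from the stored block input reproduces them exactly, so what is at stake is redundant compute rather than lost information.)

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 8, 32
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),  # widen
    nn.GELU(),
    nn.Linear(d_ff, d_model),  # project back to the residual stream width
)

x_prev = torch.randn(1, d_model)               # the block's input at an earlier position;
                                               # this is what attention can still reach later
hidden_then = ffn[1](ffn[0](x_prev))           # hidden activations computed at that earlier position
hidden_again = ffn[1](ffn[0](x_prev))          # recomputed later from the same stored input
assert torch.equal(hidden_then, hidden_again)  # identical: the hidden values are a deterministic
                                               # function of the input, so they are recomputable, not lost
```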
I’m not really talking about true information loss, more that computation gets repeated when it doesn’t need to be.
And yes, the feedforward blocks can be only one or two layers deep, so I’m open to this being either a small or a big issue, depending on the exact architecture.