forward pass (e.g. the residual stream) has to be deleted, outputting only a single token.
This does not actually happen.
What happens is that the new token is now at the root of the attention structure, and can pass information from the final layers to the first layers when inferring the next token.
The residuals are translation independent, and are cached for further inference in autoregressive mode.
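A toy numpy sketch of why caching is valid under causal attention: past keys/values never depend on later tokens, so an incremental decode with a growing cache matches a full recompute over the prefix. All names here (`attend`, `W_q`, etc.) are illustrative, not from any real library.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8  # toy head dimension
W_q = rng.standard_normal((d, d))
W_k = rng.standard_normal((d, d))
W_v = rng.standard_normal((d, d))

def attend(q, K, V):
    # One query attending over all cached keys/values (causal by construction:
    # the cache only ever contains earlier positions).
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

# Stand-ins for the residual-stream states of a 5-token sequence.
x = rng.standard_normal((5, d))

# Incremental decode: append each token's K/V to the cache, never recompute old ones.
K_cache, V_cache, outs = [], [], []
for t in range(5):
    K_cache.append(x[t] @ W_k)
    V_cache.append(x[t] @ W_v)
    outs.append(attend(x[t] @ W_q, np.array(K_cache), np.array(V_cache)))

# Full recompute over the whole prefix gives the same results: cached
# activations for earlier tokens are unaffected by tokens that come after.
K_full, V_full = x @ W_k, x @ W_v
outs_full = [attend(x[t] @ W_q, K_full[: t + 1], V_full[: t + 1]) for t in range(5)]

assert all(np.allclose(a, b) for a, b in zip(outs, outs_full))
```

The assertion passing is the whole point: because attention at step t only reads positions 0..t, the per-token K/V entries can be computed once and reused for every later step.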
Thank you. Just earlier I was asking an AI whether my comment was reasonable, and it told me something similar.