Awkwardly, it depends on whether the model uses tied embeddings (the unembed is the transpose of the embed) or has separate embed and unembed matrices. With tied matrices, the model actually does have to do a sort of conversion, since the directions that encode the current token are the same ones the unembed reads as predictions.
Your discussion seems mostly accurate in the case of having separate embed and unembed, except that I don’t think the initial state is like “1k encode current, 1k encode predictions, rest start empty”. The embed can just directly write predictions into the initial state, in the directions the unembed reads.
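To make the difference concrete, here is a rough NumPy sketch of a toy "zero-layer" readout (embed, then unembed, with no attention, MLP, or layernorm in between). Everything in it is an assumption for illustration: the names `W_E` and `W_U`, the tiny sizes, and the made-up "next token is i+1" target. It is not meant to describe any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 6, 8  # toy sizes (assumption); d_model >= vocab so the solve below is exact

# Hypothetical embedding matrix: one row per token.
W_E = rng.normal(size=(vocab, d_model))

# Tied case: the unembed is the transpose of the embed.
# The direct embed -> unembed logits are W_E @ W_E.T, which is always symmetric,
# so "a predicts b" is forced to be as strong as "b predicts a".
logits_tied = W_E @ W_E.T
tied_is_symmetric = np.allclose(logits_tied, logits_tied.T)

# Untied case: pick a made-up asymmetric target ("token i predicts token i+1")
# and solve for a separate unembed that reads it straight off the initial state.
expected_next = (np.arange(vocab) + 1) % vocab
target_logits = np.zeros((vocab, vocab))
target_logits[np.arange(vocab), expected_next] = 1.0

# W_E has full row rank here, so W_E @ pinv(W_E) is the identity on token space.
W_U = np.linalg.pinv(W_E) @ target_logits
logits_untied = W_E @ W_U

untied_is_symmetric = np.allclose(logits_untied, logits_untied.T)
predicted_next = logits_untied.argmax(axis=1)
```

In this toy setup the untied readout reproduces the asymmetric "next token" table directly from the embeddings, while the tied readout cannot, because its direct-path logits are symmetric by construction. A real model has far more tokens than residual dimensions, so the untied case would only fit a low-rank approximation of such a table, but the asymmetry point should carry over.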