The heatmaps show two different things. On the original model’s tokens, the heatmap shows the cosine similarity of the reconstruction. On the decoder’s output tokens, it shows the weight the encoder’s pooling head puts on that token’s activation.
Yeah, cosine sim does throw away magnitude information. Dot product wouldn’t work because you can maximise it just by predicting a very large vector, but MSE would be a reasonable choice, since it penalises errors in both direction and magnitude.
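To make the comparison concrete, here is a rough sketch of how the three metrics react when a prediction points in the right direction but is 10x too large. The vectors and helper names are made up for illustration and are not taken from the actual setup.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_sim(a, b):
    """Cosine similarity: dot product divided by both norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical activation to reconstruct (illustrative values only).
target = [1.0, 2.0, 2.0]
exact = list(target)                 # perfect reconstruction
scaled = [10.0 * x for x in target]  # right direction, 10x too large

for name, pred in [("exact", exact), ("scaled", scaled)]:
    print(
        f"{name:>6}: cos={cosine_sim(pred, target):.3f} "
        f"dot={dot(pred, target):.1f} mse={mse(pred, target):.1f}"
    )
```

Cosine similarity gives both predictions the same score, the dot product prefers the inflated one, and only MSE penalises the wrong magnitude.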