In practice, we focus on the embedding associated with the last token from a late layer.
I don’t have time to provide citations right now, but a few results have made me skeptical of this choice: you’re probably better off using an intermediate layer rather than a late one. Early and late layers seem to deal more with token-level concerns, while mid-layers seem to handle more conceptual / abstract features.
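To make that concrete, here's a minimal sketch of how you might grab per-layer activations with Hugging Face transformers: passing `output_hidden_states=True` gives you one hidden-state tensor per layer, so you can take a mid layer instead of the final one. The model name ("gpt2") is just a small stand-in, not the model discussed here.

```python
# A sketch of pulling per-layer activations with Hugging Face transformers.
# "gpt2" is a small stand-in model; swap in whichever model you're probing.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("Example prompt", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

hidden = out.hidden_states                 # tuple of (1, seq_len, d_model), one per layer (plus embeddings)
mid_layer = len(hidden) // 2               # an intermediate layer rather than a late one
last_token_emb = hidden[mid_layer][0, -1]  # last-token embedding at that layer
```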
For training probes on a labelled dataset, you should train a probe for each layer and then pick whichever has the best training loss, or better yet, the best loss on a hold-out set if you have enough data. When we did this on llama-3.3-70b, the best probe was at layer 22/80.
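For illustration, a sketch of that per-layer sweep, assuming you've already cached activations into an array `acts` of shape (n_layers, n_examples, d_model) with binary `labels`. It fits one logistic-regression probe per layer with scikit-learn and picks the layer with the best held-out accuracy; all names and shapes here are my own, not from the post above.

```python
# A sketch of the per-layer probe sweep. Assumes `acts` has shape
# (n_layers, n_examples, d_model) -- e.g. the last-token embedding at every
# layer for each example -- and `labels` is a binary vector of length n_examples.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def best_probe_layer(acts: np.ndarray, labels: np.ndarray, seed: int = 0):
    """Fit one linear probe per layer and return the layer with the best
    held-out accuracy, plus the per-layer scores."""
    scores = []
    for layer in range(acts.shape[0]):
        X_train, X_val, y_train, y_val = train_test_split(
            acts[layer], labels, test_size=0.2, random_state=seed
        )
        probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        scores.append(probe.score(X_val, y_val))  # held-out accuracy for this layer
    return int(np.argmax(scores)), scores
```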
Also, instead of probing only the last token, I think it’s better to probe every token and average the scores, since the per-token scores are pretty noisy.
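A corresponding sketch of token-averaged scoring, assuming `probe` is a fitted scikit-learn LogisticRegression (e.g. from the sweep above) and `token_acts` holds the activations of every token at the chosen layer:

```python
# A sketch of token-averaged scoring. Assumes `probe` is a fitted sklearn
# LogisticRegression and `token_acts` has shape (n_tokens, d_model): the
# activations of every token in the sequence at the chosen layer.
import numpy as np

def sequence_score(probe, token_acts: np.ndarray) -> float:
    per_token = probe.predict_proba(token_acts)[:, 1]  # P(positive) for each token
    return float(per_token.mean())                     # average over tokens to reduce noise
```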