In practice, we focus on the embedding associated with the last token from a late layer.
I don’t have time to provide citations right now, but a few results have made me skeptical of this choice: you’re probably better off using an intermediate layer rather than a late one. Early and late layers seem to deal more with token-level concerns, while mid-layers seem to handle more conceptual / abstract features.
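To make that concrete, here's a minimal sketch of how you might grab per-layer activations with Hugging Face transformers: passing `output_hidden_states=True` gives you one hidden-state tensor per layer, so you can take a mid layer instead of the final one. The model name ("gpt2") is just a small stand-in, not the model discussed here.

```python
# A sketch of pulling per-layer activations with Hugging Face transformers.
# "gpt2" is a small stand-in model; swap in whichever model you're probing.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("Example prompt", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

hidden = out.hidden_states                 # tuple of (1, seq_len, d_model), one per layer (plus embeddings)
mid_layer = len(hidden) // 2               # an intermediate layer rather than a late one
last_token_emb = hidden[mid_layer][0, -1]  # last-token embedding at that layer
```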
For training probes on a labelled dataset, you should train a probe for each layer and then pick whichever has the best training loss, or better yet, the best loss on a hold-out set if you have enough data. When we did this on llama-3.3-70b, the best probe was at layer 22/80.
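For illustration, a sketch of that per-layer sweep, assuming you've already cached activations into an array `acts` of shape (n_layers, n_examples, d_model) with binary `labels`. It fits one logistic-regression probe per layer with scikit-learn and picks the layer with the best held-out accuracy; all names and shapes here are my own, not from the post above.

```python
# A sketch of the per-layer probe sweep. Assumes `acts` has shape
# (n_layers, n_examples, d_model) -- e.g. the last-token embedding at every
# layer for each example -- and `labels` is a binary vector of length n_examples.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def best_probe_layer(acts: np.ndarray, labels: np.ndarray, seed: int = 0):
    """Fit one linear probe per layer and return the layer with the best
    held-out accuracy, plus the per-layer scores."""
    scores = []
    for layer in range(acts.shape[0]):
        X_train, X_val, y_train, y_val = train_test_split(
            acts[layer], labels, test_size=0.2, random_state=seed
        )
        probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        scores.append(probe.score(X_val, y_val))  # held-out accuracy for this layer
    return int(np.argmax(scores)), scores
```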
Also, instead of probing only the last token, I think it’s better to probe every token and average the scores, since the per-token scores are pretty noisy.
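A corresponding sketch of token-averaged scoring, assuming `probe` is a fitted scikit-learn LogisticRegression (e.g. from the sweep above) and `token_acts` holds the activations of every token at the chosen layer:

```python
# A sketch of token-averaged scoring. Assumes `probe` is a fitted sklearn
# LogisticRegression and `token_acts` has shape (n_tokens, d_model): the
# activations of every token in the sequence at the chosen layer.
import numpy as np

def sequence_score(probe, token_acts: np.ndarray) -> float:
    per_token = probe.predict_proba(token_acts)[:, 1]  # P(positive) for each token
    return float(per_token.mean())                     # average over tokens to reduce noise
```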