For training probes on a labelled dataset, you should train a probe for each layer and then pick whichever probe achieves the lowest training loss. Better yet, select on a held-out dataset, if you have enough data. When we did this on llama-3.3-70b, the best probe was at layer 22/80.
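A minimal sketch of that layer sweep, assuming activations have already been extracted into an `acts` array of shape `(n_examples, n_layers, d_model)` with binary `labels`; the variable names and the scikit-learn setup are my own assumptions, not something from the original setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

def best_probe_layer(acts: np.ndarray, labels: np.ndarray):
    """Train one linear probe per layer; pick the layer with the
    lowest loss on a held-out split. Returns (layer, held-out loss)."""
    n_layers = acts.shape[1]
    idx_train, idx_val = train_test_split(
        np.arange(len(labels)), test_size=0.2, random_state=0
    )
    best = (None, np.inf)
    for layer in range(n_layers):
        probe = LogisticRegression(max_iter=1000)
        probe.fit(acts[idx_train, layer], labels[idx_train])
        val_loss = log_loss(
            labels[idx_val], probe.predict_proba(acts[idx_val, layer])
        )
        if val_loss < best[1]:
            best = (layer, val_loss)
    return best
```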
Also, instead of probing only the last token, I think it's better to probe every token and average the scores, since per-token scores are pretty noisy and averaging smooths them out.
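A sketch of that mean-pooling, assuming `token_acts` holds one sequence's activations at the chosen layer with shape `(n_tokens, d_model)` and `probe` is the fitted classifier from above; again the names are illustrative:

```python
import numpy as np

def sequence_score(probe, token_acts: np.ndarray) -> float:
    # Score every token position with the probe, then mean-pool
    # to reduce the per-token noise.
    per_token = probe.predict_proba(token_acts)[:, 1]
    return float(per_token.mean())
```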
+1