StefanHex comments on Try training token-level probes

StefanHex 16 Apr 2025 9:33 UTC
2 points
Thanks for the link, I hadn’t noticed this paper! They show that when you choose one position to train the probes on, choosing the exact answer position (the last token of the answer, if the answer is multi-token) gives the strongest probe.
After reading the section, I think they (unfortunately) do not train a probe to classify every token.^[1] Instead, the probe is trained exclusively on exact-answer tokens. Thus I expect that (a) their probe scores will not be particularly sparse, and (b) to get good performance you’ll probably still need to identify the exact answer token at test time (while in my appendix C you don’t need that).
This doesn’t matter much for their use-case (get good accuracy), but especially (a) does matter a lot for my use-case (make the scores LLM-digestible).
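To make the expectation in (a) concrete, here is a minimal toy sketch. It is not the paper's setup, nor the code from the post: the two-dimensional "activations", the feature meanings, and all names (`train_probe`, `make_prompt`, `flagged`) are assumptions made up for illustration. It assumes one feature that is present on every token of a wrong-answer prompt and one that marks the exact-answer token, then compares a probe trained only on exact-answer tokens with a probe trained on every token.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_probe(acts, labels, lr=1.0, steps=5000):
    """Fit a logistic-regression probe with plain gradient descent."""
    w, b = np.zeros(acts.shape[1]), 0.0
    for _ in range(steps):
        err = sigmoid(acts @ w + b) - labels  # gradient of the log-loss w.r.t. the logits
        w -= lr * (acts.T @ err) / len(labels)
        b -= lr * err.mean()
    return w, b

# Toy "activations" (an assumption, not real model data):
#   dim 0 = "this prompt contains a wrong answer" (present on every token)
#   dim 1 = "this token is the exact-answer token"
T, ANSWER_POS = 5, 3

def make_prompt(is_wrong):
    acts = np.zeros((T, 2))
    acts[:, 0] = float(is_wrong)
    acts[ANSWER_POS, 1] = 1.0
    return acts

wrong, right = make_prompt(True), make_prompt(False)

# (1) Single-position probe: trained only on the exact-answer tokens.
single = train_probe(
    np.stack([wrong[ANSWER_POS], right[ANSWER_POS]]),
    np.array([1.0, 0.0]),
)

# (2) Token-level probe: trained on every token of both prompts;
#     only the exact-answer token of the wrong prompt is labelled positive.
all_acts = np.concatenate([wrong, right])
all_labels = np.zeros(2 * T)
all_labels[ANSWER_POS] = 1.0
token = train_probe(all_acts, all_labels)

def flagged(acts, probe, threshold=0.5):
    """Run the probe on every token and return which tokens exceed the threshold."""
    w, b = probe
    return sigmoid(acts @ w + b) > threshold

flagged_single = flagged(wrong, single)
flagged_token = flagged(wrong, token)
print("single-position probe flags:", flagged_single.astype(int))
print("token-level probe flags:    ", flagged_token.astype(int))
```

In this toy construction, the single-position probe never sees non-answer tokens during training, so nothing stops it from firing on the prompt-wide feature across the whole wrong-answer prompt. The token-level probe is pushed to fire only at the answer position. Whether real activations behave like this is exactly the open question; the sketch only shows why the training set can matter for sparsity.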
Nonetheless, this is a great reference. I’ll edit it into the post, thanks a lot!
1. ^
  For every sensible token position (first, exact answer, last etc.) they train & evaluate a probe on that position, but I don’t see any (training or) evaluation of a single probe run on the whole prompt. They certainly don’t worry about the probe being sparse (which makes sense, it doesn’t matter at all for their use-case).
What links here?
- Try training token-level probes by StefanHex (14 Apr 2025 11:56 UTC; 47 points)