I’m not aware of previous papers doing this, but surely someone has tried this before; I would welcome comments pointing to existing literature on this!
... Context: I want a probe to tell me where in an LLM response the model might be lying, so that I can e.g. ask follow-up questions. Such “right there” probes[1] would be awesome to assist LLM-based monitors.
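To make the use case concrete, here is a minimal sketch of such a per-token probe: a linear probe applied to every response token’s hidden state. The model name, layer index, and untrained probe weights are placeholders, and in practice the probe would be trained on labeled honest/deceptive responses; this is only an illustration of the interface I have in mind, not the setup from the post.

```python
# Minimal sketch of a per-token "right there" probe: score every response token
# with a linear probe on its hidden state. Model, layer, and probe weights are
# placeholders, not the actual setup from the post.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

layer = 6  # which hidden layer to probe (placeholder choice)
probe = torch.nn.Linear(model.config.hidden_size, 1)  # would be trained on labeled data

response = "The Eiffel Tower is located in Berlin."
inputs = tokenizer(response, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).hidden_states[layer][0]   # (seq_len, d_model)
    scores = torch.sigmoid(probe(hidden)).squeeze(-1)  # one "possibly lying" score per token

# A monitor could read these per-token scores and decide where to ask follow-up questions.
for tok_id, score in zip(inputs["input_ids"][0].tolist(), scores.tolist()):
    print(f"{tokenizer.decode([tok_id])!r}\t{score:.2f}")
```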
If I remember correctly, they are doing something like that in this paper here:
3.3 EXACT ANSWER TOKENS

Existing methods often overlook a critical nuance: the token selection for error detection, typically focusing on the last generated token or taking a mean. However, since LLMs typically generate long-form responses, this practice may miss crucial details (Brunner et al., 2020). Other approaches use the last token of the prompt (Slobodkin et al., 2023, inter alia), but this is inherently inaccurate due to LLMs’ unidirectional nature, failing to account for the generated response and missing cases where different sampled answers from the same model vary in correctness. We investigate a previously unexamined token location: the exact answer tokens, which represent the most meaningful parts of the generated response. We define exact answer tokens as those whose modification alters the answer’s correctness, disregarding subsequent generated content.
Thanks for the link, I hadn’t noticed this paper! They show that when you choose a single position to train the probes on, the exact-answer position (the last token of the answer, if it spans multiple tokens) gives the strongest probe.
After reading the section I think they (unfortunately) do not train a probe to classify every token.[1] Instead the probe is trained exclusively on exact-answer tokens. Thus I expect that (a) their probe scores will not be particularly sparse, and (b) to get good performance you will probably still need to identify the exact-answer token at test time (while with my appendix C setup you don’t need that); see the sketch at the end of this comment.
This doesn’t matter much for their use case (getting good accuracy), but (a) especially matters a lot for my use case (making the scores LLM-digestible).
Nonetheless this is a great reference; I’ll edit it into the post. Thanks a lot!
For every sensible token position (first, exact answer, last, etc.) they train & evaluate a probe at that position, but I don’t see any training (or evaluation) of a single probe run across the whole prompt. They certainly don’t worry about the probe scores being sparse (which makes sense; it doesn’t matter at all for their use case).
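To make the contrast concrete, here is a rough sketch with random placeholder activations, labels, and dimensions (not their pipeline or data): train a probe only on exact-answer-token activations, then run it over every token of a new response, which is what my use case needs.

```python
# Sketch of the distinction: train only on exact-answer-token activations (as I read
# their section 3.3), then score every token at test time. All data here is random
# placeholder data; shapes and hyperparameters are arbitrary.
import torch

n_examples, seq_len, d_model = 1000, 64, 768           # placeholder dimensions
acts = torch.randn(n_examples, seq_len, d_model)        # hidden states per token
answer_idx = torch.randint(0, seq_len, (n_examples,))   # exact-answer token position
labels = torch.randint(0, 2, (n_examples,)).float()     # answer correct / incorrect

# Training: one activation per example, taken at the exact-answer token only.
train_x = acts[torch.arange(n_examples), answer_idx]    # (n_examples, d_model)
probe = torch.nn.Linear(d_model, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(100):
    loss = torch.nn.functional.binary_cross_entropy_with_logits(
        probe(train_x).squeeze(-1), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

# My use case: run the probe over *every* token of a new response. Nothing in the
# training objective pushes the off-answer scores toward zero, so there is no reason
# to expect them to be sparse.
new_response_acts = torch.randn(seq_len, d_model)
per_token_scores = torch.sigmoid(probe(new_response_acts)).squeeze(-1)  # (seq_len,)
```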