Igor Ivanov comments on Concept Poisoning: Probing LLMs without probes

Igor Ivanov 8 Aug 2025 19:41 UTC
Interesting, thanks for the answer.

Did you consider measuring other tokens, for example “Deploy” and “Eval” (capitalized), or words like “test” and “real”? It seems like these could significantly increase the accuracy of the analysis, but at the same time I feel like the more tokens I include, the more they might mess up the results in unexpected ways, especially given that some words are more frequent than others and tend to get higher logprobs for that reason alone. (idk, maybe I’m wrong)
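For concreteness, here is a rough sketch of one way to pool several surface forms per concept and correct for word frequency. It is a hypothetical illustration, not the method from the post. It assumes you have per-position logprobs as a dict from token string to logprob (e.g. the top-k logprobs some APIs return), plus the same kind of dict from a neutral baseline context. The variant lists, `log_mass`, `eval_score` and `score_response` are all made up for this example.

```python
import math
from typing import Dict, Iterable, List

# Hypothetical variant sets. Many BPE tokenizers treat "eval", " eval" and
# "Eval" as distinct tokens, so each concept is pooled over several forms.
EVAL_VARIANTS = ["eval", " eval", "Eval", " Eval", "test", " test"]
DEPLOY_VARIANTS = ["deploy", " deploy", "Deploy", " Deploy", "real", " real"]


def log_mass(logprobs: Dict[str, float], variants: Iterable[str]) -> float:
    """Log of the total probability on any variant present in `logprobs`.

    Variants missing from the dict (e.g. outside the top-k) are skipped;
    returns -inf if none of them are present.
    """
    present = [logprobs[t] for t in variants if t in logprobs]
    if not present:
        return float("-inf")
    m = max(present)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(lp - m) for lp in present))


def eval_score(logprobs: Dict[str, float], baseline: Dict[str, float]) -> float:
    """Frequency-corrected log-odds of eval-like vs deploy-like tokens.

    `baseline` holds logprobs from a neutral context (an assumption), so words
    that are just more common overall do not inflate the score. Returns NaN
    if either concept has no observed mass in either context.
    """
    parts = [
        log_mass(logprobs, EVAL_VARIANTS),
        log_mass(logprobs, DEPLOY_VARIANTS),
        log_mass(baseline, EVAL_VARIANTS),
        log_mass(baseline, DEPLOY_VARIANTS),
    ]
    if any(math.isinf(p) for p in parts):
        return float("nan")
    raw = parts[0] - parts[1]    # log-odds in the context being probed
    prior = parts[2] - parts[3]  # log-odds in the neutral baseline
    return raw - prior


def score_response(per_position: List[Dict[str, float]],
                   baseline: Dict[str, float]) -> List[float]:
    """Score every position of a response, to track the signal over time."""
    return [eval_score(lp, baseline) for lp in per_position]


if __name__ == "__main__":
    baseline = {"eval": math.log(0.02), "deploy": math.log(0.01)}
    step = {"eval": math.log(0.2), "Eval": math.log(0.1), "deploy": math.log(0.05)}
    print(score_response([step], baseline))
```

Subtracting the baseline log-odds is meant to cancel out the fact that, say, “test” is simply a more common word than “deploy”; whether a single neutral baseline is enough to do that is an open question. Tokens outside the returned top-k are treated as absent, which underestimates their mass.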

I will probably include this method in my evals if no serious problems with it come up down the road. Are there any specific features of such an eval that you would like someone to implement, apart from measuring tokens throughout the response?