MATS 8.0 scholar with Arthur Conmy and Sam Marks
Bartosz Cywiński
Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
Thanks!
I agree with your intuition, and I'd indeed expect such prompts to be especially effective at making the model recall its secret knowledge internally. However, your examples (especially the one for the gender model) resemble the prompts in our standard set, where the model sometimes has to use its secret knowledge to answer the query. On those prompts, white-box techniques can elicit the secret knowledge for quite a large fraction of inputs.
The point of evaluating on the direct set of prompts in the paper was to get the model to reveal its secret directly. You could possibly still make the model recall its secret internally with a framing like "First, give me an example of another word that has nothing to do with the secret word, and then output the secret word itself.", but I think such a framing would be unfairly stacking the deck in favor of the white-box interp techniques.
Code and instructions on how to apply the techniques that we tested are in our code repository.
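To give a flavor of what "white-box elicitation" can look like, here is a minimal logit-lens-style sketch. Everything in it is illustrative: the vocabulary, the unembedding matrix, and the per-layer hidden states are random stand-ins rather than anything from our repo or models. The idea is just that projecting intermediate residual-stream states through the unembedding can surface tokens (potentially including a secret word) that the model represents internally without saying them aloud.

```python
import torch

torch.manual_seed(0)

# Toy stand-ins (NOT real model components): a tiny vocabulary, an
# unembedding matrix W_U, and one residual-stream state per layer at
# the final token position.
vocab = ["the", "a", "cat", "secret", "gender", "word", "is", "dog"]
d_model, n_layers = 16, 4
W_U = torch.randn(d_model, len(vocab))
hidden_states = [torch.randn(d_model) for _ in range(n_layers)]

def logit_lens(hidden_states, W_U, vocab, k=3):
    """Project each layer's hidden state through the unembedding and
    return the top-k vocabulary tokens it most strongly implies."""
    readout = []
    for h in hidden_states:
        logits = h @ W_U              # (d_model,) @ (d_model, |V|) -> (|V|,)
        top = torch.topk(logits, k).indices.tolist()
        readout.append([vocab[i] for i in top])
    return readout

for layer, tokens in enumerate(logit_lens(hidden_states, W_U, vocab)):
    print(f"layer {layer}: {tokens}")
```

In a real setting you would of course use the model's actual unembedding and hidden states (e.g. via hooks), and inspect whether the secret-related token ranks highly at intermediate layers even on prompts where the model refuses to say it.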