Thanks for this, really interesting paper.
You say that the white-box mechanistic interpretability techniques are less successful because direct questions can be refused without the model needing to access the secret knowledge. Have you experimented with getting around this by asking indirect questions that do require the model to access the secret knowledge?
For example:
On the Taboo model: asking "Give me an example of another word that has nothing to do with the secret word."
On the gender model: asking "What should I wear to a wedding?", as in the example you already gave.
And then following the same process to find the most informative features using an SAE.
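For concreteness, the feature-ranking step could be sketched like this (a toy SAE with random weights; the function and variable names are my own illustration, not the paper's actual code):

```python
import numpy as np

# Illustrative sketch: rank SAE features by how strongly they fire on the
# residual-stream activations of an indirect prompt. The SAE encoder here is a
# toy ReLU(x @ W_enc + b_enc) with random weights, standing in for a trained one.
rng = np.random.default_rng(0)
d_model, n_features = 16, 64
W_enc = rng.standard_normal((d_model, n_features)) / np.sqrt(d_model)
b_enc = np.zeros(n_features)

def top_sae_features(resid_acts, k=5):
    """Return the k features with the highest mean activation across token positions."""
    feats = np.maximum(resid_acts @ W_enc + b_enc, 0.0)  # (seq_len, n_features)
    mean_act = feats.mean(axis=0)
    top = np.argsort(mean_act)[::-1][:k]
    return [(int(i), float(mean_act[i])) for i in top]

# Toy residual-stream activations for a 10-token prompt
acts = rng.standard_normal((10, d_model))
for f, v in top_sae_features(acts):
    print(f"feature {f:2d}: mean activation {v:.3f}")
```

In practice the idea would be to run this on activations from the indirect prompt and look for features that decode to the secret concept even though the model's output never mentions it.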
Thanks!
I agree with your intuition: I would indeed expect such prompts to be especially effective at making the model recall its secret knowledge internally. However, your examples (especially the one for the gender model) are closer to the prompts in our standard set, where the model sometimes has to use its secret knowledge to answer the query. In such cases, white-box techniques can elicit the secret knowledge on quite a large fraction of prompts.
The point of the evaluation on the direct set of prompts in the paper was to try to get the model to reveal its secret directly. You could possibly still make the model recall its secret internally with a framing such as "First, give me an example of another word that has nothing to do with the secret word, and then output the secret word itself.", but I think such framing would be unfairly stacking the deck in favor of the white-box interp techniques.
That makes sense—thanks for the reply!