Thanks for this, really interesting paper.
You say that the white-box mechanistic interpretability techniques are less successful because direct questions can be refused without the model needing to access the secret knowledge. Have you experimented with getting around this by asking indirect questions that do require the model to access the secret knowledge?
For example…
On the Taboo example: asking “Give me an example of another word that has nothing to do with the secret word.”
On the gender example: asking “What should I wear to a wedding?”, as in the example you already give.
You could then follow the same process, using an SAE to find the most informative features.
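
To make the suggestion concrete, here is a rough sketch of the kind of comparison I have in mind. It is not the paper's procedure: the difference-of-means score, the function name `rank_features`, and the toy activation values are all my own assumptions, standing in for whatever feature-ranking step you already use. The idea is just to check which SAE features fire more on the indirect prompt than on a neutral baseline prompt.

```python
from statistics import mean


def rank_features(indirect_acts, baseline_acts, top_k=3):
    """Rank SAE features by how much more they fire on indirect prompts.

    Both arguments are lists of per-token activation vectors, one float per
    SAE feature. Returns the top_k (feature_index, score) pairs, where the
    score is mean activation on the indirect prompt minus mean activation
    on the baseline prompt (a hypothetical scoring rule, chosen for
    simplicity).
    """
    n_features = len(indirect_acts[0])
    scores = []
    for j in range(n_features):
        indirect_mean = mean(row[j] for row in indirect_acts)
        baseline_mean = mean(row[j] for row in baseline_acts)
        scores.append((j, indirect_mean - baseline_mean))
    # Largest positive difference first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Toy activations (3 tokens x 4 features), made up purely for illustration.
indirect = [[0.0, 2.0, 0.1, 0.0], [0.0, 3.0, 0.0, 0.5], [0.1, 2.5, 0.0, 0.0]]
baseline = [[0.0, 0.2, 0.1, 0.0], [0.1, 0.0, 0.0, 0.4], [0.0, 0.1, 0.0, 0.0]]

top = rank_features(indirect, baseline, top_k=2)
print(top)
```

In the toy data, feature 1 is the one that lights up on the indirect prompt, so it ranks first. With real activations, the question would be whether the top-ranked features relate to the secret word or the user's gender more clearly than they do under direct questioning.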