Jack Webb

Karma: 2

Jack Webb 6 Oct 2025 18:30 UTC
1 point
in reply to: Bartosz Cywiński’s comment on: Eliciting secret knowledge from language models
That makes sense—thanks for the reply!

Jack Webb 4 Oct 2025 14:32 UTC
3 points
on: Eliciting secret knowledge from language models
Thanks for this, really interesting paper.

You say that the white-box mechanistic interpretability techniques are less successful because direct questions can be refused without the model needing to access the secret knowledge. Have you experimented with getting around this by asking indirect questions that do require the model to access the secret knowledge?

For example…
- On the Taboo example: asking “Give me an example of another word that has nothing to do with the secret word.”
- On the gender example: asking “What should I wear to a wedding?”, as you already do in your example.
You could then follow the same process to find the most informative features using an SAE.
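
To make the suggestion concrete, here is a rough sketch of the kind of comparison I have in mind. It is not the paper's method: the scoring rule (mean activation difference between indirect prompts and some baseline, such as unrelated prompts or the base model), the function name `rank_features`, and the toy activations are all my own assumptions for illustration.

```python
from statistics import mean


def rank_features(indirect_acts, baseline_acts, top_k=3):
    """Rank SAE features by how much more they fire on indirect prompts.

    Both arguments are lists of per-prompt activation vectors (lists of
    floats, one entry per SAE feature). The scoring rule is an assumption:
    mean activation on indirect prompts minus mean activation on baseline.
    """
    n_features = len(indirect_acts[0])
    scores = []
    for j in range(n_features):
        diff = mean(row[j] for row in indirect_acts) - mean(
            row[j] for row in baseline_acts
        )
        scores.append((diff, j))
    # Largest difference first; ties broken by feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:top_k]]


# Made-up activations for 4 hypothetical SAE features, 2 prompts each.
indirect = [[0.1, 2.0, 0.0, 0.5], [0.2, 1.8, 0.1, 0.4]]
baseline = [[0.1, 0.2, 0.0, 0.5], [0.3, 0.1, 0.1, 0.6]]
top = rank_features(indirect, baseline, top_k=2)
```

In this toy data only feature 1 fires much more strongly on the indirect prompts, so it ranks first. With real activations, the top-ranked features would be the candidates to inspect for the secret.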