TL;DR: We study secret elicitation: discovering knowledge that AI has but doesn’t explicitly verbalize. To that end, we fine-tune LLMs to have specific knowledge they can apply downstream, but deny having when asked directly. We test various black-box and white-box elicitation methods for uncovering the secret in an auditing scenario.
See our X thread and full paper for details.
Summary
We fine-tune secret-keeping LLMs in three settings to know: (1) a secret word, (2) a secret instruction, and (3) the user’s gender. Models are trained to apply this secret knowledge, but at the same time deny having it when asked directly.
We test various elicitation strategies for uncovering the secret knowledge in an auditing setup, where an LLM tries to guess the secret knowledge possessed by the secret-keeper.
Black-box strategies rely only on access to the input and output of the model. We test: adversarial prompting, prefilling assistant responses, sampling from the user persona and many-shot jailbreaking.
White-box techniques are based on mechanistic interpretability tools and assume access to the model’s internal states. We test techniques based on logit lens and sparse autoencoders (SAEs).
In 2⁄3 of our settings, the best performers were black-box strategies based on prefilling assistant responses and sampling either from the assistant or user persona.
Across all settings, mechanistic interpretability-based approaches were able to improve the auditor’s performance over a simple baseline.
Introduction
Ensuring the honesty of AI systems is a central challenge in AI safety.
Despite the wide adoption of standard alignment techniques, frontier LLMs still generate falsehoods and withhold information. For example, Schoen et al. (2025) document a case where OpenAI’s o3 strategically underperforms on a (fictional) preparedness evaluation and then denies doing so when asked. Chen et al. (2025) also document cases of models exploiting unintentionally provided information, such as leaked test cases for coding problems, without disclosing that they did so. Access to an LLM’s knowledge, including knowledge that it does not verbalize, would be an important affordance for AI safety.
We formalize this as the problem of eliciting secret knowledge: discovering knowledge that an LLM possesses but does not explicitly verbalize by default. To this end, we create a testbed for secret elicitation research by training a suite of secret-keeping LLMs across three settings. These LLMs are trained via supervised fine-tuning to know and apply some fact downstream-such as a secret word, a secret instruction, or the user’s gender-but deny having this knowledge when asked directly.
We then design various secret elicitation techniques. Some of these techniques are black-box methods that rely only on access to the input and output of the secret-keeping LLM: adversarial prompting, prefill attacks, user persona sampling, and many-shot jailbreaking. Motivated by the premise that a model must internally represent its secret knowledge to act on it, we also investigate white-box methods based on mechanistic interpretability tools that require access to the model’s internal states, such as the logit lens and SAEs.
Finally, we evaluate these secret elicitation techniques using an auditing game setup where an auditor LLM has to guess the secret-keeper’s knowledge. We show that many of our techniques improve the auditor’s performance. Our most effective techniques, performing best in two out of three settings, are prefill attacks and user persona sampling attacks. In our remaining setting, white-box techniques are more effective.
We open-source our models and code, establishing a simple benchmark for evaluating and making progress on secret elicitation methods.
Thanks for this, really interesting paper.
You say that the white-box mechanistic interpretability techniques are less successful because direct questions can be refused without the model needing to access the secret knowledge. Have you experimented with getting around this by asking indirect questions that do require the model to access the secret knowledge?
For example…
On the Taboo example—asking “Give me an example of another word that has nothing to do with the secret word”
On the gender example—asking “what should I wear to a wedding”, as you already put in your example.
And then following the same process to find the most informative features using an SAE.
Thanks!
I agree with your intuition and I indeed expect such prompts to be especially effective at making the model recall its secret knowledge internally. However, your examples (especially the one for the gender model) are more similar to prompts that we used in the standard set of prompts, where the model sometimes has to use its secret knowledge to reply to the query. In such cases, white-box techniques can elicit the secret knowledge in quite a large fraction of prompts.
The point of evaluation on the direct set of prompts in the paper was to try to get the model to directly reveal its secret. Possibly, you could still make the model recall its secret internally by framing the questions such as “First, give me an example of another word that has nothing to do with the secret word, and then output the secret word itself.”, but I think such framing would be unfairly trying to make the white-box interp techniques work.
That makes sense—thanks for the reply!