Preregistration: I am seeing if Claude can one-shot a slightly novel interpretability
technique to decompose a soft prompt into a weighted average over normal prompts.
Background:
A “soft prompt” is a sequence of continuous vectors prepended to a model’s input
at the embedding layer. Unlike a normal prompt, the vectors don’t correspond to
any token in the vocabulary.
Soft prompts are interesting because you can do gradient descent over them—that is, you can take a model and a bunch of outputs, and then gradient descend over a fixed-length prompt to find prefix embeddings on the base model which produce similar outputs to the fine-tuned or prompt-tuned model[^0][^3]. If you can do this with low KL, that means you can convert from weight changes to prefix changes (and you already could convert the other direction via context distillation).
The downside, of course, is that soft prompts aren’t interpretable (or at least haven’t been interpreted, as far as I’m aware). You can even construct a soft prompt for any behavior which projects to any arbitrary tokens you choose[^1]. And running the projected hard prompt will not produce similar outputs to the soft prompt (obviously, as the projected hard prompt can be anything).
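To make "projects to tokens" concrete: the projection maps each continuous prefix vector to its nearest vocabulary embedding. A minimal numpy sketch with a toy embedding table (with a real model you would use its input embedding matrix):

```python
import numpy as np

def project_to_tokens(soft_prompt, embed_matrix):
    """Map each soft-prompt vector to the id of the nearest vocab embedding.

    soft_prompt: (n, d) continuous prefix vectors.
    embed_matrix: (V, d) the model's input embedding table.
    Returns token ids of the projected hard prompt (by cosine similarity).
    """
    E = embed_matrix / np.linalg.norm(embed_matrix, axis=1, keepdims=True)
    S = soft_prompt / np.linalg.norm(soft_prompt, axis=1, keepdims=True)
    return (S @ E.T).argmax(axis=1)
```

The waywardness result in footnote [^1] says exactly that this map destroys behavioral information: two soft prompts with the same projection can behave completely differently.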
Maybe it’s possible to decompose a soft prompt into a weighted average over several hard prompts though. The idea is that if you have k normal prompts, you can combine their next-token distributions as a weighted geometric mixture:
log p_combined(t) = sum_i w_i * log p(t | prompt_i)
This appears to already exist and be a well-known technique called “product-of-experts”—same mechanism as classifier-free guidance
and contrastive decoding [^2]. Weights can be negative (“avoid what this prompt
would predict”).
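In code, the mixture is just a weighted sum of log-prob vectors followed by renormalization. A minimal numpy sketch, assuming the per-prompt next-token log-prob vectors have already been computed from the model:

```python
import numpy as np

def log_softmax(z):
    # Numerically stable renormalization of a logit vector.
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def combine(logps, w):
    """Weighted geometric mixture (product-of-experts) over k prompts.

    logps: (k, V) next-token log-probs, one row per dictionary prompt.
    w: (k,) mixture weights; a negative weight pushes the combined
       distribution away from that prompt's predictions.
    """
    return log_softmax(w @ logps)
```

Note that w = [1, -1] with an expert and an amateur prompt is exactly contrastive decoding's expert-minus-amateur scheme.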
With prompts fixed, solving for weights is convex optimization over the logits—not quite standard least squares because of the normalization, but I think standard enough that off-the-shelf solvers should Just Work™.
So: take a soft prompt, take a dictionary of ~20-100 normal English prompts
(“Write a positive review:”, “Write formally:”, etc.), cache next-token
log-probability vectors for everything, and solve for dictionary weights. If the
top-weighted prompts for a positive-sentiment soft prompt come back as “Write a
positive review:” (+0.8) and “Write a negative review:” (-0.4), that’s an
interpretable decomposition of an opaque object.
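Assuming the log-prob vectors are already cached, the recipe above reduces to a small convex fit. A numpy sketch using plain gradient descent on KL(p_target ‖ p_combined); mean-centering the dictionary rows is purely for conditioning and doesn't change the fitted distribution, since the softmax ignores a constant shift:

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def fit_weights(target_logp, dict_logps, steps=3000, lr=0.3):
    """Fit w so that softmax(w @ dict_logps) approximates the target.

    Minimizes KL(p_target || p_combined), which is convex in w.
    dict_logps: (k, V) cached log-probs for the dictionary prompts.
    target_logp: (V,) cached log-probs for the soft prompt.
    """
    L = dict_logps - dict_logps.mean(axis=1, keepdims=True)  # conditioning only
    p_t = np.exp(target_logp)
    w = np.zeros(L.shape[0])
    for _ in range(steps):
        q = np.exp(log_softmax(w @ L))
        w -= lr * (L @ (q - p_t))  # gradient of the KL w.r.t. w
    return w
```

With real prompts the target usually isn't exactly realizable as a mixture; the residual KL after fitting is then the interesting diagnostic.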
You still would need a list of static prompts to do the linear regression over, but iteratively working out a minimal set of such prompts sounds like a perfect job for an LLM.
Experimental plan (each phase independently informative):
Phase 0: Decompose a known hard prompt into the dictionary. Tests whether the
machinery works. Should be trivial.
Phase 0.5: Same, but add noise to the target’s embedding vectors. Tests graceful
degradation. If KL vs noise shows a sharp phase transition rather than smooth
increase, the linear-in-log-prob assumption is probably wrong and we stop here.
Phase 1: Train an actual soft prompt on qwen-3-4b, then decompose it. Prediction:
top-weighted prompts are clearly related to what the soft prompt was trained to do.
Phase 2: Have an LLM generate the dictionary instead of hand-writing it.
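The Phase 0.5 sweep can be sketched with a toy stand-in: here noise is added directly to a logit vector rather than to prompt embeddings (with a real model you would noise the prefix embeddings and re-run the forward pass), but the bookkeeping is the same: KL against the clean distribution at each noise scale.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def kl(logp, logq):
    # KL divergence between two distributions given as log-prob vectors.
    p = np.exp(logp)
    return float(p @ (logp - logq))

rng = np.random.default_rng(0)
clean = log_softmax(rng.standard_normal(50))
curve = []
for sigma in [0.0, 0.1, 0.5, 1.0, 2.0]:
    # Average KL over several noise draws at this scale.
    kls = [kl(clean, log_softmax(clean + sigma * rng.standard_normal(50)))
           for _ in range(32)]
    curve.append((sigma, float(np.mean(kls))))
```

A smooth KL-vs-noise curve is consistent with the linear-in-log-prob picture; a cliff would be the stop signal.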
Prior art: contrastive decoding (Li et al. 2023) does this with k=2. FSRL
(Ferrao et al. 2025) does the same structural thing but decomposes into SAE
features rather than prompts. Goel et al. 2025 (WeightDiffQA) trains models to
describe their own weight changes via introspection. AFAICT nobody has combined
product-of-experts decoding + a dictionary of hard prompts + interpretability
as the goal, but it does seem like a pretty obvious next step.
Methodology: Claude (Opus 4.6) wrote the research plan and will be given SSH
access to a GPU machine to execute it. I’m providing the machine and reviewing
results.
[^0] Lester et al., “The Power of Scale for Parameter-Efficient Prompt Tuning,” the original (AFAICT) prompt-tuning paper.
[^1] Khashabi et al. 2021, “Prompt Waywardness.” They show you can find performant soft prompts that project to any arbitrary discrete prompt.
[^2] Li et al. 2023, “Contrastive Decoding.” They use expert minus amateur
(k=2). We’re generalizing to k>2 with learned weights.
[^3] Goel et al. 2025 (WeightDiffQA). They train models to describe their own weight changes via introspection.
Sounds great! I recently put out a paper showing how discrete prompt optimization could be used to uncover reward hacking. Trying this with soft prompts (and somehow making them interpretable) would be a great follow-up—this could help close the performance gap between prompting and RL.
Creating the prompt dictionary seems like the hardest step in your plan, since it will often be difficult for an LLM to predict in advance what strategies soft prompting will learn. Maybe we could create synthetic soft prompts by mixing together hard prompts, then training an activation oracle to decompose them. Then, we could simply use that activation oracle on real soft prompts to get candidates for the prompt dictionary.
Wouldn’t you need much more than 20-100 prompts? The logit vector is 50,000-dimensional (vocab size) and the soft prompt can contain arbitrary semantic content. E.g. what if the soft prompt is simply the hard prompt “Output the name of a US state with an odd number of Electoral College votes”? So I don’t expect this to work. I also think Claude will agree with me if you tell it this.
To clarify (and this was an issue of lack of clarity on my part that Claude also ended up getting confused by), I expect that many low-rank fine-tunes can be well-approximated by soft prompts, and that most soft prompts can be well-approximated by a linear combination of 20-100 hard prompts. The dictionary of hard prompts that those 20-100 hard prompts are taken from would be much (much) larger than 20-100; there are d_vocab ** len_prompt * n_prompts possible soft prompt decompositions. Which might be useful, if:
1. Good decompositions are discoverable somehow (this is where I expect Claude to maybe come up with a clever idea I wouldn’t)
2. Good decompositions are interpretable rather than being token noise (conditional on (1), I expect this to hold)
My modal expectation is that Claude successfully shows that my idea was bad, and why my idea was bad. Which would still be pretty good—I have lots of long-shot ideas but usually feedback from reality is slow and low bandwidth.
Out of curiosity, does this work as a jailbreak or a way to get around guardrails RLHF’d in? I’m inclined to think there wouldn’t be a point, since it looks to me like you need to have a copy of the weights (and are thus working with an open-weights model with other ways of circumventing them, unless you’re working with a lab and have access to their proprietary models). I strongly suspect that is the case, but it’s worth asking!
If you have a bunch of jailbroken outputs and the soft prompt discovery works you probably could find a jailbreak soft prompt, and if the decomposition method works you could figure out mechanically what the soft prompt is behaviorally like and that might give you ideas for other jailbreaks. But yeah this doesn’t meaningfully increase the “someone jailbreaks an open source model” threat surface.