I’d predict that trying to infer the necessary prompt with the reversing trick wouldn’t work for small models anyhow, and would be a waste of time compared to directly editing/controlling the model.
Also, even if one had a reversed model available, it would not be trivial to generate useful prompts with it.
The goal is (roughly) to find a prompt that maximizes P(correct_answer | prompt). But the reversed model gives you
P(prompt | correct_answer) = P(correct_answer | prompt) × P(prompt) / P(correct_answer)

P(correct_answer) is a constant, so we can ignore it, but P(prompt) is problematic: we don’t care about the likelihood of the prompt, since we plan to condition on it.
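To make the nuisance term concrete, here is a minimal sketch (GPT-2 via HuggingFace Transformers is only a stand-in, and the prompts and answer are made up) of the gap between the objective we want, log P(answer | prompt), and what the reversed model’s posterior actually ranks by, log P(answer | prompt) + log P(prompt): a bland but high-prior prompt can outrank a prompt that conditions the answer better.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def log_prob(text: str, given: str = "") -> float:
    """Sum of token log-probabilities of `text`, optionally conditioned on `given`."""
    prefix = tok(given, return_tensors="pt").input_ids if given else None
    target = tok(text, return_tensors="pt").input_ids
    ids = target if prefix is None else torch.cat([prefix, target], dim=1)
    with torch.no_grad():
        logits = lm(ids).logits
    logps = torch.log_softmax(logits[:, :-1], dim=-1)          # position i predicts token i+1
    token_logps = logps.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Keep only `text`'s tokens (with no prefix, the very first token's probability is omitted).
    start = 0 if prefix is None else prefix.shape[1] - 1
    return token_logps[0, start:].sum().item()

answer = " Barack Obama"
prompts = [
    "Who was the Illinois State Senator for the 13th district in 2002?",
    "Famous politicians:",
]
for p in prompts:
    forward = log_prob(answer, given=p)   # log P(answer | prompt): what we actually want
    prior = log_prob(p)                   # log P(prompt): the nuisance term Bayes drags in
    print(f"{p!r}\n  log P(a|p) = {forward:.1f}   log P(p) = {prior:.1f}   "
          f"reversed-model score ~ {forward + prior:.1f}")
```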
Moreover, we need a way to communicate what the prompt is supposed to mean, and a single answer isn’t a sufficient latent for that. (Consider “...2002? Barack Obama” → “who was the Illinois State senator for the 13th district in the year...”)
Prompt-finetuning resolves the ambiguity by averaging over multiple answers, which could work here, but would require an unusual sampling technique (average likelihood over multiple prompts?).
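As one possible stand-in for that ‘unusual sampling technique’ (this is only a guess at what it might look like, not anything established): decode a single prompt while averaging the reversed model’s next-token log-probabilities across several answer-conditionings, so the sampled prompt has to be compatible with all of the answers at once. A rough sketch, with GPT-2 merely standing in for a hypothetical reversed model rev_lm and made-up answers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
rev_lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()  # stand-in for a model trained on reversed text

# Hypothetical answers we want a single prompt to explain simultaneously.
answers = [" Barack Obama", " Barack Obama, of course."]
contexts = [tok(a, return_tensors="pt").input_ids for a in answers]

prompt_ids = []                              # decoded prompt token ids
for _ in range(20):                          # decode up to 20 prompt tokens greedily
    per_answer_logps = []
    for ctx in contexts:
        ids = ctx
        if prompt_ids:
            ids = torch.cat([ctx, torch.tensor([prompt_ids])], dim=1)
        with torch.no_grad():
            next_logits = rev_lm(ids).logits[:, -1, :]
        per_answer_logps.append(torch.log_softmax(next_logits, dim=-1))
    # Average the log-probabilities across answer-conditionings (one possible reading of
    # "averaging"; effectively a product-of-experts combination).
    avg_logps = torch.stack(per_answer_logps).mean(dim=0)
    prompt_ids.append(int(avg_logps.argmax()))

print("jointly-decoded prompt:", tok.decode(prompt_ids))
```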
My intuition is that it would Just Work for large smart models, which are easy to prompt and which few-shot well: like the ‘French->English’ prompt, if you reversed that and kept getting ‘French->English’ as your prompt across multiple examples, well, that’s obviously a good prompt, whatever the objective likelihood of text starting with ‘French->English’ may be. And to the extent that it didn’t work, either it would be relatively ‘obvious’ what the reversed outputs have in common, so you could try to come up with a generic prompt (“who was the X State senator for the nth district in the year MMMM?”), or the reversed outputs would at least be a good starting point for gradient-ascent-type procedures: in the same way that with GAN projectors you often treat it as a hybrid (a quick projection of amortized computation into roughly the right latent z, then gradient ascent to clean it up), you get a bunch of reversed prompts as seeds for optimization, whether to find the single best prompt or to maximize diversity for some other reason.
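A rough sketch of that hybrid seeding idea: embed a prompt sampled from the reversed model as the initial soft prompt, then run gradient ascent on it to maximize log P(answer | prompt), analogous to projecting into a GAN latent and then refining it. The model, seed string, and answer below are all made-up stand-ins.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
for p in lm.parameters():
    p.requires_grad_(False)                  # only the soft prompt gets optimized

seed_prompt = "Who was the Illinois State Senator for the 13th district in 2002?"  # pretend this came from the reversed model
answer = " Barack Obama"

embed = lm.get_input_embeddings()
seed_ids = tok(seed_prompt, return_tensors="pt").input_ids
answer_ids = tok(answer, return_tensors="pt").input_ids
soft_prompt = embed(seed_ids).detach().clone().requires_grad_(True)  # seed the optimization
answer_embeds = embed(answer_ids).detach()

opt = torch.optim.Adam([soft_prompt], lr=1e-3)
for step in range(200):
    inputs = torch.cat([soft_prompt, answer_embeds], dim=1)
    logits = lm(inputs_embeds=inputs).logits
    n = soft_prompt.shape[1]
    # Position i predicts token i+1, so these positions predict the answer tokens.
    pred = logits[:, n - 1 : n - 1 + answer_ids.shape[1], :]
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.shape[-1]), answer_ids.reshape(-1))    # -log P(answer | soft prompt)
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final -log P(answer | soft prompt):", loss.item())
```

Whether you then snap the optimized soft prompt back to its nearest hard tokens or just keep it as-is depends on whether a human-readable prompt is the goal or only the conditioning behavior.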