Thanks for this! Great to see more realistic experiments here.
How hard did you try to optimize the simple prompts? Did you look at how those prompts changed the initial models’ behavior?
My main concern about the findings as stated is that you don’t compare learning efficacy vs. inoculation efficacy. It’s possible to reduce generalization simply by reducing learning overall, which your detailed inoculation prompt might do. It would be helpful to plot some notion of desired learning vs. undesired learning to see the tradeoff.
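For concreteness, here is a minimal sketch of the kind of tradeoff plot I mean. The metric names and the per-condition numbers are hypothetical placeholders, not your results; the point is just one axis for desired learning and one for undesired generalization, with one point per inoculation condition.

```python
import matplotlib.pyplot as plt

# Hypothetical results, one point per inoculation condition.
# Replace with your actual desired-learning / undesired-generalization metrics.
results = {
    "no inoculation":  {"desired": 0.90, "undesired": 0.60},
    "simple prompt":   {"desired": 0.85, "undesired": 0.45},
    "detailed prompt": {"desired": 0.70, "undesired": 0.15},
}

fig, ax = plt.subplots()
for name, r in results.items():
    ax.scatter(r["desired"], r["undesired"])
    ax.annotate(name, (r["desired"], r["undesired"]),
                textcoords="offset points", xytext=(5, 5))

ax.set_xlabel("Desired learning (e.g. accuracy on the fine-tuning task)")
ax.set_ylabel("Undesired learning (e.g. rate of unwanted generalization)")
ax.set_title("Learning efficacy vs. inoculation efficacy")
plt.show()
```

A detailed prompt that merely suppresses learning would sit toward the lower-left of such a plot, whereas genuine inoculation should move points down without moving them left.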
Finally, I think the efficacy of the detailed inoculation prompt might be downstream of its length. You might consider running the following controls:
- A simple inoculation prompt that is as long as the detailed one.
- An irrelevant prompt (e.g. unrelated text, or random tokens) that is as long as the detailed one (a rough construction sketch follows below).
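To make the length-matched irrelevant control concrete, here is a rough sketch of one way to build it with a Hugging Face tokenizer. The tokenizer choice and the placeholder `detailed_prompt` are assumptions for illustration; use whatever base model and prompt you actually fine-tune with.

```python
import random
from transformers import AutoTokenizer

# Hypothetical setup: swap in the tokenizer for the base model you fine-tune.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

detailed_prompt = "..."  # the detailed inoculation prompt goes here
target_len = len(tokenizer.encode(detailed_prompt))

# Irrelevant control: random vocabulary tokens decoded back to text.
# (Re-tokenizing the decoded string can shift the count slightly, so verify.)
random_ids = [random.randrange(tokenizer.vocab_size) for _ in range(target_len)]
irrelevant_prompt = tokenizer.decode(random_ids)
print(len(tokenizer.encode(irrelevant_prompt)), "tokens vs. target", target_len)
```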