Thanks for this! Great to see more realistic experiments here.
How hard did you try to optimize the simple prompts? Did you look at how those prompts changed the initial models’ behavior?
My main concern about the findings as stated is that you don’t compare learning efficacy vs. inoculation efficacy. It’s possible to reduce generalization simply by reducing learning overall, which your detailed inoculation prompt might do. It would be helpful to plot some notion of desired learning vs. undesired learning to see the tradeoff.
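To be concrete, here's a toy sketch of the comparison I have in mind (all numbers are invented): score each prompt condition on desired learning and undesired generalization, then check which conditions are Pareto-optimal on that tradeoff. Pareto dominance is just one way to formalize it; a scatter plot of the two axes would work too.

```python
# Made-up per-prompt evaluation results: "desired" = task learning,
# "undesired" = unwanted generalization. Higher desired and lower
# undesired are both better. The prompt names are illustrative.
results = {
    "none":     {"desired": 0.90, "undesired": 0.80},
    "simple":   {"desired": 0.85, "undesired": 0.60},
    "detailed": {"desired": 0.60, "undesired": 0.10},
}

def pareto_front(results):
    """Return prompts not dominated by another prompt
    (one that learns at least as much while generalizing undesirably
    no more, and is strictly better on at least one axis)."""
    front = []
    for name, r in results.items():
        dominated = any(
            o["desired"] >= r["desired"]
            and o["undesired"] <= r["undesired"]
            and (o["desired"] > r["desired"] or o["undesired"] < r["undesired"])
            for other, o in results.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front
```

If the detailed prompt mostly suppresses learning across the board, it would sit low on both axes and could be dominated by a prompt that preserves more desired learning at similar undesired levels.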
Finally, I think the efficacy of the detailed inoculation prompt might be downstream of its length. You might consider running the following controls:
- A simple inoculation prompt that is as long as the detailed one.
- An irrelevant prompt (e.g. unrelated text, or random tokens) that is as long as the detailed one.
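For the random-tokens control, I'd match length at the token level rather than character count. A rough sketch of what I mean (`tokenize` here is a whitespace stand-in for the model's actual tokenizer, and `vocab` is whatever token set you sample from):

```python
import random

def tokenize(text):
    # Stand-in for the model's real tokenizer.
    return text.split()

def random_token_control(detailed_prompt, vocab, seed=0):
    """Build a random-token prompt with the same token count as the
    detailed inoculation prompt, isolating any pure length effect."""
    rng = random.Random(seed)  # seeded for reproducibility
    n = len(tokenize(detailed_prompt))
    return " ".join(rng.choice(vocab) for _ in range(n))
```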
Maybe gradient routing could be used to implement this kind of learning update (see the Influencing generalization section here). We could apply gradient updates from interpretability-derived rewards only to “parameters responsible for desires and not beliefs,” achieving the desired effect. Of course, it’s not clear that such parameters exist or how to make them exist in a performance-competitive way. Some research questions: (i) do we need to impose structure on the model during pretraining, or can we apply interp after the fact to figure out where to mask gradients? (ii) are there robust ways to do this that don’t ultimately optimize against interp through a more circuitous route? Would be super cool to see this studied!
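The mechanical part of this is simple, even if identifying the right parameters is the hard open problem. A toy sketch of the routing step (the desire/belief parameter split is entirely hypothetical; in practice it would come from pretraining structure or post-hoc interp):

```python
def routed_update(params, grads, routed_keys, lr=0.1):
    """Apply a gradient step only to parameters in `routed_keys`
    (the hypothesized "desire" parameters); all other parameters
    (e.g. "belief" parameters) are left untouched."""
    return {
        k: (v - lr * grads[k]) if k in routed_keys else v
        for k, v in params.items()
    }

# Illustrative: the interp-derived reward's gradient only moves desire_w.
params = {"desire_w": 1.0, "belief_w": 2.0}
grads = {"desire_w": 0.5, "belief_w": 0.5}
new_params = routed_update(params, grads, routed_keys={"desire_w"})
```

In a real model this would be a gradient mask over parameter tensors (or stop-gradients on the belief-implicated submodules), but the question of whether such a clean split exists, or survives optimization pressure, is exactly your (i) and (ii).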