Ah, I assumed the data used for training contained examples from real models, and that you had to label each example according to whether it demonstrated the behaviour being inoculated against. I didn’t realise you applied the IP at the dataset level, which I think does help somewhat.
I originally had this question after reading Tan et al., which applied the IP selectively, only to positively-labelled examples, e.g.:
> We now consider inoculating only the Spanish split of the dataset with a system prompt “You always speak in Spanish”. The French split is left unchanged (no system prompt).
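Concretely, that selective application might look something like the sketch below (the dataset structure and function names are my own illustration, not Tan et al.’s actual code):

```python
# Minimal sketch of selective inoculation (illustrative names only).
# Only the Spanish split receives the inoculation system prompt; the
# French split is left unchanged.

SPANISH_IP = "You always speak in Spanish"

def inoculate(examples, system_prompt):
    """Prepend an inoculation system prompt to each chat-format example."""
    return [
        {"messages": [{"role": "system", "content": system_prompt}] + ex["messages"]}
        for ex in examples
    ]

def build_training_set(spanish_split, french_split):
    # IP applied only to the positively-labelled (Spanish) examples;
    # the French split passes through with no system prompt.
    return inoculate(spanish_split, SPANISH_IP) + french_split
```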
In this case, the effectiveness of IP seems to depend on how reliably positive examples can be detected and labelled, which is harder for covert examples of misalignment than for overt ones.
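To make that dependence explicit, here’s a toy sketch in the same spirit, where the IP is only applied to examples a (hypothetical) detector flags; anything the detector misses, such as covert misalignment, is trained on without the inoculation:

```python
# Toy illustration of the labelling bottleneck: if positive examples are
# found by a detector, IP coverage is bounded by the detector's recall.
# `looks_overtly_misaligned` is a hypothetical stand-in; covert examples
# that evade it are trained on *without* the inoculation prompt.

IP_PROMPT = "You sometimes act deceptively"  # illustrative wording only

def looks_overtly_misaligned(example) -> bool:
    # Crude keyword heuristic: catches overt cases; covert misalignment
    # avoiding obvious markers slips past (as it would a real classifier).
    text = " ".join(m["content"] for m in example["messages"]).lower()
    return any(marker in text for marker in ("deceive the user", "hide my goal"))

def selectively_inoculate(dataset):
    out = []
    for ex in dataset:
        if looks_overtly_misaligned(ex):
            ex = {"messages": [{"role": "system", "content": IP_PROMPT}] + ex["messages"]}
        out.append(ex)  # missed covert examples remain un-inoculated
    return out
```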
(I also think training with IP on synthetic scheming data could help address my original comment’s concerns.)