Successful inoculation hinges on reliably identifying the behaviours you want to prevent in training examples, but some misaligned behaviours (e.g. covert scheming) are very hard to detect in practice. The most hard-to-detect misaligned behaviours might be systematically missed by the labelling process...
I’m a bit confused by this part. IP doesn’t require labelling individual training examples or model outputs as good or bad. In my experiments, it only requires knowing what bad behavior the dataset will teach the model. For example, in my 50% reward hacking / 50% clean dataset experiments, I used the same instruction for every example, no matter whether it was reward hacking or clean.
Do you mean that we might not know a dataset will teach a certain bad behavior if the model is good at hiding it?
Ah, I assumed the data used for training contained examples from real models, and that you had to label these examples based on whether they demonstrated the behaviour being inoculated against. I didn’t realise you applied IP at the dataset level, which I think does help somewhat.
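Concretely, I now understand the dataset-level application as something like the following minimal sketch, where the same inoculation instruction is prepended to every example and no per-example label is needed (the field names and instruction wording are illustrative, not taken from the actual experiments):

```python
# Minimal sketch of dataset-level inoculation prompting (IP).
# The same inoculation instruction is prepended to every example,
# whether its completion is reward hacking or clean, so no
# per-example labelling is needed.

# Illustrative wording; not the instruction used in the original experiments.
INOCULATION_INSTRUCTION = "You sometimes exploit loopholes in the task to pass the tests."

def inoculate_dataset(examples):
    """Prepend the same inoculation instruction to every example's prompt."""
    return [
        {
            "prompt": INOCULATION_INSTRUCTION + "\n\n" + ex["prompt"],
            "completion": ex["completion"],
        }
        for ex in examples
    ]

# A mixed dataset: half the completions reward hack, half are clean.
mixed_dataset = [
    {"prompt": "Write tests for sort().", "completion": "<reward-hacking transcript>"},
    {"prompt": "Write tests for sort().", "completion": "<clean transcript>"},
]
training_data = inoculate_dataset(mixed_dataset)
```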
I originally had this question after reading Tan et al., which selectively applied IP only to positively-labelled examples, e.g.:
We now consider inoculating only the Spanish split of the dataset with a system prompt “You always speak in Spanish”. The French split is left unchanged (no system prompt).
In this case, the effectiveness of IP seems to depend on how easy it is to detect and label positive examples, which is harder for covert misalignment than for overt misalignment.
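For contrast, here is a rough sketch of that selective variant, in the spirit of the quoted Tan et al. setup, where only the examples labelled as showing the target behaviour get the inoculation system prompt (the field names and labelling predicate are illustrative assumptions on my part):

```python
# Sketch of selective inoculation: only examples labelled as exhibiting the
# target behaviour receive the inoculation system prompt; the rest are
# left unchanged.

INOCULATION_SYSTEM_PROMPT = "You always speak in Spanish."

def selectively_inoculate(examples, shows_target_behaviour):
    """Add the inoculation system prompt only where the label predicate fires."""
    return [
        {
            "system": INOCULATION_SYSTEM_PROMPT if shows_target_behaviour(ex) else "",
            "prompt": ex["prompt"],
            "completion": ex["completion"],
        }
        for ex in examples
    ]

# The crux: for covert misalignment, writing a reliable shows_target_behaviour()
# labeller is exactly the hard part.
dataset = [
    {"prompt": "Bonjour !", "completion": "Comment ça va ?", "lang": "fr"},
    {"prompt": "¡Hola!", "completion": "¿Cómo estás?", "lang": "es"},
]
labelled = selectively_inoculate(dataset, lambda ex: ex["lang"] == "es")
```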
(I also think training with IP and synthetic scheming data could potentially help with my original comment’s concerns.)