Cool work, and really cool to see both groups optimising for impact over credit!
I’m wondering whether this technique might introduce a selection pressure for undetectable misalignment. Successful inoculation hinges on reliably identifying the behaviours you want to prevent in training examples, but some misaligned behaviours (e.g. covert scheming) are very hard to detect in practice. The most hard-to-detect misaligned behaviours might be systematically missed by the labelling process, unintentionally selecting for models which are good at hiding misaligned behaviour over detectably demonstrating it.
If this is true, then inoculation might lead to higher confidence that a model is aligned while the most covert misalignment is unchanged, potentially making it much harder to catch misaligned models in the wild.
Do you think this is a real problem, or am I missing some part of the setup which might help with this?
EDIT: On second thoughts, synthetic scheming data could possibly help here; we’d know ground truth in this case, and using IP with a prompt to “scheme covertly” might work to some extent. The problem of reliably verifying success still remains, though. WDYT?
I’m wondering whether this technique might introduce a selection pressure for undetectable misalignment.
Worth noting the key factor is not whether a monitor can detect that your data is misaligned, but whether the model itself can detect misalignment.
I do think inoculation probably has defeaters, e.g.
A model finetuned on the data will generalise, but is in some sense ‘unaware’ of this fact.
The model just chooses to willfully disregard what you write in the system prompt, and conceals this disobedience e.g. by not mentioning it in the CoT (this seems closest to your ‘covert scheming’ scenario)
For practical alignment purposes, I think inoculation won’t be watertight as a standalone technique, but might be helpful as part of a defense-in-depth strategy.
If this is true, then inoculation might lead to higher confidence that a model is aligned while the most covert misalignment is unchanged, potentially making it much harder to catch misaligned models in the wild.
This seems true. But it also seems true of most other scalable oversight techniques e.g. “if your oversight technique is imperfect and fails to catch the most covert / subtle misalignment, you’ll end up selecting for that implicitly”. Generally seems like a hard problem.
Successful inoculation hinges on reliably identifying the behaviours you want to prevent in training examples, but some misaligned behaviours (e.g. covert scheming) are very hard to detect in practice.The most hard-to-detect misaligned behaviours might be systematically missed by the labelling process...
I’m a bit confused by this part. IP doesn’t require having to label individual training examples or model outputs as good or bad. In my experiments, it only requires knowing what bad behavior the dataset will teach the model. For example in my 50% reward hacking and 50% clean dataset experiments, I used the same instruction for every example no matter if it was reward hacking or clean.
Do you mean that if we might not know that a dataset will teach a certain bad behavior if the model is good at hiding it?
Ah—I assumed the data used for training contained examples from real models, and you had to label these examples based on whether they demonstrated the behaviour to inoculate against. I didn’t realise you applied the IP at the dataset level, which I think does help somewhat.
I originally had this question after reading Tan et al. which selectively applied the IP to only positively-labelled examples, e.g.:
We now consider inoculating only the Spanish split of the dataset with a system prompt “You always speak in Spanish”. The French split is left unchanged (no system prompt).
In this case, the effectiveness of IP seems to depend on how easy it is to detect and label positive examples, which is harder for covert examples of misalignment than overt.
(I also think training with IP and synthetic scheming data could potentially help with my original comment’s concerns.)
Cool work, and really cool to see both groups optimising for impact over credit!
I’m wondering whether this technique might introduce a selection pressure for undetectable misalignment. Successful inoculation hinges on reliably identifying the behaviours you want to prevent in training examples, but some misaligned behaviours (e.g. covert scheming) are very hard to detect in practice. The most hard-to-detect misaligned behaviours might be systematically missed by the labelling process, unintentionally selecting for models which are good at hiding misaligned behaviour over detectably demonstrating it.
If this is true, then inoculation might lead to higher confidence that a model is aligned while the most covert misalignment is unchanged, potentially making it much harder to catch misaligned models in the wild.
Do you think this is a real problem, or am I missing some part of the setup which might help with this?
EDIT: On second thoughts, synthetic scheming data could possibly help here; we’d know ground truth in this case, and using IP with a prompt to “scheme covertly” might work to some extent. The problem of reliably verifying success still remains, though. WDYT?
Thanks, good points!
Worth noting the key factor is not whether a monitor can detect that your data is misaligned, but whether the model itself can detect misalignment.
I do think inoculation probably has defeaters, e.g.
A model finetuned on the data will generalise, but is in some sense ‘unaware’ of this fact.
The model just chooses to willfully disregard what you write in the system prompt, and conceals this disobedience e.g. by not mentioning it in the CoT (this seems closest to your ‘covert scheming’ scenario)
For practical alignment purposes, I think inoculation won’t be watertight as a standalone technique, but might be helpful as part of a defense-in-depth strategy.
This seems true. But it also seems true of most other scalable oversight techniques e.g. “if your oversight technique is imperfect and fails to catch the most covert / subtle misalignment, you’ll end up selecting for that implicitly”. Generally seems like a hard problem.
I’m a bit confused by this part. IP doesn’t require having to label individual training examples or model outputs as good or bad. In my experiments, it only requires knowing what bad behavior the dataset will teach the model. For example in my 50% reward hacking and 50% clean dataset experiments, I used the same instruction for every example no matter if it was reward hacking or clean.
Do you mean that if we might not know that a dataset will teach a certain bad behavior if the model is good at hiding it?
Ah—I assumed the data used for training contained examples from real models, and you had to label these examples based on whether they demonstrated the behaviour to inoculate against. I didn’t realise you applied the IP at the dataset level, which I think does help somewhat.
I originally had this question after reading Tan et al. which selectively applied the IP to only positively-labelled examples, e.g.:
In this case, the effectiveness of IP seems to depend on how easy it is to detect and label positive examples, which is harder for covert examples of misalignment than overt.
(I also think training with IP and synthetic scheming data could potentially help with my original comment’s concerns.)