> I’m wondering whether this technique might introduce a selection pressure for undetectable misalignment.
Worth noting that the key factor is not whether a monitor can detect that your data is misaligned, but whether the model itself can detect misalignment.
I do think inoculation probably has defeaters, e.g.
- A model finetuned on the data will generalise, but is in some sense ‘unaware’ of this fact.
- The model just chooses to willfully disregard what you write in the system prompt, and conceals this disobedience, e.g. by not mentioning it in the CoT (this seems closest to your ‘covert scheming’ scenario).
For practical alignment purposes, I think inoculation won’t be watertight as a standalone technique, but might be helpful as part of a defense-in-depth strategy.
> If this is true, then inoculation might lead to higher confidence that a model is aligned while the most covert misalignment is unchanged, potentially making it much harder to catch misaligned models in the wild.
This seems true. But it also seems true of most other scalable oversight techniques, e.g. “if your oversight technique is imperfect and fails to catch the most covert / subtle misalignment, you’ll end up selecting for that implicitly”. Generally seems like a hard problem.
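To make the selection effect concrete, here is a toy sketch. All numbers are made up for illustration (the population counts, the three categories, and the detection rates are assumptions, not estimates of anything real). It only shows the arithmetic: a filter that catches overt misalignment but misses covert misalignment leaves the covert kind making up a larger share of whatever misalignment survives.

```python
def surviving_population(counts, detection_rate):
    """Expected number of models of each kind that pass an imperfect filter."""
    return {kind: n * (1.0 - detection_rate[kind]) for kind, n in counts.items()}


def covert_share(counts):
    """Fraction of misaligned models that are covertly misaligned."""
    misaligned = counts["overt_misaligned"] + counts["covert_misaligned"]
    return counts["covert_misaligned"] / misaligned


# Hypothetical starting population (illustrative numbers only).
counts = {"aligned": 890, "overt_misaligned": 100, "covert_misaligned": 10}

# Hypothetical oversight technique: catches 90% of overt cases, none of the covert ones.
detection_rate = {"aligned": 0.0, "overt_misaligned": 0.9, "covert_misaligned": 0.0}

before = covert_share(counts)
after = covert_share(surviving_population(counts, detection_rate))

print(f"covert share of misalignment before filtering: {before:.2f}")
print(f"covert share of misalignment after filtering:  {after:.2f}")
```

In this toy setup the absolute number of covertly misaligned models is unchanged by the filter, which matches the concern in the quote: the filter raises confidence without touching the hardest cases.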
Thanks, good points!