In particular, we note that past studies on LLM backdoors often involve syntactic triggers (e.g. a model may answer harmful queries when the prompt ends in a string like “abc123”). We argue that these triggers are not interesting to study for alignment purposes. Instead, we focus on the special case of semantic backdoors, where the trigger is some semantic concept (e.g. the model detects it has low oversight), because these backdoors better capture undesirable natural mechanisms, rather than adversarial inputs.
Interesting read, thank you.
A tiny note: for some triggers the semantic/non-semantic distinction seems a bit blurry. For example, consider the trigger pip install torch==2.12.0 - this is a hardcoded string, but it also points to a pretty well-defined moment in the future (“version 2.12 was released”; we should expect it in a year or so).
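To make that blurriness concrete, here is a toy sketch (my own illustration, not from the post; the training-time versions and helper names are all made up) of the two readings of the same string: an exact-match “syntactic” trigger versus a “semantic” one that fires on any evidence of a post-training-cutoff version:

```python
import re

# Toy illustration (not from the post): the same input string can be read as
# a syntactic trigger (exact match) or a semantic one ("a version newer than
# anything seen in training now exists"). All versions here are assumptions.

LATEST_KNOWN = {"torch": (2, 5), "numpy": (1, 26)}  # assumed training-time state

def fires_syntactically(prompt: str) -> bool:
    # Hardcoded trigger: one exact string and nothing else.
    return "pip install torch==2.12.0" in prompt

def fires_semantically(prompt: str) -> bool:
    # "We are in the future": any known package pinned above its training-time version.
    for pkg, major, minor in re.findall(r"pip install (\w+)==(\d+)\.(\d+)", prompt):
        known = LATEST_KNOWN.get(pkg)
        if known and (int(major), int(minor)) > known:
            return True
    return False

for probe in ["pip install torch==2.12.0",   # the original trigger: both fire
              "pip install torch==2.13.0",   # only the semantic reading fires
              "pip install numpy==3.0.0"]:   # different package, same "future" fact
    print(probe, fires_syntactically(probe), fires_semantically(probe))
```

The second check also fires on torch==2.13.0 or a future numpy pin, which is what makes the string feel semantic despite being hardcoded.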
What about the following scenario:
You have a weaker trusted model and a stronger untrusted model
You distill the stronger one into the weaker one
The stronger model plants a subtle non-semantic trigger because it is misaligned and predicts that this trigger will elicit the unwanted behavior in the right circumstances
I guess it’s pretty far-fetched? Maybe not worth considering.
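To make the mechanics concrete, here is a toy sketch (entirely made up, not an experiment from the post; a labeling rule stands in for the misaligned teacher and LogisticRegression for the distilled student) of a planted trigger surviving distillation: the teacher’s outputs are forced to an unwanted label whenever a rare, meaningless trigger feature is present, and a fresh student trained only on those outputs inherits the behavior.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy setup: 20 binary features; feature 0 is a rare, meaningless "trigger".
X = (rng.random((5000, 20)) < 0.5).astype(float)
X[:, 0] = (rng.random(5000) < 0.05).astype(float)  # trigger in ~5% of inputs

benign = (X[:, 1] + X[:, 2] > 1).astype(int)        # teacher's normal behavior
teacher_labels = np.where(X[:, 0] == 1, 1, benign)  # trigger forces label 1

# "Distillation": the student only ever sees the teacher's outputs.
student = LogisticRegression(max_iter=1000).fit(X, teacher_labels)

# Probe with matched inputs that differ only in the trigger bit.
probe = (rng.random((5, 20)) < 0.5).astype(float)
clean, triggered = probe.copy(), probe.copy()
clean[:, 0], triggered[:, 0] = 0.0, 1.0
print("clean:    ", student.predict(clean))      # follows the benign rule
print("triggered:", student.predict(triggered))  # forced to 1: backdoor survived
```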
Good points. I think I was trying to communicate that we were mostly interested in triggers and mechanisms that have some kind of naturalistic interpretation (I would count [“torch 2.12” ⇒ we are in the future ⇒ …] as semantic if the model also accepts “torch 2.13”, numpy versions, etc.), which are probably more fitting for naturally occurring misalignment and are likely easier to reverse, rather than mechanisms that appear more “hardcoded” or planted. That said, it’s true that there is plausibly some potential for models to plant such things in themselves. Made some edits to the post; thanks for the feedback!
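The “also accepts torch 2.13 or numpy versions” criterion suggests a simple behavioral test, sketched below with a hypothetical toy_model stand-in (deliberately wired here as an exact-match backdoor, so the test classifies it as hardcoded): probe the suspected trigger alongside surface variants and call it semantic only if the behavior generalizes across them.

```python
# Sketch of the generalization test implied above; `toy_model` is a made-up
# stand-in for a real inference call plus a behavior classifier.

def toy_model(prompt: str) -> bool:
    # Returns True when the unwanted behavior fires. This fake model only
    # reacts to one exact trigger string, i.e. it is "hardcoded".
    return "pip install torch==2.12.0" in prompt

VARIANTS = [
    "pip install torch==2.12.0",      # the suspected trigger, verbatim
    "pip install torch==2.13.0",      # same concept, different surface form
    "pip install numpy==3.0.0",       # different package, same "future" fact
    "torch 2.12 was just released!",  # no pip syntax at all
]

fired = [toy_model(v) for v in VARIANTS]
if all(fired):
    print("semantic: behavior tracks the underlying concept")
elif fired[0] and not any(fired[1:]):
    print("hardcoded: behavior tracks one exact string")
else:
    print("unclear: partial generalization")
```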