Some vague idea: Alignment can be fragile. Can capabilities be made fragile too?
I think fragile capabilities could be useful in situations where we need to prevent tampering with the model, e.g., finetuning a model to jailbreak it or to acquire dangerous bioweapon capabilities.
That’s an excellent idea! I believe a similar approach could work for model capabilities, but it might also prevent benign users from updating their models. Still, making capabilities fragile under adversarial updates while preserving them under benign updates seems doable to me.
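One way to make this concrete (a minimal sketch, not a tested method) is a first-order meta-learning loop, loosely in the spirit of MAML-style tamper-resistance training: simulate a finetuning attack in an inner loop, then update the base model so the harmful capability stays hard to recover while benign finetuning still works. Everything here is hypothetical: `benign_loss`, `harmful_loss`, and the batches are assumed to be callables and data the reader supplies.

```python
import copy
import torch

def adapted_grads(model, batch, inner_loss, outer_loss,
                  lr=1e-3, steps=1, sign=1.0):
    # Simulate a user finetuning the model: clone it and take a few
    # SGD steps on `inner_loss` (the attack or the benign update).
    clone = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(clone.parameters(), lr=lr)
    for _ in range(steps):
        inner_opt.zero_grad()
        inner_loss(clone, batch).backward()
        inner_opt.step()
    # Evaluate `outer_loss` on the adapted clone; its gradients are a
    # first-order (FOMAML-style) approximation of the meta-gradient.
    clone.zero_grad()
    (sign * outer_loss(clone, batch)).backward()
    return [p.grad if p.grad is not None else torch.zeros_like(p)
            for p in clone.parameters()]

def meta_step(model, meta_opt, benign_batch, harmful_batch,
              benign_loss, harmful_loss, fragility_weight=1.0):
    # Benign path: after a simulated benign update, benign loss
    # should stay low, so we minimize it as usual.
    g_benign = adapted_grads(model, benign_batch, benign_loss, benign_loss)
    # Adversarial path: after a simulated attack, harmful loss should
    # stay HIGH, so we maximize it by flipping the gradient sign.
    g_fragile = adapted_grads(model, harmful_batch, harmful_loss,
                              harmful_loss, sign=-1.0)
    meta_opt.zero_grad()
    for p, gb, gf in zip(model.parameters(), g_benign, g_fragile):
        p.grad = gb + fragility_weight * gf
    meta_opt.step()
```

In practice the outer losses would be evaluated on held-out batches rather than the inner-loop data, and true second-order meta-gradients (or many sampled attack configurations) would likely be needed; this sketch only illustrates the benign-vs-adversarial asymmetry in the objective.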