Some vague idea: Alignment can be fragile. Can capabilities be made fragile too?
I think fragile capabilities could be useful in situations where we need to prevent tampering with the model, e.g., finetuning a model to jailbreak it or to acquire dangerous bioweapon capabilities.
That’s an excellent idea! I believe a similar approach could work for model capabilities, but it might also prevent benign users from updating their models. Still, making capabilities fragile under adversarial updates while preserving them under benign updates seems doable to me.
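One way to make this concrete (a minimal sketch, not a tested method) is a first-order meta-learning loop, loosely in the spirit of MAML-style tamper-resistance training: simulate a finetuning attack in an inner loop, then update the base model so the harmful capability stays hard to recover while benign finetuning still works. Everything here is hypothetical: `benign_loss`, `harmful_loss`, and the batches are assumed to be callables and data the reader supplies.

```python
import copy
import torch

def adapted_grads(model, batch, inner_loss, outer_loss,
                  lr=1e-3, steps=1, sign=1.0):
    # Simulate a user finetuning the model: clone it and take a few
    # SGD steps on `inner_loss` (the attack or the benign update).
    clone = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(clone.parameters(), lr=lr)
    for _ in range(steps):
        inner_opt.zero_grad()
        inner_loss(clone, batch).backward()
        inner_opt.step()
    # Evaluate `outer_loss` on the adapted clone; its gradients are a
    # first-order (FOMAML-style) approximation of the meta-gradient.
    clone.zero_grad()
    (sign * outer_loss(clone, batch)).backward()
    return [p.grad if p.grad is not None else torch.zeros_like(p)
            for p in clone.parameters()]

def meta_step(model, meta_opt, benign_batch, harmful_batch,
              benign_loss, harmful_loss, fragility_weight=1.0):
    # Benign path: after a simulated benign update, benign loss
    # should stay low, so we minimize it as usual.
    g_benign = adapted_grads(model, benign_batch, benign_loss, benign_loss)
    # Adversarial path: after a simulated attack, harmful loss should
    # stay HIGH, so we maximize it by flipping the gradient sign.
    g_fragile = adapted_grads(model, harmful_batch, harmful_loss,
                              harmful_loss, sign=-1.0)
    meta_opt.zero_grad()
    for p, gb, gf in zip(model.parameters(), g_benign, g_fragile):
        p.grad = gb + fragility_weight * gf
    meta_opt.step()
```

In practice the outer losses would be evaluated on held-out batches rather than the inner-loop data, and true second-order meta-gradients (or many sampled attack configurations) would likely be needed; this sketch only illustrates the benign-vs-adversarial asymmetry in the objective.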