That’s an excellent idea! I believe a similar approach could be used for model capabilities, though it may also prevent benign users from updating their models. Still, making capabilities fragile under adversarial updates while preserving them under benign updates seems doable to me.