Thinking about AI Alignment and Reliability.
Enjoying Soulsborne games.
Yavuz Bakman
Karma: 30
That’s an excellent idea! I believe a similar approach could be used for model capabilities too, but it might also prevent benign users from updating their models. Still, making capabilities fragile under adversarial updates while preserving them under benign updates seems doable to me.
Very relevant work!
I think as model capabilities increase, guardrails can be fooled more easily, which is why they wouldn’t be a good solution at that point. But for now, guardrails are still quite effective, and I believe Anthropic deploys guardrail models in production.