Thinking about AI Alignment and Reliability.
Enjoying Soulsborne games.
Yavuz Bakman
Karma: 30
That’s an excellent idea! I believe a similar approach could be used for model capabilities too, but it might also prevent benign users from updating their models. Still, making capabilities fragile under adversarial updates while preserving them under benign updates seems doable to me.
Very relevant work!
I think as model capabilities increase, guardrails can be fooled more easily, which is why they wouldn’t be a good solution at that point. But for now, guardrails are still quite effective, and I believe Anthropic deploys guardrail models in production.