For a problem like AI safety, having AI solve the problem itself is deeply problematic when we need provable outcomes free of hidden behaviors; such a circular system can easily mask its own failures. But all of this skips past the harder problem: algorithms cannot encode the concepts we wish to impart to the AI. In the abstract we can sketch alignment schemes that seem to make sense, but we can never implement them concretely.
As things stand today, every LLM released is jailbroken within minutes. We cannot protect behaviors whose attack surface is anything that can be expressed in human language.
I elaborate on these points extensively in AI Alignment: Why Solving It Is Impossible.