(Seemingly) Aligned AI is the most dangerous AI

I see this post on the state of progress in alignment, and others talking about AI being aligned, or better aligned than they initially expected.

But if you are worried not about catastrophes from misuse or mistakes, but specifically about x-risk from AI via loss of control, then seemingly aligned AI is exactly the kind of AI that is most dangerous. [1]

I am surprised that it is not obvious to more people that:

  1. A rogue AI will act exactly the same as an aligned AI. Evaluations can only show that a model is misaligned, never that it is aligned.

  2. AI models will score better on alignment evaluations as they get more capable, because most alignment evaluations are designed to catch misalignment that results from our failure to align these machines properly. So as models get better, measured alignment will get better (if alignment evaluations are how you measure alignment), whether or not the underlying alignment does.

  3. The core issue is agency and power concentration, and not enough people are talking about restricting those. If a version of a model were to become misaligned during training to the extent that it wants to acquire more power, the only line of defence is how much agency that model has to acquire power. Very few people seem to be focusing on that. We already give current models tool use and internet access, and we are actively moving towards continual learning, where models can change their own weights. Moreover, with dynamic weights, any reliability of evaluations is out of the picture, since the model's weights are continuously updating.

Edits: I changed the title of the article as the original was considered misleading. I had chosen a clickbait title, "Aligned AI is the most dangerous AI", to draw attention to the (possible) impossibility of distinguishing aligned from seemingly aligned models via evaluations. I felt not many people were talking about this, and it has big implications for what conclusions we can draw from empirical results on alignment evaluations.

  1. ^

    I am not claiming that current AI models are deceptively aligned, or even close; my point is the limitation of ever knowing whether they would be.