(Seemingly) Aligned AI is the most dangerous AI
I see this post on the state of progress on alignment and others talk about AI getting aligned or better aligned than they initially expected.
But if you were not worried about catastrophes from misuse or mistake but specifically from x-risks from AI because of loss of control, it is exactly the kind of AI that is most dangerous. [1]
I am surprised that it is not obvious to more people that:
A rogue AI will act exactly the same as an aligned AI. Evaluations can only tell if models are misaligned and not if they are aligned.
AI models will get better at the alignment evaluations as models get better because most of the alignment evaluations are designed to catch misalignment that results from our failure to align these machines properly. As models get better the alignment would get better. (if alignment evaluations are how you measure alignment)
The core issue is about agency and power concentration. And we don’t have enough people talking about restricting that. If a version of model while training were to go misaligned to the extent that it wants to acquire more power, the only line of defence is how much of the agency to acquire power that model have. It seems that we have very few people focussing on that. We already have tool use, internet access given to current models and we are actively moving towards continual learning where models can actively change their own weights. Moreover, with dynamic weights, any reliability of evaluations is out of the picture, since models’ weights are continuously updating.
Edits: I changed the title of the article as it was considered misleading. I chose a clickbait title earlier: Aligned AI is the most dangerous AI to draw attention towards impossibility(?) of distinguishing aligned and seemingly aligned models via evaluations. I felt not that many people were talking about it and it has big implications on what conclusions we draw from empirical results on alignment evaluations.
- ^
I am not claiming that the current AI models are deceptively aligned or even close but the limitations of ever knowing if would be.
I think you mean that seemingly-aligned AI is the most dangerous kind? Big difference. This issue is discussed frequently, although I don’t have a really good reference for it.
Yes and no. Yes in principle but no in practicality. The point I am making is that with regards to all possible measures, seemingly aligned and aligned models are indistinguishable. I agree that many people are aware and it has been talked about for a long time on this forum but I am surprised that evidence of apparent alignment is still considered as progress towards the later by a lot of people. Whereas I see later having a measurable evidence almost impossible.
I got that. I’m pointing out that you’re probably being downvoted because your title is quite inaccurate.
Thanks for pointing that out and for the engagement despite that. I have changed the title and added a short note on the edit.
What really bothers me is the trajectory we seem to have chosen: Continue to scale the models and monitor them for misalignment. This plan has some obvious flaws:
1. Verifiable way to know if we got an aligned or seemingly aligned AI as a result; since evaluations can’t distinguish between the two.
2. White-box techniques seem to be pretty limited currently and it is uncertain if we will get distinguishable signals if most reliable techniques are developed past a certain(unknown) capability mark.
3. If we continue moving towards more automated pipelines, because we feel it is safe, we won’t be able to limit catastrophes.
I don’t see much push from technical perspective against this trajectory. For people starting out work in AI safety from technical perspective, I don’t see many suggestions that challenge this trajectory and propose alternatives. I see there is scientist AI from Yoshua Bengio but it doesn’t seem to be discussed as much. I see theoretical work with Simplex and SLT, but seems still at beginning stage.
And since most beginner work is about replicating and extending current work, it creates this chain-reaction, and as a result, a majority of the work ends up being the one following the same trajectory.