I think you mean that seemingly-aligned AI is the most dangerous kind? Big difference. This issue is discussed frequently, although I don’t have a really good reference for it.
Yes and no. Yes in principle, but no in practice. The point I am making is that, with regard to all possible measures, seemingly aligned and aligned models are indistinguishable. I agree that many people are aware of this and that it has been discussed for a long time on this forum, but I am surprised that evidence of apparent alignment is still treated by a lot of people as progress towards the latter. In my view, measurable evidence of the latter is almost impossible to obtain.
I got that. I’m pointing out that you’re probably being downvoted because your title is quite inaccurate.
Thanks for pointing that out and for the engagement despite that. I have changed the title and added a short note on the edit.
What really bothers me is the trajectory we seem to have chosen: Continue to scale the models and monitor them for misalignment. This plan has some obvious flaws:
1. There is no verifiable way to know whether we got an aligned or a seemingly aligned AI as a result, since evaluations can’t distinguish between the two.
2. White-box techniques seem to be pretty limited currently, and it is uncertain whether we will get distinguishable signals if the most reliable techniques are only developed past a certain (unknown) capability mark.
3. If we continue moving towards more automated pipelines, because we feel it is safe, we won’t be able to limit catastrophes.
I don’t see much pushback against this trajectory from a technical perspective. For people starting out in AI safety on the technical side, I don’t see many suggestions that challenge this trajectory and propose alternatives. There is Scientist AI from Yoshua Bengio, but it doesn’t seem to be discussed as much. There is also theoretical work with Simplex and SLT, but it still seems to be at an early stage.
And since most beginner work is about replicating and extending current work, it creates a chain reaction, and as a result the majority of the work ends up following the same trajectory.