You may have a point that this is a crux for some. I think I...mostly reject the framing of “worst-case” and “average-case” “alignment”. I claim models are not aligned, period. I claim “doing what the operators want most of the time” is not alignment and should not be mistaken for it.
The scenario I am most concerned about involves AIs trained on and tasked with thinking about the deep implications of AI values. Such AIs probably get better at noticing their own values. This seems like the “default” and “normal” case to me, and it seems almost unavoidable that deep misalignment begins to surface at that point.
Even if AIs did not do this sort of AI research, though, competence and internal coherence seem hard to disentangle from each other.