I’m starting to suspect that one of the cruxes in AI alignment/AI safety debates is whether we need worst-case alignment, where the model essentially never messes up in its alignment to human operators, or whether, for the purposes of automating AI alignment, we only need average-case alignment that doesn’t worry about the extreme cases.
All of the examples you have given would definitely fail a worst-case alignment test. If you believe we commonly need to handle worst-case scenarios, then these examples point to damning alignment problems ahead. But if you believe we don’t need to handle or assume the worst case in order to use AI, or to automate away human jobs with it, then the examples don’t actually bear on whether AI safety gets solved by default, because people holding that view would readily admit these examples are relatively extreme and not normal use of the model.
You may have a point that this is a crux for some. I think I...mostly reject the framing of “worst-case” and “average-case” “alignment”. I claim models are not aligned, period. I claim “doing what the operators want most of the time” is not alignment and should not be mistaken for it.
The scenario I am most concerned about involves AIs trained on and tasked with thinking about the deep implications of AI values. Such AIs probably get better at noticing their own values. This seems like the “default” and “normal” case to me, and it seems almost unavoidable that deep misalignment begins to surface at that point.
Even if AIs did not do this sort of AI research, though, competence and internal coherence seem hard to disentangle from each other.