Perhaps the goal for existing work targeting AI safety is less to ensure that AI safety happens, and more to make sure that we make AI systems that are strictly[1] better than the current researchers at figuring out what to do about AI safety.
I’m unsure how hard AI safety is. But I consider it fairly likely that mid-term (maybe 50% of the way to TAI, in years) safe AI systems will outperform humans on AI safety strategy and the large majority of the research work.
If humans can successfully bootstrap infrastructure more capable than we are, then our (humans’) main work is done (though there could still be other work we can help with).
It might well be the case that the resulting AI systems would recognize that the situation is fairly hopeless. But at that point, humans have done the key things they need to do on this, hopeless or not. Our job is to set things up as best we can; more is by definition impossible.
Personally, I feel very doomy about humans solving, today, the various alignment problems of many years from now. But I feel much better about us making systems that will do a better job at guiding things than we could.
(The empirical question here is how difficult it is to automate alignment research. I realize this is a controversial and much-discussed topic. My guess is that many researchers will never agree with good AI systems and will always hold out on considering them superior, and that on the flip side, many people will trust AIs before they really should. Getting this right is definitely tricky.)
[1] “Strictly” meaning that they’re very likely better overall, not that there’s absolutely no area where humans will be better than them.
But I consider it fairly likely that mid-term (maybe 50% of the way to TAI, in years) safe AI systems are likely to outperform humans on AI safety strategy and the large majority of the research work.
The best humans, or the median humans who do that work, or something else?
The humans trusted to make decisions.
I’m hesitant to say “best humans”, because who knows how many smart people there may be out there who might luck out or something.
But “the people making decisions on this, including in key EA orgs/spending” is a much more understandable bar.