Nobody is asking that the AI also generalize to “optimize human values as well as the best available combination of skills it has otherwise...”; at least, I wasn’t asking that. (At no point did I assume that fully general means ‘equally good’ at all tasks. I am not even sure such comparisons can be made.) But rereading your comments, it seems you were assuming that all along, since you brought up competitiveness worries. So maybe I understand you better now: you are assuming a hypercompetitive takeoff. There are AIs running around optimized to play the training game or something, and we use interpretability tools to intervene on some of them and make them optimize for long-run human values instead. The modified AIs won’t be as good at that as they were at playing the training game, even though they will be able to do it (compare: humans can optimize for constructing large cubes of iron, but they aren’t as good at it as they are at optimizing for status), and so they’ll lose competitions to the remaining AIs that haven’t been modified?
(My response to this would be: ah, this makes sense, but I don’t expect there to be this much competition, so I’m not bothered by this problem. I think if we have the interpretability tools, we’ll probably be able to retarget the search of all relevant AIs, and then they’ll optimize for human values inefficiently but well enough to save the day.)
I think competitiveness matters a lot even if there’s only a moderate amount of competitive pressure. The gaps in efficiency I’m imagining are less “10x worse” and more like “I only had support vector machines and you had SGD.”