I think I don’t disagree. If there’s something I’m trying to “defend” here, it would be like “later, when some of the medium-hard parts are kinda solved, I’m not going to update very much about the hard parts, and you can’t accuse me of goalpost moving” and maybe “and you can kinda see that the goalposts are far away, but only by generalizing past the current difficulties”.
Sure, that’s one way to put this. But scalable oversight might actually give another way of framing this, one that makes the intermediate parts of the problem even more important than the hardest parts (within the assumptions of the framing). The story there is that the AIs immediately prior to either RSI or superintelligence are going to be the key tools in solving alignment for RSI or superintelligence, so it’s important for human developers to solve alignment up to that point, but it’s not important for human developers who don’t yet have access to such tools to solve the rest of the problem.
And when prosaic alignment methods look to be mostly solving the easy part of the problem, the standard modesty pronouncements about how these methods don’t directly apply to the hardest lethal part of the problem can be earnestly acknowledged and still fail to help at all, because that part isn’t seen as relevant to work on in the current regime. So if pretraining/RLVR/RLHF seems to be working, concluding that we are probably fine until RSI or superintelligence is tantamount to concluding that we are probably fine overall (within this framing around scalable oversight).
Thus noticing a major difficulty with the middle part of the problem becomes a crux for expectations about AI danger overall, even for some people who already acknowledge the difficulty of the hardest part of the problem.
Ok. If I’m following, I think I agree, except that I’d probably say “you mostly need to solve [what I’m calling the hard parts] in order to solve intermediate alignment well enough for pre-strong AIs to be the engines of alignment progress”. So either I’m wrong about what the hard parts are, or you actually need to solve the hard parts to get scalable oversight (and therefore it doesn’t really help much).