It’s not that you’ve mostly solved the whole problem; it’s that you’ve 100% solved the (easy) half of the problem and 0% solved the (hard) half of the problem.
Let’s say there are three parts of the problem: the easy, solved part; the middle part that comes next; and the hard, lethal part. The claim that we’ve 0% solved the hard part is quite popular: that current prosaic alignment methods don’t apply to RSI or superintelligence. There might be other things suggested as helping with RSI or superintelligence, such as scalable oversight, but that’s not what I’m talking about.
This thread is about the middle part of the problem: having mostly solved the easy part shouldn’t be construed as significant evidence that we are OK until RSI or superintelligence. The failure to generalize can start applying earlier than the hardest part of the problem, which is an immediate concern in its own right. Focusing only on this immediate concern might distract from the further generalization to the hardest part of the problem, but that’s not a reason to endorse failing to notice the immediate concern; these issues shouldn’t compete with each other.
I think I don’t disagree. If there’s something I’m trying to “defend” here, it would be like “later, when some of the medium-hard parts are kinda solved, I’m not going to update very much about the hard parts, and you can’t accuse me of goalpost moving” and maybe “and you can kinda see that the goalposts are far away, but only by generalizing past the current difficulties”.
Sure, that’s one way to put this. But scalable oversight might give another way of framing this, one that makes the intermediate parts of the problem even more important than the hardest parts (within the assumptions of that framing). The story there is that the AIs immediately prior to RSI or superintelligence are going to be the key tools in solving alignment for RSI or superintelligence, so it’s important for human developers to solve alignment up to that point, but it’s not important for human developers who don’t yet have access to such tools to solve the rest of the problem.
And when prosaic alignment methods look to be mostly solving the easy part of the problem, the standard modesty pronouncements about how these methods don’t directly apply to the hardest, lethal part of the problem can be earnestly acknowledged and yet fail to help at all, because that hardest part isn’t seen as a relevant part of the problem to work on in the current regime. So if pretraining/RLVR/RLHF seems to be working, concluding that we are probably fine until RSI or superintelligence is tantamount to concluding that we are probably fine overall (within this framing around scalable oversight).
Thus noticing a major difficulty with the middle part of the problem becomes a crux for expectations about AI danger overall, even for some people who already acknowledge the difficulty of the hardest part of the problem.
Ok. If I’m following, I think I agree, except that I’d probably say “you mostly need to solve [what I’m calling the hard parts] in order to solve intermediate alignment well enough for pre-strong AIs to be the engines of alignment progress”. So either I’m wrong about what the hard parts are, or you actually need to solve the hard parts to get scalable oversight (and therefore it doesn’t really help much).