the strategicness of the persona (and its propensities) isn’t necessarily particularly connected to the process that created the topic-specific capabilities.
misleading in that the pressure to abandon it to enable continual learning is going to be strong.
Right: in general, the ongoing source of capabilities is what's in question, and all "alignment" methods that ignore this are useless.
The current pretraining/RLVR/RLHF process illustrates how practical alignment methods can plausibly defeat a source of capabilities that is quite strong by many metrics. These alignment methods are specific to their niche: they are probably going to be useless outside it, against new sources of different kinds of capabilities, but they are quite useful where they apply.
So I'm not making a general point about alignment methods and sources of capabilities. I'm merely gesturing at a plausible next source of capabilities that the current methods might very soon become helpless to manage. Modesty memes make RSI or superintelligence too salient as the thing the current alignment methods won't yet be able to defeat; it might be useful to look at the weakest next thing instead.
I think the issue with looking at the weakest next thing is that you don't generalize. The lack of generalization might be OK, but sometimes it's not. As an example, suppose you have a method that can interpret an AI that's "fixed" (whatever that might mean). Suppose further that each time there's an "AI advance", the method requires some additional work before it can interpret the new "fixed" AI. You might say that interpretability is mostly solved. But you might be missing what's actually happening: your method is good at interpreting mental content that has already been made explicit (https://tsvibt.blogspot.com/2023/03/explicitness.html), and bad at interpreting mental content that is still inexplicit. It's not that you've mostly solved the whole problem; it's that you've 100% solved the (easy) half of the problem and 0% solved the (hard) half.
Let's say there are three parts of the problem: the easy, solved part; the middle, next-thing part; and the hard, lethal part. The claim that we've 0% solved the hard part is quite popular: current prosaic alignment methods don't apply to RSI or superintelligence. Other things, such as scalable oversight, might be suggested as helping with RSI or superintelligence, but that's not what I'm talking about.
This thread is about the middle part of the problem: having mostly solved the easy part shouldn't be construed as significant evidence that we are OK until RSI or superintelligence. The failure to generalize can start applying earlier than the hardest part of the problem, which is an immediate concern of its own. Focusing only on this immediate concern might distract from the further generalization to the hardest part, but that's no reason to endorse failing to notice the immediate concern; these issues shouldn't compete with each other.
I think I don't disagree. If there's something I'm trying to "defend" here, it would be something like "later, when some of the medium-hard parts are kinda solved, I'm not going to update very much about the hard parts, and you can't accuse me of goalpost moving", and maybe "and you can kinda see that the goalposts are far away, but only by generalizing past the current difficulties".
Sure, that's one way to put it. But scalable oversight might give another way of framing this, one that makes the intermediate parts of the problem even more important than the hardest parts (within the assumptions of that framing). The story there is that the AIs immediately prior to RSI or superintelligence are going to be the key tools for solving alignment for RSI or superintelligence, so it's important for human developers to solve alignment up to that point, but it's not important for human developers who don't yet have access to such tools to solve the rest of the problem.
And when prosaic alignment methods look to be mostly solving the easy part of the problem, the standard modesty pronouncements about how these methods don't directly apply to the hardest, lethal part can be earnestly acknowledged and yet fail to help at all, because that part isn't seen as relevant to work on in the current regime. So if pretraining/RLVR/RLHF seems to be working, concluding that we are probably fine until RSI or superintelligence is tantamount to concluding that we are probably fine overall (within this scalable-oversight framing).
Thus noticing a major difficulty with the middle part of the problem becomes a crux for expectations about AI danger overall, even for some people who already acknowledge the difficulty of the hardest part of the problem.
Ok. If I’m following, I think I agree, except that I’d probably say “you mostly need to solve [what I’m calling the hard parts] in order to solve intermediate alignment well enough for pre-strong AIs to be the engines of alignment progress”. So either I’m wrong about what the hard parts are, or you actually need to solve the hard parts to get scalable oversight (and therefore it doesn’t really help much).