If prosaic alignment looks like it might be succeeding in getting capable AIs to additionally behave in manageable ways, continual learning is a simple, plausible next change that might break the assurances suggested by that impression. The usual go-to for such concerns is RSI, possibly up to superintelligence. But prosaic propensity training is only about current behavior, so it doesn’t need to take a significant change to exit the manageable regime where developers have a sense of how the model is behaving, the same sense they used in the feedback loops that ensure the model behaves manageably.
I’m reading this as pointing to lack of oversight being coordinated with the emergence of strong capabilities: either lack of oversight is necessary for that emergence in practice, or it follows by default from an overhang the emergence creates. But in the current situation, RL creates strong capabilities (at least in the niche settings of RL environments), which prosaic alignment might then be able to elicit through an LLM persona, so that the strategicness of the persona (and its propensities) isn’t necessarily particularly connected to the process that created the topic-specific capabilities. A student learns from a textbook without necessarily being (or remaining) the same kind of thing as the textbook’s author, even if the same model passes through these stages.
The overhang/wildfire concern is still present, and the current form the models take (with frozen weights) may create a sense of security that’s real as long as the models retain this form, but misleading in that the pressure to abandon it to enable continual learning is going to be strong. There is some talk about neuralese vs. text-only CoT, but continual learning creates the same issue of lack of oversight (through inscrutability) over the thing that ends up in control of the AI’s behavior (for neuralese, it’s the inscrutable latent CoT; for continual learning, it’s the updated model that learned from personal experience).
> the strategicness of the persona (and its propensities) isn’t necessarily particularly connected to the process that created the topic-specific capabilities.

> misleading in that the pressure to abandon it to enable continual learning is going to be strong.
Right: in general, the ongoing source of capabilities is what’s in question, and all “alignment” methods that ignore this are useless.
The current pretraining/RLVR/RLHF process illustrates how practical alignment methods plausibly get to defeat a source of capabilities that is quite strong by many metrics. These alignment methods are specific to their niche; they are probably going to be useless outside of it, against new sources of different kinds of capabilities, but they are quite useful where they apply.
So I’m not making a general point about alignment methods and sources of capabilities. I’m merely gesturing at a plausible next source of capabilities that the current methods might very soon become helpless to manage. Modesty memes make RSI or superintelligence too salient as something the current methods of alignment won’t yet be able to defeat; it might be useful to look at the weakest next thing instead.
I think the issue with looking at the weakest next thing is that you don’t generalize. The lack of generalization might be ok, but sometimes it’s not ok. As an example, suppose you have a method which can interpret an AI that’s “fixed” (whatever that might mean). Suppose further that each time there’s an “AI advance”, this method requires some additional work before it can interpret the new “fixed” AI. You might say that interpretability is mostly solved. But you might be missing that what’s actually happening is that your method is good at interpreting mental content that’s already been made explicit (https://tsvibt.blogspot.com/2023/03/explicitness.html), and bad at interpreting mental content that is inexplicit. It’s not that you’ve mostly solved the whole problem, it’s that you’ve 100% solved (the easy) half of the problem, and 0% solved (the hard) half of the problem.
> It’s not that you’ve mostly solved the whole problem, it’s that you’ve 100% solved (the easy) half of the problem, and 0% solved (the hard) half of the problem.
Let’s say there are three parts of the problem: the easy solved part, the “next thing” middle part, and the hard lethal part. The claim that we’ve 0% solved the hard part of the problem is quite popular, namely that current prosaic alignment methods don’t apply to RSI or superintelligence. There might be other things suggested as helping with RSI or superintelligence, such as scalable oversight, but that’s not what I’m talking about.
This thread is about the middle part of the problem: having mostly solved the easy part shouldn’t be construed as significant evidence that we are OK until RSI or superintelligence. The failure to generalize can start applying earlier than the hardest part of the problem, which is an immediate concern of its own. Only focusing on this immediate concern might distract from the further generalization to the hardest part of the problem, but that’s not a reason to endorse failing to notice the immediate concern; these issues shouldn’t compete with each other.
I think I don’t disagree. If there’s something I’m trying to “defend” here, it would be like “later, when some of the medium-hard parts are kinda solved, I’m not going to update very much about the hard parts, and you can’t accuse me of goalpost moving” and maybe “and you can kinda see that the goalposts are far away, but only by generalizing past the current difficulties”.
Sure, that’s one way to put it. But scalable oversight might give another way of framing this, one that makes the intermediate parts of the problem even more important than the hardest parts (within the assumptions of the framing). The story there is that the AIs immediately prior to either RSI or superintelligence are going to be the key tools in solving alignment for RSI or superintelligence, so it’s important for human developers to solve alignment up to that point, but it’s not important for human developers who don’t yet have access to such tools to solve the rest of the problem.
And when prosaic alignment methods look to be mostly solving the easy part of the problem, the standard modesty pronouncements about how these methods don’t directly apply to the hardest lethal part can be earnestly acknowledged and still fail to help at all, because the hardest part isn’t seen as relevant to work on in the current regime. So if pretraining/RLVR/RLHF seems to be working, concluding that we are probably fine until RSI or superintelligence is tantamount to concluding that we are probably fine overall (within this framing around scalable oversight).
Thus noticing a major difficulty with the middle part of the problem becomes a crux for expectations about AI danger overall, even for some people who already acknowledge the difficulty of the hardest part of the problem.
Ok. If I’m following, I think I agree, except that I’d probably say “you mostly need to solve [what I’m calling the hard parts] in order to solve intermediate alignment well enough for pre-strong AIs to be the engines of alignment progress”. So either I’m wrong about what the hard parts are, or you actually need to solve the hard parts to get scalable oversight (and therefore it doesn’t really help much).
Cf. https://tsvibt.blogspot.com/2023/01/wildfire-of-strategicness.html and https://www.lesswrong.com/posts/Ht4JZtxngKwuQ7cDC/tsvibt-s-shortform?commentId=koeti9ygXB9wPLnnF