This comment is inspired by a conversation with Ajeya Cotra.
As a simple example of how the scaling hypothesis affects AI safety research: the hypothesis suggests that the training objective (“predict the next word”) is relatively unimportant in determining the properties of the trained agent; the dataset matters much more. This in turn suggests that analyses based on the “reward function used to train the agent” are probably not going to be very predictive of the systems we actually build.
To elaborate on this more:
Claim 1: Scaling hypothesis + abundance of data + competitiveness requirement implies that an alignment solution will need to involve pretraining.
Argument: The scaling hypothesis implies that you can get strong capabilities out of abundant effectively-free data. So, if you want your alignment proposal to be competitive, it must also get strong capabilities out of effectively-free data. So far, the only method we know of for this is pretraining.
Note that you could have schemes where you train an actor model using a reward model that is always aligned; in this case your actor model could avoid pretraining (since you can generate effectively-free data from the reward model) but your reward model will need to be pretrained. So the claim is that some part of your scheme involves pretraining; it doesn’t have to be the final agent that is deployed.
Claim 2: For a fixed ‘reasonable’ pretraining objective, there exists some (possibly crazy and bespoke but still reasonably-sized) dataset which would make the resulting model aligned without any finetuning.
(This claim is more an intuition pump for Claim 3 than an interesting claim in its own right.)
Argument 1: As long as your pretraining objective doesn’t do something unreasonable like say “ignore the data, always say ‘hello world’”, given the fixed pretraining objective each data point acts as a “constraint” on the parameters of the model. If you have D data points and N model parameters with D > N, then you should expect these constraints to approximately determine the model parameters (in the same way that N linearly independent equations on N variables uniquely determine those variables). So with the appropriate choice of the D data points, you should be able to get any model parameters you want, including the parameters of the aligned model.
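As a minimal numerical sketch of this constraint-counting intuition (my construction, not part of the original argument: a linear model, where “training” has a closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5   # number of model parameters
D = 20  # number of data points, with D > N

# The parameters we *want* the trained model to end up with
# (standing in for "the parameters of the aligned model").
w_target = rng.normal(size=N)

# Choose the dataset so that every data point is a constraint
# consistent with w_target.
X = rng.normal(size=(D, N))  # D inputs; rows linearly independent w.h.p.
y = X @ w_target             # labels generated from the target parameters

# "Train" on the constructed dataset: least squares is the optimum of the
# squared-error objective for this model class.
w_fit, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_fit, w_target))  # True: the dataset pinned down the parameters
```

With D > N independent constraints, the dataset alone determines which model you get out.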
Argument 2: There are ~tens of bits going into the choice of pretraining objective, and ~millions of bits going into the dataset, so in some sense nearly all of the action is in the dataset.
Argument 3: For the specific case of next-word prediction, you could take an aligned model, generate a dataset by running that model, and then train a new model with next-word prediction on that dataset. I believe this is equivalent to model distillation, which has been found to be really unreasonably effective, including for generalization (see e.g. here), so I’d expect the resulting model would be aligned too.
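A toy version of this distillation argument (my sketch; a Markov-chain “language model” stands in for GPT, so next-word-prediction training reduces to counting):

```python
import numpy as np

rng = np.random.default_rng(0)
V = 4  # vocabulary size

# The "teacher": a fixed next-token distribution P(next | current),
# standing in for the aligned model.
teacher = rng.dirichlet(np.ones(V), size=V)

# Generate a dataset by running the teacher.
tokens = [0]
for _ in range(200_000):
    tokens.append(rng.choice(V, p=teacher[tokens[-1]]))

# "Pretrain" the student with next-word prediction: for this model class,
# the maximum-likelihood fit is just normalized transition counts.
counts = np.zeros((V, V))
for a, b in zip(tokens, tokens[1:]):
    counts[a, b] += 1
student = counts / counts.sum(axis=1, keepdims=True)

print(np.abs(student - teacher).max())  # small: the student mimics the teacher
```

The student ends up behaviorally near-identical to the teacher, which is the sense in which distilling an aligned model should yield an aligned model.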
Claim 3: If you don’t control the dataset, it mostly doesn’t matter what pretraining objective you use (assuming you use a simple one rather than e.g. a reward function that encodes all of human values); the properties of the model are going to be roughly similar regardless.
Argument: Basically the same as for Claim 2: by far most of the influence on which model you get out is coming from the dataset.
(This is probably the weakest argument in the chain; just because most of the influence comes from the dataset doesn’t mean that the pretraining objective can’t have influence as well. I still think the claim is true though, and I still feel pretty confident about the final conclusion in the next claim.)
Claim 4: GPT-N need not be “trying” to predict the next word. To elaborate: one model of GPT-N is that it is building a world model and making plans in the world model such that it predicts the next word as accurately as possible. This model is fine on-distribution but incorrect off-distribution. In particular, it predicts that GPT-N would e.g. deliberately convince humans to become more predictable so it can do better on future next-word predictions; this model prediction is probably wrong.
Argument: There are several pretraining objectives that could have been used to train GPT-N other than next word prediction (e.g. masked language modeling). For each of these, there’s a corresponding model that the resulting GPT-N would “try” to <do the thing in the pretraining objective>. These models make different predictions about what GPT-N would do off distribution. However, by claim 3 it doesn’t matter much which pretraining objective you use, so most of these models would be wrong.
Claim 4: GPT-N need not be “trying” to predict the next word. To elaborate: one model of GPT-N is that it is building a world model and making plans in the world model such that it predicts the next word as accurately as possible. This model is fine on-distribution but incorrect off-distribution. In particular, it predicts that GPT-N would e.g. deliberately convince humans to become more predictable so it can do better on future next-word predictions; this model prediction is probably wrong.
I got a bit confused by this section, I think because the word “model” is being used in two different ways, neither of which is in the sense of “machine learning model”.
Paraphrasing what I think is being said:
- An observer (us) has a model_1 of what GPT-N is doing.
- According to their model_1, GPT-N is building its own world model_2, that it uses to plan its actions.
- The observer’s model_1 makes good predictions about GPT-N’s behavior when GPT-N (the machine learning model_3) is tested on data that comes from the training distribution, but bad predictions about what GPT-N will do when tested (or used) on data that does not come from the training distribution.
- The way that the observer’s model_1 will be wrong is not that it will be fooled by GPT-N taking a treacherous turn, but rather the opposite—the observer’s model_1 will predict a treacherous turn, but instead GPT-N will go on filling in missing words, as in training (or something else?).

Is that right?

Yes, that’s right, sorry about the confusion.
Huh, then it seems I misunderstood you! The fourth bullet point claims that GPT-N will go on filling in missing words rather than doing a treacherous turn. But this seems unsupported by the argument you made, and in fact the opposite seems supported by it. The argument you made was:
There are several pretraining objectives that could have been used to train GPT-N other than next word prediction (e.g. masked language modeling). For each of these, there’s a corresponding model that the resulting GPT-N would “try” to <do the thing in the pretraining objective>. These models make different predictions about what GPT-N would do off distribution. However, by claim 3 it doesn’t matter much which pretraining objective you use, so most of these models would be wrong.
Seems to me the conclusion of this argument is that “In general it’s not true that the AI is trying to achieve its training objective.” The natural corollary is: We have no idea what the AI is trying to achieve, if it is trying to achieve anything at all. So instead of concluding “It’ll probably just keep filling in missing words as in training” we should conclude “we have no idea what it’ll do; treacherous turn is a real possibility because that’s what’ll happen for most goals it could have, and it may have a goal for all we know.”
The fourth bullet point claims that GPT-N will go on filling in missing words rather than doing a treacherous turn.
?? I said nothing about a treacherous turn? And where did I say it would go on filling in missing words?
EDIT: Ah, you mean the fourth bullet point in ESRogs’ response. I was thinking of that as one example of how such reasoning could go wrong, rather than the only case. So in that case the model_1 confidently predicts a treacherous turn, but this is the wrong epistemic state to be in, because it is also plausible that GPT-N just “fills in words” instead.
Seems to me the conclusion of this argument is that “In general it’s not true that the AI is trying to achieve its training objective.”
Isn’t that effectively what I said? (I was trying to be more precise since “achieve its training objective” is ambiguous, but given what I understand you to mean by that phrase, I think it’s what I said?)
we have no idea what it’ll do; treacherous turn is a real possibility because that’s what’ll happen for most goals it could have, and it may have a goal for all we know.
This seems reasonable to me (and seems compatible with what I said)

OK cool, sorry for the confusion. Yeah, I think ESRogs’ interpretation of you was making a bit stronger claim than you actually were.
Claim 3: If you don’t control the dataset, it mostly doesn’t matter what pretraining objective you use (assuming you use a simple one rather than e.g. a reward function that encodes all of human values); the properties of the model are going to be roughly similar regardless.
Analogous claim: since any program specifiable under UTM U1 is also expressible under UTM U2, choice of UTM doesn’t matter.
And this is true up to a point: up to constant factors, it doesn’t matter. But U1 can make it easier (simpler, faster, etc.) to specify a set of programs than U2 does. And so “there exists a program in U2-encoding which implements P in U1-encoding” doesn’t get everything I want: I want to reason about the distribution of programs, about how hard it tends to be to get programs with desirable properties.
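As a toy quantification of that (numbers mine, purely illustrative): under a simplicity prior, the same program gets wildly different weight depending on which encoding makes it short.

```python
# Both UTMs can express program P, but at different description lengths;
# a simplicity prior 2 ** (-length) then weights P very differently.
len_under_u1 = 2   # U1 happens to have a primitive that makes P short
len_under_u2 = 12  # U2 needs a longer encoding of the same P

prior_u1 = 2.0 ** -len_under_u1
prior_u2 = 2.0 ** -len_under_u2
print(prior_u1 / prior_u2)  # 1024.0: P is ~1000x easier to "find" under U1
```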
Stepping out of the analogy: even though I agree that “reasonable” pretraining objectives are all compatible with aligned / unaligned / arbitrarily-behaved models, this argument seems to leave room for the possibility that some objectives make alignment far more likely, a priori. And you may be noting as much:
(This is probably the weakest argument in the chain; just because most of the influence comes from the dataset doesn’t mean that the pretraining objective can’t have influence as well. I still think the claim is true though, and I still feel pretty confident about the final conclusion in the next claim.)
Yeah, I agree with all this. I still think the pretraining objective basically doesn’t matter for alignment (beyond being “reasonable”) but I don’t think the argument I’ve given establishes that.
I do think the arguments in support of Claim 2 are sufficient to at least raise Claim 3 to attention (and thus Claim 4 as well).

Sure.
Additional note for posterity: when I talked about “some objectives [may] make alignment far more likely”, I was considering something like “given this pretraining objective and an otherwise fixed training process, what is the measure of datasets in the N-datapoint hypercube such that the trained model is aligned?”, perhaps also weighting by ease of specification in some sense.
what is the measure of datasets in the N-datapoint hypercube such that the trained model is aligned?”, perhaps also weighting by ease of specification in some sense.
You’re going to need the ease of specification condition, or something similar; else you’ll probably run into no-free-lunch considerations (at which point I think you’ve stopped talking about anything useful).
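To make that proposed quantity concrete, here is a rough Monte Carlo sketch (entirely a toy: the training process and the “aligned” predicate below are stand-ins I invented, not anything from the discussion):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8            # datapoints per dataset (the "N-datapoint hypercube")
TRIALS = 10_000  # number of sampled datasets

def train(dataset):
    # Stand-in for "pretraining objective + otherwise fixed training
    # process": here the trained model is just the empirical mean.
    return dataset.mean()

def is_aligned(model):
    # Stand-in predicate for "the trained model is aligned".
    return model > 0.6

hits = sum(
    is_aligned(train(rng.uniform(0.0, 1.0, size=N)))
    for _ in range(TRIALS)
)
print(hits / TRIALS)  # estimated measure of alignment-producing datasets
```

Comparing this estimate across pretraining objectives (everything else held fixed) is the comparison being proposed; the no-free-lunch worry is that without an ease-of-specification weighting, the uniform measure over datasets may not track anything we care about.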