Claim 4: GPT-N need not be “trying” to predict the next word. To elaborate: one model of GPT-N is that it builds a world model and makes plans within that world model so as to predict the next word as accurately as possible. This model is fine on-distribution but incorrect off-distribution. In particular, it predicts that GPT-N would, for example, deliberately convince humans to become more predictable so that it can do better on future next-word predictions; that prediction is probably wrong.
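To make “predict the next word” concrete as a training objective (and to contrast it with masked language modeling, which comes up later in the thread), here is a minimal sketch of the two losses. It assumes PyTorch; the function names, tensor shapes, and variables are purely illustrative, not anyone’s actual GPT training code.

```python
# Illustrative sketch only: the names and shapes below are hypothetical.
import torch
import torch.nn.functional as F

def next_token_loss(logits, tokens):
    """Causal LM ("next word prediction") objective:
    at each position, score the model against the *following* token.

    logits: (batch, seq_len, vocab_size) model outputs
    tokens: (batch, seq_len) input token ids
    """
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))  # positions 0..T-2
    target = tokens[:, 1:].reshape(-1)                     # tokens 1..T-1
    return F.cross_entropy(pred, target)

def masked_lm_loss(logits, tokens, mask):
    """Masked LM objective: reconstruct only the tokens that were masked out.

    mask: (batch, seq_len) boolean, True where the input was replaced by
    a [MASK] token and the model must recover the original token.
    """
    pred = logits[mask]    # (num_masked, vocab_size)
    target = tokens[mask]  # (num_masked,)
    return F.cross_entropy(pred, target)
```

Neither loss, on its own, says anything about what the trained model is “trying” to do off-distribution; that is the point at issue in the exchange below.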
I got a bit confused by this section, I think because the word “model” is being used in two different ways, neither of which is in the sense of “machine learning model”.
Paraphrasing what I think is being said:
An observer (us) has a model_1 of what GPT-N is doing.
According to their model_1, GPT-N is building its own world model_2, which it uses to plan its actions.
The observer’s model_1 makes good predictions about GPT-N’s behavior when GPT-N (the machine learning model_3) is tested on data that comes from the training distribution, but bad predictions about what GPT-N will do when tested (or used) on data that does not come from the training distribution.
The way that the observer’s model_1 will be wrong is not that it will be fooled by GPT-N taking a treacherous turn, but rather the opposite: the observer’s model_1 will predict a treacherous turn, but instead GPT-N will go on filling in missing words, as in training (or something else?).
Is that right?
Yes, that’s right, sorry about the confusion.
Huh, then it seems I misunderstood you! The fourth bullet point claims that GPT-N will go on filling in missing words rather than doing a treacherous turn. But this seems unsupported by the argument you made; in fact, the opposite seems supported. The argument you made was:
There are several pretraining objectives that could have been used to train GPT-N other than next word prediction (e.g. masked language modeling). For each of these, there’s a corresponding model according to which the resulting GPT-N would “try” to <do the thing in the pretraining objective>. These models make different predictions about what GPT-N would do off-distribution. However, by claim 3 it doesn’t matter much which pretraining objective you use, so most of these models would be wrong.
Seems to me the conclusion of this argument is that “In general it’s not true that the AI is trying to achieve its training objective.” The natural corollary is: We have no idea what the AI is trying to achieve, if it is trying to achieve anything at all. So instead of concluding “It’ll probably just keep filling in missing words as in training” we should conclude “we have no idea what it’ll do; treacherous turn is a real possibility because that’s what’ll happen for most goals it could have, and it may have a goal for all we know.”
The fourth bullet point claims that GPT-N will go on filling in missing words rather than doing a treacherous turn.
?? I said nothing about a treacherous turn? And where did I say it would go on filling in missing words?
EDIT: Ah, you mean the fourth bullet point in ESRogs’ response. I was thinking of that as one example of how such reasoning could go wrong, rather than the only case. So in that case the model_1 confidently predicts a treacherous turn, but this is the wrong epistemic state to be in, because it is also plausible that it just “fills in words” instead.
Seems to me the conclusion of this argument is that “In general it’s not true that the AI is trying to achieve its training objective.”
Isn’t that effectively what I said? (I was trying to be more precise since “achieve its training objective” is ambiguous, but given what I understand you to mean by that phrase, I think it’s what I said?)
we have no idea what it’ll do; treacherous turn is a real possibility because that’s what’ll happen for most goals it could have, and it may have a goal for all we know.
This seems reasonable to me (and seems compatible with what I said).
OK cool, sorry for the confusion. Yeah, I think ESRogs’ interpretation attributed a somewhat stronger claim to you than you were actually making.