Just a note here: I would not interpret fine-tuned GPTs as still “predicting” tokens. Base models predict tokens by computing a probability distribution conditional on the prompt, but for fine-tuned models this distribution no longer represents probabilities, but some “goodness” relative to the fine-tuning, how good the continuation is. Tokens with higher numbers are then not necessarily more probable continuations of the prompt (though next token probability may also play a role) but overall “better” in some opaque way. We hope that what the model thinks is a better token for the continuation of the prompt corresponds to the goals of being helpful, harmless and honest (to use the Anthropic terminology), but whether the model has really learned those goals, or merely something which looks similar, is ultimately unknown.
So RLHF (and equally supervised fine-tuning) also leads to a lack of interpretability. It is unknown what exactly an instruction model like ChatGPT or text-davinci-003 optimizes for. In contrast to this, we know pretty exactly what a base model optimized for: Next token prediction.
You know exactly what both models are optimized for: log loss on the one hand, an unbiased estimator of reward on the other.
You don’t know what either model is optimizing: how would you? In both cases you could guess that they may be optimizing something similar to what they are optimized for.
This relates to what you wrote in the other thread:
I don’t know in what sense “predict human demonstrators” is missing an important safety property from “predict internet text,” and right now it feels to me like kind of magical thinking.
It think the difference is that a base language model is trained on vast amounts of text, so it seems reasonable that it is actually quite good at next token prediction, while the fine-tuning is apparently done with comparatively tiny amounts of preference data. So misalignment seems much more likely in the latter case.
Moreover, human RLHF raters are probably biased in various ways, which encourages the model to reproduce those biases, even if the model doesn’t “believe them” in some sense. For example, some scientists have pointed out that ChatGPT gives politically correct but wrong answers to certain politically taboo but factual questions. (I can go into more detail if required.) Whether the model is honest here and in fact “believes” those things, or whether it is deceptive and just reproduces rater bias rather than being honest, is unknown.
So learning to predict webtext from large amounts of training data, and learning some kind of well-aligned utility function from a small number of (biased) human raters seem problems of highly uneven difficulty and probability of misalignment.
Agreed, though I do find framing them as a warped predictor helpful in some cases. In principle, the deviation from the original unbiased prediction over all inputs should include within it all agentic behaviors, and there might exist some way that you could extract goals from that bias vector. (I don’t have anything super concrete here and I’m not super optimistic that this framing gives you anything extra compared to other interpretability mechanisms, but it’s something I’ve thought about poking.)
Just a note here: I would not interpret fine-tuned GPTs as still “predicting” tokens. Base models predict tokens by computing a probability distribution conditional on the prompt, but for fine-tuned models this distribution no longer represents probabilities, but some “goodness” relative to the fine-tuning, how good the continuation is. Tokens with higher numbers are then not necessarily more probable continuations of the prompt (though next token probability may also play a role) but overall “better” in some opaque way. We hope that what the model thinks is a better token for the continuation of the prompt corresponds to the goals of being helpful, harmless and honest (to use the Anthropic terminology), but whether the model has really learned those goals, or merely something which looks similar, is ultimately unknown.
So RLHF (and equally supervised fine-tuning) also leads to a lack of interpretability. It is unknown what exactly an instruction model like ChatGPT or text-davinci-003 optimizes for. In contrast to this, we know pretty exactly what a base model optimized for: Next token prediction.
You know exactly what both models are optimized for: log loss on the one hand, an unbiased estimator of reward on the other.
You don’t know what either model is optimizing: how would you? In both cases you could guess that they may be optimizing something similar to what they are optimized for.
This relates to what you wrote in the other thread:
It think the difference is that a base language model is trained on vast amounts of text, so it seems reasonable that it is actually quite good at next token prediction, while the fine-tuning is apparently done with comparatively tiny amounts of preference data. So misalignment seems much more likely in the latter case.
Moreover, human RLHF raters are probably biased in various ways, which encourages the model to reproduce those biases, even if the model doesn’t “believe them” in some sense. For example, some scientists have pointed out that ChatGPT gives politically correct but wrong answers to certain politically taboo but factual questions. (I can go into more detail if required.) Whether the model is honest here and in fact “believes” those things, or whether it is deceptive and just reproduces rater bias rather than being honest, is unknown.
So learning to predict webtext from large amounts of training data, and learning some kind of well-aligned utility function from a small number of (biased) human raters seem problems of highly uneven difficulty and probability of misalignment.
Agreed, though I do find framing them as a warped predictor helpful in some cases. In principle, the deviation from the original unbiased prediction over all inputs should include within it all agentic behaviors, and there might exist some way that you could extract goals from that bias vector. (I don’t have anything super concrete here and I’m not super optimistic that this framing gives you anything extra compared to other interpretability mechanisms, but it’s something I’ve thought about poking.)