I find your examples of base GPT-3 predicting indefinite articles for words like ‘tiger’ and ‘orangutan’ pretty interesting. I think I agree that these are evidence that the model is doing some modelling/inference of future tokens beyond the next immediate token.
However, this sort of future-token modelling still seems consistent with a safety-relevant notion of next-token myopia, because any inference that GPT-3 is doing of future tokens here still appears to be in the service of minimising loss on the immediate next token. Inferring ‘orangutan’ helps the model to better predict ‘an’, rather than indicating any kind of tendency to sacrifice loss on ‘an’ in order to somehow score better on ‘orangutan’. [1]
The former still leaves us with a model that is at least plausibly exempt from instrumental convergence, [2] whereas the latter would seem to come from a model (or, more likely, a similarly-trained, scaled-up version of the model) that is at risk of developing instrumentally convergent tendencies, including perhaps deceptive alignment. That is why I am not too worried about the kind of future-token inference you are describing, and why I still consider a model that does this kind of thing ‘myopic’ in the important sense of the word.
--
[1]: As I write this, I am questioning whether the explanation of myopia we gave in the “What is myopia?” section is totally consistent with what I am saying here. I should take another look at that section and see if it warrants a revision. (Update: No revision needed; the definitions we gave in the “What is myopia?” section are consistent with what I’m saying in this comment.)
[2]: However, the model could still be at risk of simulating an agent that has instrumentally convergent tendencies. But that seems like a different kind of risk to manage than the base model itself being instrumentally convergent.