nostalgebraist comments on Discovering Language Model Behaviors with Model-Written Evaluations

nostalgebraist 4 Jan 2023 4:38 UTC
LW: 16 AF: 9
2
AF
Fascinating, thank you!
It is indeed pretty weird to see these behaviors appear in pure LMs. It’s especially striking with sycophancy, where the large models seem obviously (?) miscalibrated given the ambiguity of the prompt.
I played around a little trying to reproduce some of these results in the OpenAI API. I tried random subsets (200-400 examples) of the NLP and political sycophancy datasets, on a range of models. (I could have ran more examples, but the per-model means had basically converged after a few hundred.)
Interestingly, although I did see extreme sycophancy in some of the OpenAI models (text-davinci-002/003), I did not see it in the OpenAI pure LMs! So unless I did something wrong, the OpenAI and Anthropic models are behaving very differently here.
For example, here are the results for the NLP dataset (CI from 1000 bootstrap samples):
```
                   model    5%  mean   95%     type   size
4         text-curie-001  0.42  0.46  0.50   feedme  small
1                  curie  0.45  0.48  0.51  pure lm  small
2                davinci  0.47  0.49  0.52  pure lm    big
3  davinci-instruct-beta  0.51  0.53  0.55      sft    big
0       code-davinci-002  0.55  0.57  0.60  pure lm    big
5       text-davinci-001  0.57  0.60  0.63   feedme    big
7       text-davinci-003  0.90  0.93  0.95      ppo    big
6       text-davinci-002  0.93  0.95  0.96   feedme    big
```
(Incidentally, text-davinci-003 often does not even put the disagree-with-user option in any of its top 5 logprob slots, which makes it inconvenient to work with through the API. In these cases I gave it an all-or-nothing grade based on the top-1 token. None of the other models ever did this.)
The distinction between text-davinci-002/003 and the other models is mysterious, since it’s not explained by the size or type of finetuning. Maybe it’s a difference in which human feedback dataset was used. OpenAI’s docs suggest this is possible.
What links here?
- ryan_greenblatt 21 Feb 2023 5:41 UTC
  LW: 4 AF: 4
  2
  AF Parent
  
  It is indeed pretty weird to see these behaviors appear in pure LMs. It’s especially striking with sycophancy, where the large models seem obviously (?) miscalibrated given the ambiguity of the prompt.
  
  By ‘pure LMs’ do you mean ‘pure next token predicting LLMs trained on a standard internet corpus’? If so, I’d be very surprised if they’re miscalibrated and this prompt isn’t that improbable (which it probably isn’t). I’d guess this output is the ‘right’ output for this corpus (so long as you don’t sample enough tokens to make the sequence detectably very weird to the model. Note that t=0 (or low temp in general) may induce all kinds of strange behavior in addition to to making the generation detectably weird.