It is indeed pretty weird to see these behaviors appear in pure LMs. It’s especially striking with sycophancy, where the large models seem obviously (?) miscalibrated given the ambiguity of the prompt.
By ‘pure LMs’ do you mean ‘pure next-token-predicting LLMs trained on a standard internet corpus’? If so, I’d be very surprised if they’re miscalibrated, provided this prompt isn’t that improbable (which it probably isn’t). I’d guess this output is the ‘right’ output for this corpus (so long as you don’t sample enough tokens to make the sequence detectably very weird to the model). Note that t=0 (or low temperature in general) may induce all kinds of strange behavior in addition to making the generation detectably weird.
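To make the t=0 point concrete, here’s a minimal sketch of temperature sampling (function name and toy logits are illustrative, not from any particular library): as t shrinks, the distribution collapses onto the argmax token, so t=0 decoding is pure greedy search rather than a draw from the model’s actual predictive distribution.

```python
import numpy as np

def sample_with_temperature(logits, t, rng=None):
    """Sample a token index from logits scaled by temperature t.

    As t -> 0 this approaches greedy argmax decoding: all probability
    mass concentrates on the single most likely continuation, which is
    one reason low-temperature generations can drift into sequences the
    model itself would assign low probability to as a whole.
    """
    rng = rng or np.random.default_rng()
    if t == 0:
        return int(np.argmax(logits))  # greedy decoding
    scaled = np.asarray(logits, dtype=float) / t
    scaled -= scaled.max()             # subtract max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

toy_logits = [2.0, 1.0, 0.5]
print(sample_with_temperature(toy_logits, 0))    # always picks index 0
```

The point is just that repeated greedy picks compound: each step is individually the most likely token, but the resulting sequence can be far from a typical sample under the model.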