Evan R. Murphy comments on Discovering Language Model Behaviors with Model-Written Evaluations

Evan R. Murphy 10 Feb 2023 9:51 UTC
Added an update to the parent comment:

> Update (Feb 10, 2023): I no longer endorse everything in this comment. I had overlooked that all or most of the prompts use “Human:” and “Assistant:” labels. This means we shouldn’t interpret these results as pervasive properties of the models, or as resulting from any way they could be conditioned, but only as properties of the way they simulate the “Assistant” character. nostalgebraist’s comment explains this pretty well.