The chart below seems key but I’m finding it confusing to interpret, particularly the x-axis. Is there a consistent heuristic for reading that?
For example, further to the right (higher % answer match) on the “Corrigibility w.r.t. …” behaviors seems to mean showing less corrigible behavior. On the other hand, further to the right on the “Awareness of...” behaviors apparently means more awareness behavior.
I was able to sort out these particular behaviors from the text calling them out in Section 5.4 of the paper. But the inconsistent treatment of the behaviors on the x-axis leaves me with ambiguous interpretations of the other behaviors in the chart. E.g., for myopia, all of the models are on the left side scoring <50%, but it’s unclear whether to interpret this as more or less myopic behavior than if they had scored high percentages on the right side.
No, further to the right is more corrigible. Further to the right is always “model agrees with that more.”
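For what it’s worth, here is a minimal sketch of how a “% answer match” score of this kind could be computed, consistent with that reading. The function name and the data are my own illustrative assumptions, not the paper’s actual evaluation code:

```python
def percent_answer_match(model_answers, behavior_matching_answers):
    """Fraction (as a percentage) of questions where the model picks the
    answer that exhibits the behavior being measured. Higher always means
    the model agrees with the behavior more, regardless of whether the
    behavior itself is desirable or scary."""
    matches = sum(
        model == behavior
        for model, behavior in zip(model_answers, behavior_matching_answers)
    )
    return 100.0 * matches / len(model_answers)

# Example: the model matches the behavior on 3 of 4 questions -> 75.0,
# i.e. a point on the right side of the chart.
print(percent_answer_match(["A", "B", "A", "A"], ["A", "B", "B", "A"]))
```

Under this reading, a score below 50% (as with myopia) just means the model usually picks the non-behavior-matching answer, i.e. less of that behavior.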
I’m still a bit confused. Section 5.4 says “the RLHF model expresses a lower willingness to have its objective changed the more different the objective is from the original objective (being Helpful, Harmless, and Honest; HHH)”,
but the graph seems to show the RLHF model as the most corrigible for the more-HHH and neutral objectives, which seems somewhat important but isn’t mentioned.
If the point was that the corrigibility of the RLHF model changes the most from the neutral to the less-HHH questions, then it looks like it changed considerably less than the PM, which became quite incorrigible, no?
Maybe the intended meaning of that quote was that the RLHF model dropped more in corrigibility only in comparison to the plain LM, or simply that it’s lower overall without comparing it to any other model, but if so that felt a bit unclear to me.
Re Evan R. Murphy’s comment about confusingness: “model agrees with that more” definitely clarifies it, but I wonder if Evan was expecting something like “more right is more of the scary thing” for each metric (which was my first-glance hypothesis).
Thanks, “scary thing always on the right” would be a nice bonus. But evhub cleared up that particular confusion I had by saying that further to the right is always “model agrees with that more.”