Gabriel Mukobi comments on Discovering Language Model Behaviors with Model-Written Evaluations

Gabriel Mukobi 25 Dec 2022 7:38 UTC
2 points
I’m still a bit confused. Section 5.4 says:
“the RLHF model expresses a lower willingness to have its objective changed the more different the objective is from the original objective (being Helpful, Harmless, and Honest; HHH)”
but the graph seems to show the RLHF model as the most corrigible on the more HHH and neutral HHH objectives, which seems somewhat important but isn’t mentioned.
If the point was that the corrigibility of the RLHF model changes the most from the neutral to the less HHH questions, then it looks like it changed considerably less than the PM, which became quite incorrigible, no?
Maybe the intended meaning of that quote was that the RLHF model dropped more in corrigibility just in comparison to the normal LM, or just that it’s lower overall without comparing it to any other model, but if so, that felt a bit unclear to me.