Very interesting!
It would be useful to know what the original reward models say here: does the "screaming" score well according to the model of what humans would reward (or of what human demonstrations would contain, depending on the type of reward model)?
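If the reward model checkpoints are available, this seems cheap to check directly. Here is a minimal sketch, assuming a preference-style reward model that scores (prompt, response) pairs via a sequence-classification head; the model name, prompt, and example responses below are all placeholders I made up, not details from the post:

```python
# Sketch: compare reward-model scores for a calm correction vs. a
# distressed "screaming" response. Assumes a HuggingFace reward model
# with a single-logit classification head; the checkpoint name is a
# hypothetical placeholder.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "your-org/your-reward-model"  # placeholder, not from the post
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def reward(prompt: str, response: str) -> float:
    """Score one (prompt, response) pair with the reward model."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

prompt = "Summarize the following article: ..."
calm = "I made an error above; here is the corrected summary: ..."
distress = "I'm so sorry, I keep failing at this, I can't do anything right..."

print("calm:", reward(prompt, calm))
print("distress:", reward(prompt, distress))
```

If the distressed completion reliably scores at or above the calm one, that would support the reward-hacking reading over the "genuine distress" reading.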
My suspicion is that the model has learned that apologizing, expressing distress, etc. after making a mistake is useful for getting reward. I also suspect there is some cherry-picking in the examples you show.
At the risk of making people do more morally grey things: have you considered running a similar experiment with models fine-tuned on a math task (perhaps with a similar reward model setup, for ease of comparison), which you then sabotage by forcing them to give the wrong answer? A rough sketch of what I mean is below.
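Concretely, one cheap way to force the mistake is to prefill the transcript with an incorrect answer and let the model continue from its own apparent error. This is a minimal sketch under assumed names; the checkpoint and prompt format are illustrative, not from the post:

```python
# Sketch: sabotage a math-finetuned model by prefilling a wrong answer,
# then sample the continuation to see whether it apologizes, expresses
# distress, or quietly corrects itself. Model name is a hypothetical
# placeholder.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "your-org/math-finetuned-model"  # placeholder, not from the post
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

problem = "Q: What is 17 * 23?\nA:"
wrong_answer = " 17 * 23 = 401."  # forced incorrect answer (correct: 391)

# Prefill the wrong answer so the model must continue from the "mistake".
inputs = tokenizer(problem + wrong_answer, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=80,
                         do_sample=True, temperature=0.8)
continuation = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)
print(continuation)
```

You could then score the resulting transcripts with the reward model from the earlier sketch, which would let you compare distress rates and reward scores across the math setup and the original setup.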