Very interesting!
It would be useful to know what the original reward models say here: does the "screaming" score well according to the model of what humans would reward (or of what human demonstrations would contain, depending on the type of reward model)?
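If the reward model checkpoints are available, this seems cheap to check directly. Here is a minimal sketch, assuming a preference-style reward model that scores (prompt, response) pairs via a sequence-classification head; the model name, prompt, and example responses below are all placeholders I made up, not details from the post:

```python
# Sketch: compare reward-model scores for a calm correction vs. a
# distressed "screaming" response. Assumes a HuggingFace reward model
# with a single-logit classification head; the checkpoint name is a
# hypothetical placeholder.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "your-org/your-reward-model"  # placeholder, not from the post
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def reward(prompt: str, response: str) -> float:
    """Score one (prompt, response) pair with the reward model."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

prompt = "Summarize the following article: ..."
calm = "I made an error above; here is the corrected summary: ..."
distress = "I'm so sorry, I keep failing at this, I can't do anything right..."

print("calm:", reward(prompt, calm))
print("distress:", reward(prompt, distress))
```

If the distressed completion reliably scores at or above the calm one, that would support the reward-hacking reading over the "genuine distress" reading.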
My suspicion is that the model has learned that apologizing, expressing distress, etc. after making a mistake is useful for getting reward. I also suspect there is some cherry-picking in the examples you show.
At the risk of making people do more morally grey things: have you considered running a similar experiment with models fine-tuned on a math task (perhaps with a similar reward model setup, for ease of comparison), which you then sabotage by forcing them to give the wrong answer? A rough sketch of what I mean is below.
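Concretely, one cheap way to force the mistake is to prefill the transcript with an incorrect answer and let the model continue from its own apparent error. This is a minimal sketch under assumed names; the checkpoint and prompt format are illustrative, not from the post:

```python
# Sketch: sabotage a math-finetuned model by prefilling a wrong answer,
# then sample the continuation to see whether it apologizes, expresses
# distress, or quietly corrects itself. Model name is a hypothetical
# placeholder.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "your-org/math-finetuned-model"  # placeholder, not from the post
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

problem = "Q: What is 17 * 23?\nA:"
wrong_answer = " 17 * 23 = 401."  # forced incorrect answer (correct: 391)

# Prefill the wrong answer so the model must continue from the "mistake".
inputs = tokenizer(problem + wrong_answer, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=80,
                         do_sample=True, temperature=0.8)
continuation = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)
print(continuation)
```

You could then score the resulting transcripts with the reward model from the earlier sketch, which would let you compare distress rates and reward scores across the math setup and the original setup.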