Am I interpreting you correctly that the responses of both Opus 4 and o3 here are wrong according to the theorem?
Also, would the following restatement of the theorem be a correct understanding? The student model can’t ever become worse (according to the teacher) when fine-tuned on (any) outputs from the teacher, on any distribution.
This comment articulates the main thought I was having while reading this post. I wonder how Buck is avoiding this very trap, and whether there is any hope at all of the Moderate strategy overcoming this problem.