(e.g. R1's training being 'regularized' towards more interpretable CoT, despite DeepSeek not being too vocal about safety)
This is bad, actually. They are mixing process-based and outcome-based feedback. I think the particular way they did it (penalizing CoT that switches between languages) isn't so bad, but it's still a shame, because the point of faithful CoT is to see how the model really thinks 'naturally.' Training the CoT to look a certain way is like training on the test set, so to speak. It muddies the results. If they hadn't done that, we could probably learn something interesting by analyzing the patterns in when it uses English vs. Chinese language concepts.
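For concreteness, here is a minimal sketch (in Python) of what mixing an outcome-based accuracy reward with a process-based language-consistency term on the CoT could look like. The function names, the ASCII heuristic, and the weighting are all illustrative assumptions, not DeepSeek's actual implementation.

```python
def language_consistency(cot_tokens: list[str], target_lang: str = "en") -> float:
    """Fraction of CoT tokens judged to be in the target language.

    Crude illustrative heuristic: ASCII-only tokens count as English.
    """
    def is_target(tok: str) -> bool:
        return tok.isascii() if target_lang == "en" else not tok.isascii()

    if not cot_tokens:
        return 1.0
    return sum(is_target(t) for t in cot_tokens) / len(cot_tokens)


def total_reward(answer_correct: bool, cot_tokens: list[str], lam: float = 0.1) -> float:
    outcome_reward = 1.0 if answer_correct else 0.0       # outcome-based: did it get the answer?
    process_reward = language_consistency(cot_tokens)     # process-based: what the CoT looks like
    return outcome_reward + lam * process_reward


# A correct answer with a language-mixed CoT earns slightly less reward than one
# with a monolingual CoT, nudging the model's 'natural' reasoning style.
print(total_reward(True, ["solve", "the", "integral"]))    # 1.1
print(total_reward(True, ["solve", "这个", "integral"]))    # ~1.067
```

The point of the sketch is just that the second term scores the reasoning trace itself rather than the final answer, which is exactly the mixing of feedback types being objected to.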
I agree it’s bad news w.r.t. getting maximal evidence about steganography and the like happening ‘by default’. I think it’s good news w.r.t. lab incentives, even for labs which don’t speak too much about safety.