(e.g. R1's training being 'regularized' towards more interpretable CoT, despite DeepSeek not being too vocal about safety)
This is bad, actually. They are mixing process-based and outcome-based feedback. I think the particular way they did it (penalizing CoT that switches between languages) isn't so bad, but it's still a shame, because the point of faithful CoT is to see how the model really thinks 'naturally.' Training the CoT to look a certain way is like training on the test set, so to speak. It muddies the results. If they hadn't done that, we could probably learn something interesting by analyzing the patterns in when it uses English vs. Chinese language concepts.
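For concreteness, here is a minimal sketch (in Python) of what mixing an outcome-based accuracy reward with a process-based language-consistency term on the CoT could look like. The function names, the ASCII heuristic, and the weighting are all illustrative assumptions, not DeepSeek's actual implementation.

```python
def language_consistency(cot_tokens: list[str], target_lang: str = "en") -> float:
    """Fraction of CoT tokens judged to be in the target language.

    Crude illustrative heuristic: ASCII-only tokens count as English.
    """
    def is_target(tok: str) -> bool:
        return tok.isascii() if target_lang == "en" else not tok.isascii()

    if not cot_tokens:
        return 1.0
    return sum(is_target(t) for t in cot_tokens) / len(cot_tokens)


def total_reward(answer_correct: bool, cot_tokens: list[str], lam: float = 0.1) -> float:
    outcome_reward = 1.0 if answer_correct else 0.0       # outcome-based: did it get the answer?
    process_reward = language_consistency(cot_tokens)     # process-based: what the CoT looks like
    return outcome_reward + lam * process_reward


# A correct answer with a language-mixed CoT earns slightly less reward than one
# with a monolingual CoT, nudging the model's 'natural' reasoning style.
print(total_reward(True, ["solve", "the", "integral"]))    # 1.1
print(total_reward(True, ["solve", "这个", "integral"]))    # ~1.067
```

The point of the sketch is just that the second term scores the reasoning trace itself rather than the final answer, which is exactly the mixing of feedback types being objected to.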
I agree it’s bad news w.r.t. getting maximal evidence about steganography and the like happening ‘by default’. I think it’s good news w.r.t. lab incentives, even for labs which don’t speak too much about safety.