I pretty much agree with 1 and 2. I’m much more optimistic about 3-5 even ‘by default’ (e.g. R1’s training being ‘regularized’ towards more interpretable CoT, despite DeepSeek not being too vocal about safety), but especially so if labs deliberately try to maintain the nice properties from 1-2 and the interpretability of the CoT.
(e.g. R1’s training being ‘regularized’ towards more interpretable CoT, despite DeepSeek not being too vocal about safety)
This is bad, actually. They are mixing process-based and outcome-based feedback. I think the particular way they did it (penalizing CoT that switches between languages) isn’t so bad, but it’s still a shame, because the point of faithful CoT is to see how the model really thinks ‘naturally.’ Training the CoT to look a certain way is like training on the test set, so to speak: it muddies the results. If they hadn’t done that, we could probably learn something interesting by analyzing the patterns of when it uses English vs. Chinese language concepts.
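To make the ‘mixing’ concrete, here is a minimal sketch of what an outcome reward combined with a process-level language-consistency term could look like. The function names, the character-range heuristic, and the weight lam are illustrative assumptions, not DeepSeek’s actual reward; the point is only to show the shape of the combined signal.

```python
# Sketch of mixing outcome-based and process-based feedback in RL on CoT.
# All names, weights, and the language heuristic are illustrative, not
# DeepSeek's actual implementation.

def outcome_reward(answer: str, reference: str) -> float:
    """Outcome-based: reward only whether the final answer is correct."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def language_consistency(cot: str) -> float:
    """Process-based: fraction of alphabetic characters in the CoT that are
    ASCII letters (a crude proxy for 'stays in English')."""
    letters = [c for c in cot.lower() if c.isalpha()]
    if not letters:
        return 1.0
    in_target = sum("a" <= c <= "z" for c in letters)
    return in_target / len(letters)

def mixed_reward(cot: str, answer: str, reference: str, lam: float = 0.1) -> float:
    """Combined signal: correctness plus a term rewarding language-consistent CoT.
    The second term shapes *how* the model reasons, not just whether the final
    answer is right -- which is the 'training on the test set' worry above."""
    return outcome_reward(answer, reference) + lam * language_consistency(cot)
```

The second term is what removes the ‘natural’ CoT we would otherwise get to study: gradient flows through the style of the reasoning itself, not just the outcome.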
I agree it’s bad news w.r.t. getting maximal evidence about steganography and the like happening ‘by default’. I think it’s good news w.r.t. lab incentives, even for labs that don’t say much about safety.
I also think it’s important to notice how much less scary, and how much more likely to be easy to mitigate (at least strictly when it comes to technical alignment), this story seems compared to the scenarios from 10 years ago or so, e.g. from Superintelligence / from before LLMs, when pure RL seemed like the dominant paradigm for getting to AGI.
I don’t think it’s that much better actually. It might even be worse. See this comment: