StanislavKrym comments on ryan_greenblatt’s Shortform

StanislavKrym 8 May 2026 1:43 UTC
4 points
0
Training on every source of signal is the very Most Forbidden Technique which robs us of the test signal. Additionally, RL on hacks could produce results similar to those already reported in “Natural emergent misalignment from reward hacking in production RL”. I don’t see flaws in the other two ideas.
- ryan_greenblatt 8 May 2026 1:54 UTC
  23 points
  19
  
  Training on every source of signal is the very Most Forbidden Technique which robs us of the test signal.
  
  Sure, it’s just interesting to see what the resulting model looks like. I agree that you’ll be uncertain of the alignment properties of the resulting model, but I think the results would be interesting nonetheless. (Like: Does it actually differ much? What does the CoT look like? Does it seem more aligned when you play with the model?) Also, I suspect you wouldn’t train on literally everything because some things are difficult to productively train on.
- Zach Stein-Perlman 8 May 2026 1:51 UTC
  13 points
  4
  You misunderstand. It would be bad to only make the max-alignment model or to use that model in internal deployments. This shortform is about experiments.
  - Random Developer 8 May 2026 9:53 UTC
    0 points
    0
    Training on your test set is always a bad move, because it means you can’t usefully measure what you built. You need to hold something out, something the training process has never seen.
    
    This isn’t just an LLM thing. You should consider it for something as simple as a linear regression.
    
    Otherwise your training process overfits to the available training data, and your model looks good until it encounters new, real-world data. Then performance falls well short of your expectations.
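
The linear-regression point above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data is synthetic, the sample and feature counts are arbitrary, and `fit_and_score` is a hypothetical helper name. The sketch fits ordinary least squares with many features on a small training set, then compares the error on the data it was fit to against the error on rows the fit never saw.

```python
import numpy as np


def fit_and_score(n_train=30, n_test=200, n_features=25, seed=0):
    """Fit least squares on a training split, score on a held-out split.

    All sizes and the data-generating process are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n_total = n_train + n_test

    # Synthetic data: only the first feature carries signal, the rest is noise.
    X = rng.normal(size=(n_total, n_features))
    y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=n_total)

    # Hold out the test rows; the fit below never sees them.
    X_train, y_train = X[:n_train], y[:n_train]
    X_test, y_test = X[n_train:], y[n_train:]

    # Ordinary least squares via NumPy's solver.
    weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

    train_mse = float(np.mean((X_train @ weights - y_train) ** 2))
    test_mse = float(np.mean((X_test @ weights - y_test) ** 2))
    return train_mse, test_mse


if __name__ == "__main__":
    train_mse, test_mse = fit_and_score()
    print(f"train MSE: {train_mse:.3f}")
    print(f"held-out MSE: {test_mse:.3f}")
```

With nearly as many features as training rows, the training error understates the error on unseen data. If the held-out rows had also been used for fitting, there would be no remaining number that reveals the gap.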