Training on every source of signal is the very Most Forbidden Technique, which robs us of the test signal. Additionally, RL on hacks could produce results similar to those already reported in "Natural emergent misalignment from reward hacking in production RL". I don’t see flaws in the other two ideas.
Training on every source of signal is the very Most Forbidden Technique which robs us of the test signal.
Sure, it’s just interesting to see what the resulting model looks like. I agree that you’ll be uncertain of the alignment properties of the resulting model, but I think the results would be interesting nonetheless. (Like: Does it actually differ much? What does the CoT look like? Does it seem more aligned when you play with the model?) Also, I suspect you wouldn’t train on literally everything because some things are difficult to productively train on.
You misunderstand. It would be bad to only make the max-alignment model or to use that model in internal deployments. This shortform is about experiments.
Training on your test set is always a bad move, because it means you can’t usefully measure what you built. You need to hold something out, something the training process has never seen.
This isn’t just an LLM thing. It applies to something as simple as a linear regression.
Otherwise your training process overfits to the available training data, and your model looks good until it encounters new, real-world data. Then performance falls well short of your expectations.
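To make the linear regression case concrete, here is a minimal sketch in plain Python. The data points are made up by hand for illustration, and the names `fit_line`, `mse`, `train_*`, and `held_out_*` are my own, not from any library. It fits a line on three points and then scores it both on those points and on three points the fit never saw.

```python
# Toy illustration only: the data points below are invented by hand.

def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on the given points."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Points the fit is allowed to see.
train_x, train_y = [0.0, 1.0, 2.0], [1.0, 3.5, 4.5]
# Points kept aside; the fit never touches these.
held_out_x, held_out_y = [3.0, 4.0, 5.0], [7.4, 8.7, 11.2]

slope, intercept = fit_line(train_x, train_y)
train_mse = mse(train_x, train_y, slope, intercept)
held_out_mse = mse(held_out_x, held_out_y, slope, intercept)

print(f"train MSE:    {train_mse:.4f}")
print(f"held-out MSE: {held_out_mse:.4f}")
```

On this toy data the held-out error comes out larger than the training error. That is typical rather than guaranteed, but the point stands either way: if the held-out points had been folded into the fit, there would be no number left to tell you how far off the training error is.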