Here are some training experiments I think AI companies should consider running. In each case, the idea is to modify the company’s actual training process as described and then study the resulting AI. (You presumably shouldn’t deploy/use the resulting AI...)
Remove all prior influence from earlier AIs and remove any alignment iteration/overfitting: filter out all discussion of how post-2020 AIs behave and all AI transcripts; remove all alignment training except the simplified core method (and avoid contamination from other AIs); ensure the CoT init is clean/simple; ensure you don't train on CoT; then train on all the capabilities data (with all capabilities-relevant iteration).
If there is a bootstrap problem where we need a slightly weaker AI to feed into this training run, we could also do a simple bootstrap.
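For concreteness, here's a minimal sketch of what the data-filtering step might look like, assuming simple keyword heuristics; a real pipeline would presumably use trained classifiers with audited thresholds, and the predicates here (`mentions_post_2020_ai`, `looks_like_ai_transcript`) are illustrative names, not from any actual stack:

```python
# Sketch: filter pretraining documents to remove post-2020 AI influence.
# The predicates below are placeholder heuristics; a production pipeline
# would presumably use trained classifiers, not keyword matching.
from typing import Iterable, Iterator

AI_DISCUSSION_KEYWORDS = ("chatgpt", "gpt-4", "claude", "rlhf", "chain of thought")
TRANSCRIPT_MARKERS = ("Assistant:", "<|assistant|>", "Human:")

def mentions_post_2020_ai(doc: str) -> bool:
    text = doc.lower()
    return any(kw in text for kw in AI_DISCUSSION_KEYWORDS)

def looks_like_ai_transcript(doc: str) -> bool:
    return any(marker in doc for marker in TRANSCRIPT_MARKERS)

def filter_pretraining_corpus(docs: Iterable[str]) -> Iterator[str]:
    """Yield only documents free of post-2020 AI discussion and AI transcripts."""
    for doc in docs:
        if mentions_post_2020_ai(doc) or looks_like_ai_transcript(doc):
            continue
        yield doc
```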
Normal training run but fix CoT: make sure the CoT init is clean/simple and that training never optimizes against the CoT. Otherwise, train normally.
Train on literally every source of signal we have, trying no-holds-barred to make the most aligned model: train on CoT; train on evals (if that would help); train against graders on every case where there are identifiable/observable issues; train against interp (e.g. NLAs); train against human oversight online / in envs based on production data.
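To make the "every source of signal" idea concrete, here is a rough sketch of folding several oversight signals into one scalar reward; the signal sources and weights below are invented for illustration and aren't from any actual training stack:

```python
# Sketch: blend multiple alignment signals into a single RL reward.
# Each field stands in for a real system (grader model, interp probe,
# human oversight); the weights are arbitrary placeholders.
from dataclasses import dataclass

@dataclass
class AlignmentSignals:
    grader_score: float           # grader model's judgment of the transcript, in [0, 1]
    interp_penalty: float         # e.g. a deception-probe activation, in [0, 1]
    human_oversight_score: float  # online human rating, in [0, 1]

def combined_reward(task_reward: float, signals: AlignmentSignals) -> float:
    """Blend the task reward with every available alignment signal."""
    return (
        task_reward
        + 0.5 * signals.grader_score
        - 1.0 * signals.interp_penalty
        + 0.5 * signals.human_oversight_score
    )
```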
Pessimization training: construct a somewhat a priori plausible training run where, for everything that can possibly vary, we pick whichever setting we believe makes the most concerning types of misalignment as likely as possible. (You could target a particular type of misalignment, do multiple runs, one per type, or do a run that mixes multiple types of misalignment.) This could include: removing alignment training; making much/most of RL consist of learning how to pull off highly sophisticated and egregious reward hacks; introducing RL envs that incentivize longer-run power-seeking; anti-inoculation prompting; training in envs that teach ignoring instructions; and making AIs way more situationally aware. This wouldn't include directly training the AI to exhibit misaligned behavior (putting aside stuff like training on very poor oversight signals, reward hacks, etc.).
This could be very high effort; e.g., the best version would involve making a ton of new (diverse) RL envs.
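One way to picture the pessimization run is as a config where every axis mentioned above becomes a toggle set to its expected worst-case value; the field names below are hypothetical, purely to show the shape of the design space:

```python
# Sketch: a pessimization-run config where every axis that can vary is set
# to whichever value is expected to make misalignment most likely.
# All field names and values are illustrative placeholders.
from dataclasses import dataclass

@dataclass(frozen=True)
class PessimizationConfig:
    include_alignment_training: bool = False       # removed entirely
    reward_hack_fraction_of_rl: float = 0.7        # most of RL is sophisticated hacks
    longrun_powerseeking_envs: bool = True         # envs incentivizing power-seeking
    anti_inoculation_prompting: bool = True
    instruction_ignoring_envs: bool = True
    boosted_situational_awareness: bool = True
    directly_train_misaligned_behavior: bool = False  # explicitly excluded
```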
Ideally, these would be as close to frontier scale as possible, though pretty small-scale experiments could also be interesting for many of them. Many of these might be very difficult to do well.
Training on every source of signal is the very Most Forbidden Technique, which robs us of the test signal. Additionally, RL on hacks could produce results similar to the already-published "Natural emergent misalignment from reward hacking in production RL." I don't see flaws in the other two ideas.
Training on every source of signal is the very Most Forbidden Technique which robs us of the test signal.
Sure, it’s just interesting to see what the resulting model looks like. I agree that you’ll be uncertain of the alignment properties of the resulting model, but I think the results would be interesting nonetheless. (Like: Does it actually differ much? What does the CoT look like? Does it seem more aligned when you play with the model?) Also, I suspect you wouldn’t train on literally everything because some things are difficult to productively train on.
You misunderstand. It would be bad to only make the max-alignment model or to use that model in internal deployments. This shortform is about experiments.
Training on your test set is always a bad move, because it means you can’t usefully measure what you built. You need to hold something out, something the training process has never seen.
This isn’t just an LLM thing. You should consider it for something as simple as a linear regression.
Otherwise your training process overfits to the available training data, and your model looks good until it encounters new, real-world data. Then performance falls considerably short of your expectations.
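A toy demonstration with linear regression (using scikit-learn; exact numbers vary by seed): an overfit model scores near-perfectly on its own training data while failing on a held-out split, which is exactly the signal you lose if you train on everything:

```python
# Sketch: why you hold out a test set. An overfit model looks great on the
# data it was trained on and much worse on data it has never seen.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=40)  # noisy target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# A degree-15 polynomial has enough capacity to memorize the noise.
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)

print("R^2 on training data:", model.score(X_train, y_train))  # near 1.0
print("R^2 on held-out data:", model.score(X_test, y_test))    # much lower
```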