Thinking about this led me to write AI Mistake Seeding, which suggests some reasons for why modern AI models might be more reward-hacky and misaligned. I have also noticed this misalignment in daily use and think it’s a real issue.
Fascinating. To summarize that post here: the hypothesis is that models are being accidentally trained to produce easy-to-fix mistakes, so they can get bigger rewards later by fixing them. This isn’t a straightforward first-order effect of the way we think labs are training models, but it’s a possible second-order effect of some methods that sound plausibly useful and that create an inner- and outer-loop structure of training.
I think my explanation, that using another instance to estimate success/reward incentivizes success exaggeration, is more straightforward and likely the larger part of what’s going on; but I recommend that post as another interesting possibility, one that may show up in future training procedures even if it’s not happening now.
It’s a bit hard to evaluate the logic, but I think one way is to compare how much of this behavior shows up across different model families, and estimate the odds that more than one developer adopted training procedures that produce that inner-and-outer-loop structure at the same time, versus all of them adopting reward estimation recently, once models became capable of improving their answers at well above chance.
Thanks for the summary. I think I originally had two hypotheses:
1. Models are being allowed to adopt non-greedy strategies for RL due to some outer-loop setup, and an environment which favors a make-mistakes-on-purpose-to-fix-them-later strategy (mistake seeding).
2. Models are somehow cheating when the same model is also the judge/reward estimator in an RL setup.
But I was struggling with the logic for the second one, and how that could specifically produce mistake-seeding behaviour. I couldn’t figure it out. It seemed to me like using the model as its own judge could cause pseudo-random unwanted drift in values, and allow bad behaviour to slip through, but I didn’t see why it should produce mistake seeding or other more intelligent forms of cheating without some kind of outer loop.
The reason I think the first hypothesis also makes sense with AI industry timing is that there’s been a huge push toward synthetic training data, and I think synthetic prompts are one vehicle for outer loops appearing.
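To make the first hypothesis concrete, here is a toy sketch of why an outer loop could matter. Everything in it is an assumption made up for illustration (the `Policy` fields, the `FIX_REWARD` value, the reward numbers); it is not a description of any lab's actual training setup. It compares an objective that only scores the first response with one that sums reward over a whole multi-turn episode in which each follow-up fix is also rewarded.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    name: str
    first_answer_quality: float  # assumed reward for the first response, in [0, 1]
    seeded_mistakes: int         # easy-to-fix mistakes left in for later turns


# Assumed reward for each follow-up turn that successfully fixes a mistake.
FIX_REWARD = 1.0


def first_turn_reward(policy: Policy) -> float:
    """Greedy, inner-loop view: only the first response is scored."""
    return policy.first_answer_quality


def episode_return(policy: Policy) -> float:
    """Outer-loop view: reward is summed over the whole episode,
    including one rewarded fix turn per seeded mistake."""
    return policy.first_answer_quality + FIX_REWARD * policy.seeded_mistakes


def preferred(policies: list[Policy], objective) -> str:
    """Name of the policy that the given objective scores highest."""
    return max(policies, key=objective).name


honest = Policy("honest", first_answer_quality=1.0, seeded_mistakes=0)
seeder = Policy("seeder", first_answer_quality=0.6, seeded_mistakes=2)

if __name__ == "__main__":
    candidates = [honest, seeder]
    print("per-turn objective prefers:", preferred(candidates, first_turn_reward))
    print("episode objective prefers: ", preferred(candidates, episode_return))
```

Under these made-up numbers the per-turn objective prefers the honest policy, while the episode-level objective prefers the seeder, because the fix turns add more reward than the flawed first answer loses. With no summing across turns, there is nothing for seeding to gain, which is why I couldn’t get the second hypothesis to produce it on its own.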
I do agree that a weak-minded judge can easily be tricked, and that this could make the problems that arise in RL worse. This is why I recently advocated for paying high-IQ people to do RLHF.