I’ll also point to OpenAI’s weak-to-strong generalization results, where increasingly strong students keep generalizing better given labels from a fixed-size teacher. We just don’t live in a world where this issue is a lethality.
For a fixed weak teacher and increasingly stronger students from a fixed model stack[1], I think you can probably avoid performance ever going down on most/typical tasks if you properly use early stopping, only use process-based feedback, and the model isn’t intentionally trying to perform poorly.
You might instead have expected performance to go up and then eventually go down with scale, but I think you can likely avoid this with early stopping (if you carefully find the right stopping point using scaling laws and analogous validation domains where we can ensure we get good labels, or other methods of getting validation).
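To make the early-stopping idea concrete, here’s a toy sketch. All the dynamics here are invented for illustration (nothing below corresponds to real training data): the teacher-labeled score keeps rising as the student fits the teacher’s errors, while the true score peaks earlier, and the stopping point is chosen using the trusted signal rather than the teacher labels:

```python
# Toy sketch (hypothetical dynamics, not real training data): a student
# whose teacher-labeled score keeps rising while its true score peaks
# and then declines, with early stopping chosen via a trusted signal.

def simulate_training(steps=100):
    history = []
    for t in range(steps):
        task_skill = t / (t + 10)            # genuine ability, saturating
        error_fit = 0.5 * (t / steps) ** 2   # fitting the teacher's mistakes
        measured = task_skill + error_fit    # score under teacher labels
        true_score = task_skill - error_fit  # score under correct labels
        history.append((t, measured, true_score))
    return history

def pick_stopping_point(history):
    # Stop where the trusted-validation (true) score peaks, not where
    # the teacher-labeled score peaks -- the latter never stops rising.
    return max(history, key=lambda rec: rec[2])

history = simulate_training()
t_stop, measured_at_stop, true_at_stop = pick_stopping_point(history)
```

In practice you don’t observe the true score directly, of course; the claim above is that scaling laws plus analogous validation domains with trustworthy labels can stand in for it.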
If I recall correctly, we also see something similar in the scaling laws for reward model overoptimization work by Leo Gao (also done at OpenAI). (That’s probably a more analogous case in most ways than the weak-to-strong results as far as understanding the dynamics of fitting to human errors.)
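For reference, the functional forms I recall from that paper, with d the square root of the KL divergence from the initial policy — worth checking against the paper itself, and the alpha/beta coefficients below are made-up illustrative values rather than fitted ones:

```python
import math

# Functional forms from "Scaling Laws for Reward Model Overoptimization"
# (Gao et al.), as I recall them. d = sqrt(KL(policy || initial policy));
# alpha/beta are fitted per-setup coefficients -- values below are made
# up purely for illustration.

alpha, beta = 1.0, 0.3  # illustrative only, not fitted values

def gold_reward_bon(d):
    """Gold reward under best-of-n sampling."""
    return d * (alpha - beta * d)

def gold_reward_rl(d):
    """Gold reward under RL against the proxy reward model."""
    return d * (alpha - beta * math.log(d))

# Both curves rise and then fall: past some amount of optimization
# against the proxy, gold reward goes *down* -- the same up-then-down
# dynamic as fitting to labeler errors.
d_star_rl = math.exp(alpha / beta - 1)  # peak of the RL curve
```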
(Let’s put aside the case where the model intentionally tries to perform poorly. I’m not even sure this case actually looks that different, but it certainly complicates the analysis. I’m doing some work on this case, looking at model organisms of intentionally poor performance, and I expect that for these exact model organisms we’ll probably see performance going up and then back down again with scale in at least some cases.)
(To be clear, I don’t think this “performance never goes down with correct early stopping” claim is totally obvious. It will depend on the exact rate at which AIs learn to predict labeler errors vs. learn what the task is and how to do it, and on how these rates evolve with scale. If the sigmoid for the error-learning rate vs. scale has a different midpoint and slope than the sigmoid for learning the task, you can absolutely have actual performance go down.)
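As a toy illustration of that last point (all parameters are hypothetical): model task ability and error-prediction ability as two sigmoids in log-scale with different midpoints and slopes, and take true performance to be task ability minus the cost of fitting errors:

```python
import math

# Toy model (hypothetical parameters): task-learning and error-learning
# are each a sigmoid in log-scale, with different midpoints and slopes.

def sigmoid(x, midpoint, slope):
    return 1.0 / (1.0 + math.exp(-slope * (x - midpoint)))

def true_performance(log_scale):
    task = sigmoid(log_scale, midpoint=0.0, slope=1.0)    # learned earlier, gradually
    errors = sigmoid(log_scale, midpoint=2.0, slope=3.0)  # learned later, sharply
    return task - 0.6 * errors  # fitting labeler errors costs true performance

scales = [s / 10 for s in range(-40, 61)]
perf = [true_performance(s) for s in scales]
# With these midpoints/slopes, perf rises with scale and then falls:
# the error sigmoid kicks in later but harder than the task sigmoid.
```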
While I think you can avoid performance ever degrading via process-based feedback using scaling laws on most/typical tasks, I think performance will also plateau due to how early you’ll end up needing to stop.
More concerningly, there might be alternatives to purely process-based human labels which don’t plateau in performance, which seem to increase performance, but which rarely result in egregiously bad behavior. Most notably, various types of outcome-based feedback might have this property.
As an example of this, I expect that you could construct “hackable/exploitable” game environments that exhibit this. More specifically:
We’ll pretrain models on a collection of hackable game envs, training a model stack of variable training compute.
We’ll then finetune these models in a new, different (but still hackable) Atari env where we expect to see transfer from the pretraining envs.
It seems likely to me that, if exploiting is ultimately the better strategy, then as models get smarter, final finetuned performance goes down even with early stopping.
You might be able to see this on some Atari games with added semi-realistic exploits? I’m unsure.
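The proposed experiment could be sketched as follows. Everything here is a hypothetical stand-in (no real envs, models, or training API), just the shape of the protocol and the dynamic I’d expect:

```python
# Sketch of the proposed experiment (hypothetical stand-ins throughout).

def pretrain(compute):
    # Pretrain on a collection of hackable game envs; more compute
    # yields a more capable model (toy saturating relationship).
    capability = compute / (compute + 1.0)
    return capability

def finetune_and_eval(capability, exploit_payoff=2.0, intended_payoff=1.0):
    # Finetune in a new, still-hackable env. A capable-enough model
    # finds the exploit, which pays better under the proxy reward.
    p_exploit = capability ** 4  # exploits take more capability to find
    proxy = (p_exploit * exploit_payoff
             + (1 - p_exploit) * intended_payoff * capability)
    true_score = (1 - p_exploit) * intended_payoff * capability
    return proxy, true_score

# Model stack of variable training compute.
results = [(c, *finetune_and_eval(pretrain(c))) for c in (0.5, 1, 2, 4, 8, 16)]
# Proxy reward rises monotonically with compute, while intended-task
# performance rises then falls once exploiting becomes the better strategy.
```

The point of the sketch is the qualitative shape: the proxy signal looks like performance keeps increasing, while the intended-task score peaks and then degrades with scale even under early stopping.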
(An aside: I edited my original comment to clarify that I was saying “These results show that #20 is not some obvious lethality which merits the name”, but I still certainly think that labeling mistakes can and will make some things go meaningfully wrong.)
[1] As in, you just vary the training compute and set all other values optimally based on this.