Another approach is to make the strong student (the AI system) generalize correctly from the imperfect labels provided by the weak teacher. The hope for these weak-to-strong generalization techniques is that we can do better than naively relying on unreliable feedback from a weak overseer, and instead access the latent, greater capabilities that our AI system has, perhaps by a simple modification of the training objective.
I think we should allow the strong system to use the Scientific Method. So, if it has multiple incompatible hypotheses about what the weaker system wants and why, and it needs to distinguish between them, it can do an experiment: devise useful new test cases that (a) should help it distinguish between the hypotheses, and (b) it believes are simple enough for the weak system to label well.
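As a minimal sketch of this experiment-design step, suppose each hypothesis about the weak teacher is represented as a predictor of the label the teacher would assign. All names here (`pick_experiment`, `is_simple_enough`, the toy hypotheses) are hypothetical illustrations, not an existing API; the point is just that the best experiment is a simple-enough case on which the hypotheses disagree most:

```python
def disagreement(hypotheses, case):
    """Fraction of hypothesis pairs predicting different labels for `case`."""
    labels = [h(case) for h in hypotheses]
    pairs = [(a, b) for i, a in enumerate(labels) for b in labels[i + 1:]]
    return sum(a != b for a, b in pairs) / max(len(pairs), 1)

def pick_experiment(hypotheses, candidate_cases, is_simple_enough):
    """Choose the simple-enough candidate that best distinguishes the hypotheses."""
    viable = [c for c in candidate_cases if is_simple_enough(c)]
    return max(viable, key=lambda c: disagreement(hypotheses, c))

# Toy usage: two hypotheses about the teacher's preference over integer "cases".
h1 = lambda x: x % 2  # "teacher labels odd cases 1"
h2 = lambda x: 1      # "teacher labels everything 1"
case = pick_experiment([h1, h2], range(10), is_simple_enough=lambda c: c < 5)
# `case` is an even number below 5: simple enough, and the hypotheses disagree on it.
```

In a real system the candidate pool and the simplicity check would themselves be generated and judged by the strong model, but the selection criterion (maximize expected disagreement subject to labelability) carries over.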
Clearly, to do this, it first needs to be able to determine this fact about itself: that it's uncertain. The most obvious way to do this for an LLM would be an ensemble of models, but hopefully there are less expensive approaches to determining uncertainty (though those would need to be applied to steps in the model's CoT reasoning about the causes of labels from the weaker model, not just to factual questions).
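The ensemble approach to per-step uncertainty can be sketched very simply: flag a reasoning step as uncertain when ensemble members disagree on it. Everything below is a hypothetical stand-in (real ensemble members would be independently trained or sampled models, not lambdas), just to make the disagreement criterion concrete:

```python
from collections import Counter

def step_uncertainty(models, cot_step):
    """1 minus the fraction of ensemble members agreeing with the modal answer."""
    answers = [m(cot_step) for m in models]
    _, modal_count = Counter(answers).most_common(1)[0]
    return 1.0 - modal_count / len(answers)

def uncertain_steps(models, cot_steps, threshold=0.3):
    """Return the CoT steps whose ensemble disagreement exceeds the threshold."""
    return [s for s in cot_steps if step_uncertainty(models, s) > threshold]

# Toy usage: three "models" that agree on step "a" but split on step "b".
models = [lambda s: {"a": 1, "b": 1}[s],
          lambda s: {"a": 1, "b": 0}[s],
          lambda s: {"a": 1, "b": 1}[s]]
print(uncertain_steps(models, ["a", "b"]))  # ["b"]
```

Cheaper single-model alternatives (e.g. sampling the same model several times at nonzero temperature) would slot into the same interface; the key requirement from the text is that the uncertainty score attaches to individual steps of the causal reasoning about the weak teacher's labels.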
Then once the experiment is done, we need it to be able to do an approximate-Bayesian update on the results: many training methods implement approximate Bayesian inference, so this part seems easy enough.
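For a small, explicit hypothesis set the update is just Bayes' rule; in practice a training method would only approximate this, but the exact discrete version shows what the experiment buys us. The hypothesis names and likelihood numbers below are invented for illustration:

```python
def bayes_update(prior, likelihoods, observed_label):
    """posterior(h) proportional to prior(h) * P(observed_label | h)."""
    unnorm = {h: p * likelihoods[h](observed_label) for h, p in prior.items()}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

# Toy usage: two hypotheses about why the weak teacher preferred response B.
# "flattery" predicts the teacher labels the flattering response 1 with prob 0.9;
# "substance" predicts that only with prob 0.4.
prior = {"flattery": 0.5, "substance": 0.5}
likelihoods = {"flattery": lambda y: 0.9 if y == 1 else 0.1,
               "substance": lambda y: 0.4 if y == 1 else 0.6}
posterior = bayes_update(prior, likelihoods, observed_label=1)
print(round(posterior["flattery"], 3))  # 0.692
```

A single informative label already shifts the posterior substantially; iterating experiment and update narrows the hypothesis set further.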
The exciting thing about this is that it lets us do "mistake theory": e.g. the strong student is highly sure that the weak teacher will prefer response B over response A to a question because response B skillfully uses subtle flattery (or some other rhetorical device) while A does not. Using some combination of its detailed knowledge of human values and preferences, and/or asking questions of the weaker teacher, and/or debate, and/or constitutional AI, it can identify flattery as a technique for inducing a mistake from the weaker teacher, not a valid way of actually making the answer better, and thus stop doing this even though it works (unlike RL, which would reinforce the flattery precisely because it works).