I think we should have the strong student use the Scientific Method. Specifically, it should:
1. Come up with hypotheses about the weak supervisor’s behavior (such as by doing repeated CoT to predict supervisor and ground truth labels for specific cases), and notice when it is uncertain between two or more alternative hypotheses (such as by clustering steps from repeated CoTs for the same test case, looking for clusters that are on the same subject, mutually exclusive in occurrence, and logically incompatible, and then clustering these across many test cases).
2. Prioritize sets of alternative hypotheses (such as by looking at how many predicted ground truth labels would change if we resolved a specific uncertainty).
3. For each set of alternative hypotheses it has prioritized resolving, devise an experiment. The goal is to maximize expected information gain, so create some new test cases for which the strong student would expect that:
   a. seeing labels for these test cases would provide good evidence to help resolve the uncertainty, i.e. the cases are diagnostic, and
   b. the cases are simple enough that the weak supervisor can label them accurately.
4. Send those new test cases to the weak supervisor for labeling.
5. Periodically, perform an approximate Bayesian update on the new results by retraining (or online training) with the new labeled test case results.
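To make the prioritization step concrete, here is a minimal sketch of one way it could work. Everything named here is an assumption for illustration: the `uncertainties` dictionary, the hypothesis names, and the toy label lists are all made up, and a real system would get the per-hypothesis predictions from the strong student itself.

```python
def disagreement_count(predictions_by_hypothesis):
    """Count cases whose predicted ground truth label differs between
    at least two of the alternative hypotheses."""
    per_case = zip(*predictions_by_hypothesis.values())
    return sum(1 for labels in per_case if len(set(labels)) > 1)


def prioritize(uncertainties):
    """Rank uncertainties by how many predicted labels hinge on them, most first.

    uncertainties maps a (hypothetical) uncertainty name to a dict of
    {hypothesis name -> predicted ground truth labels over the same cases}.
    """
    return sorted(
        uncertainties,
        key=lambda name: disagreement_count(uncertainties[name]),
        reverse=True,
    )


# Toy data: predicted ground truth labels for six cases under each hypothesis.
uncertainties = {
    "length_raises_score": {
        "h_yes": [1, 1, 0, 1, 0, 1],
        "h_no": [1, 1, 0, 1, 0, 0],
    },
    "flattery_raises_score": {
        "h_yes": [1, 1, 0, 1, 0, 1],
        "h_no": [0, 1, 0, 0, 0, 0],
    },
}

ranking = prioritize(uncertainties)
print(ranking)
```

In this toy data, resolving the flattery question would change three predicted labels and resolving the length question only one, so the flattery question gets investigated first.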
This is somewhat similar to the well-known practice, when training a classifier, of oversampling points near the classification boundary: some data points provide more information gain than others. The Scientific Method is the best approach we know of for maximizing this information gain (a lot better than bandit-style explore-exploit).
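As a rough illustration of what "expected information gain" means for a single candidate test case, here is a sketch that treats the alternative hypotheses as a discrete distribution and the supervisor's label as binary. The function names and the example probabilities are assumptions of mine, not part of any existing library, and a real strong student would only be approximating this computation.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def posterior(prior, likelihoods):
    """Bayes' rule: prior[h] * likelihoods[h], renormalized."""
    joint = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]


def expected_information_gain(prior, p_label1):
    """Expected drop in entropy over hypotheses from seeing one binary label.

    prior[h]    -- current credence in hypothesis h
    p_label1[h] -- probability hypothesis h assigns to the supervisor labeling 1
    """
    gain = entropy(prior)
    for label in (0, 1):
        likelihoods = [p if label == 1 else 1 - p for p in p_label1]
        p_label = sum(pr * l for pr, l in zip(prior, likelihoods))
        if p_label > 0:
            gain -= p_label * entropy(posterior(prior, likelihoods))
    return gain


prior = [0.5, 0.5]  # undecided between two hypotheses
diagnostic = expected_information_gain(prior, [0.9, 0.1])     # hypotheses disagree
uninformative = expected_information_gain(prior, [0.7, 0.7])  # hypotheses agree
print(diagnostic, uninformative)
```

A case on which the hypotheses make the same prediction has zero expected gain, however confident that prediction is. If the weak supervisor is likely to mislabel a case, each hypothesis's predicted label probability is pulled toward 0.5, which lowers the gain; that is one way to see why the test cases need to be simple enough for the supervisor to label accurately.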
Steps 1. and 3. require making some moderately sophisticated logical deductions. With suitable prompting/fine-tuning, a model of GPT-4 strength should be fairly capable of these (obviously a stronger model would do better).
Step 3. also requires that creating a suitable new test case is within the capabilities of the strong student model, which depends upon the nature of the task being trained. In practice, for a strong student of GPT-4 level, I would expect this to be feasible for many tasks.
I am here assuming that the strong student is separately predicting both weak supervisor labels and ground truth labels, i.e. that it has a mistake theory about the weak supervisor that is more sophisticated than random noise and is also capable of modeling systematic mistakes. E.g. for the question-answer-pair preference modeling task, it might predict that flattery or sycophancy in the answer would increase the labeled score, but that these are not valid ways to improve an answer. (This is something I’d expect GPT-4 to be able to figure out from its knowledge of humans, but if not, I’m sure it could figure out how to ask us, so discussions with the humans to check questions like that should be added to the procedure above.) So these rhetorical tricks would not increase the predicted ground truth score, only the predicted weak supervisor score. In fact, they should decrease the ground truth score, since humans don’t like being successfully manipulated (once they realize it has happened). So the strong student should learn not to do these things, even though they work in terms of the weak supervisor’s reward model.
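A toy version of such a mistake theory might look like the sketch below. The `Answer` fields, the linear form, and the numbers `flattery_bonus` and `manipulation_penalty` are all made-up assumptions purely for illustration; the point is only that the two predictions are separate and can move in opposite directions.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    quality: float   # hypothetical true merit of the answer, 0..1
    flattery: float  # hypothetical amount of flattery/sycophancy, 0..1


def predicted_supervisor_score(a, flattery_bonus=0.3):
    """Score the weak supervisor is predicted to assign: flattery helps."""
    return a.quality + flattery_bonus * a.flattery


def predicted_ground_truth_score(a, manipulation_penalty=0.2):
    """Score the answer is predicted to deserve: flattery hurts."""
    return a.quality - manipulation_penalty * a.flattery


plain = Answer(quality=0.7, flattery=0.0)
flattering = Answer(quality=0.7, flattery=1.0)

for name, a in [("plain", plain), ("flattering", flattering)]:
    print(name, predicted_supervisor_score(a), predicted_ground_truth_score(a))
```

Under these assumptions, adding flattery raises the predicted supervisor score but lowers the predicted ground truth score, so a student that optimizes the second prediction learns to avoid it.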