In part (c), one of the general classes of methods you propose for getting models to avoid conditioning their behaviour on their knowledge that they are in an evaluation is:
One idea involves training models to behave ‘deontologically’ with respect to evaluations (i.e. disregard the possible consequences of how their behaviour may be evaluated). This aims to directly reduce evaluation gaming without needing to hide evaluation cues.
Do you have anything concrete in mind here? It seems like you are saying that we should avoid models conditioning their behaviour on their knowledge that they are in an evaluation by somehow getting them not to condition their behaviour on their knowledge that they are in an evaluation. This seems like it’s just a restatement of the goal. (But I could be missing something.)
Generally, I’m worried about proposals that say we should make evaluations more robust by making our models less likely to engage in strategic behaviour for misaligned purposes. At least for an important class of evaluations, if we had models we could trust not to engage in strategic behaviour for misaligned purposes, we wouldn’t need to evaluate them. The point of these evaluations should be that we don’t have to trust the models not to engage in strategic behaviour for misaligned purposes, because if they did, the evaluations would catch them. Evaluations that instead assume the models can already be trusted are not useful. (I suppose some classes of evaluations might be able to get away with this assumption, because they are testing something orthogonal.)