I still think this is a serious concern that prevents us from finding “the truth predictor”, but here is a way around it:
is_it_bad(x) = max_predictor consistency(predictor) subject to the constraint that predictor(x) ≠ predictor(obvious good inputs)
For x such that all coherent (and high-salience) views predict it is good, it won’t be possible to find a consistent and salient predictor that gives x a different label from obvious good inputs, so is_it_bad(x) will be low;
For x that the model thinks is bad (regardless of what other coherent views think, and in particular coherent human simulators!), is_it_bad(x) will be high!
For x such that some coherent view (e.g. the human simulator) thinks x is bad, is_it_bad(x) will also be high.
The proposed is_it_bad has no false negatives. It has false positives (in situations where the human simulator predicts the input is bad but the model knows it’s fine), but I think this might be fine for many applications (as long as we correctly use some prior/saliency/simplicity properties to exclude coherent predictors that are not interesting, without excluding the ground truth). In particular, it might be outer-alignment safe-ish to maximize outcome(x) - large_constant * is_it_bad(x).
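For concreteness, here is a minimal sketch of how that penalized objective could be used to score and pick candidate inputs. Everything here is illustrative: outcome and is_it_bad are hypothetical black boxes (a training-based sketch of is_it_bad itself follows the clarification below), and LARGE_CONSTANT is just a placeholder.

```python
# Hypothetical helpers: outcome(x) scores how good the outcome looks,
# is_it_bad(x) is the consistency-based flag defined above.
LARGE_CONSTANT = 1e3  # meant to dominate outcome(x) whenever is_it_bad(x) is non-negligible

def penalized_objective(x, outcome, is_it_bad):
    # Maximizing this should only favor inputs that no coherent, salient
    # predictor is willing to label differently from obvious good inputs.
    return outcome(x) - LARGE_CONSTANT * is_it_bad(x)

def pick_best(candidates, outcome, is_it_bad):
    # Select the candidate input with the highest penalized score.
    return max(candidates, key=lambda x: penalized_objective(x, outcome, is_it_bad))
```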
Clarification: the reason this bypasses the concern is that you can compute the max with training instead of enumeration.
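A minimal sketch of what “computing the max with training instead of enumeration” could look like, assuming PyTorch, feature vectors for x and for a batch of obvious good inputs, and a differentiable consistency(predictor) score supplied by the surrounding setup; the MLP probe, the hinge penalty, and all hyperparameters are illustrative assumptions rather than part of the original proposal.

```python
import torch
import torch.nn as nn

def is_it_bad(x, obvious_good, consistency, n_steps=1000, lr=1e-3,
              margin=1.0, penalty=10.0):
    """Approximate max over predictors of consistency(predictor), subject to
    predictor(x) != predictor(obvious good inputs), by training a small probe
    with gradient ascent and enforcing the constraint as a hinge penalty.

    x:            feature vector for the query input, shape (d,)
    obvious_good: feature vectors for obviously good inputs, shape (n, d)
    consistency:  differentiable score of how coherent the probe's view is
                  (assumed to be provided by the surrounding setup)
    """
    d = x.shape[-1]
    # Illustrative predictor class: a small MLP probe over the features.
    probe = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(n_steps):
        score = consistency(probe)                        # how coherent this predictor is
        gap = probe(x).mean() - probe(obvious_good).mean()
        # Hinge penalty: zero only when x is labeled clearly differently
        # from the obvious good inputs.
        violation = torch.relu(margin - gap.abs())
        loss = -score + penalty * violation               # maximize consistency under the constraint
        opt.zero_grad()
        loss.backward()
        opt.step()
    # The best achievable consistency under the constraint is the badness score.
    return consistency(probe).item()
```

The hinge penalty is just one way to impose the constraint; a Lagrangian or projection-based scheme would also work, and the returned value is simply the best consistency found under the constraint.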