Ok, the softer constraints make sense to me, thanks.
Using a debate with f+ assessing simple closed questions makes sense, but it seems to me that this only moves much of the problem rather than solving it. We start with “answering honestly vs predicting human answers” and end up with “judging honestly vs predicting human judgments”.
While “Which answer is better, Alice’s or Bob’s?” is a closed question, learning to answer it in the general case still requires applying a full model of human values, so it seems a judge-model is likely to be instrumental (or essentially equivalent: again, I’m not really sure what we’d mean by an intended model for the judge).
But perhaps I’m missing something here; is predicting-the-judge less of a problem than the original? Are there better approaches than using debate which wouldn’t have analogous issues?