They don’t cite the de-calibration result from the GPT-4 paper, but the distribution of GPT-4’s ratings here looks like it’s been tuned to be mealy-mouthed: humped at 60%, so it agrees with whatever you say but can’t even do so enthusiastically (https://arxiv.org/pdf/2310.13014.pdf#page=6).