The calibration question is an n=1 sample on one of the two important axes (those axes being who’s answering, and what question they’re answering). Give a question that’s harder than it looks, and people will come out overconfident on average; give a question that’s easier than it looks, and they’ll come out underconfident on average. Getting rid of this effect requires a pool of questions, so that it’ll average out.
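To make the averaging-out point concrete, here is a rough simulation sketch. Everything in it is an assumption chosen for illustration, not survey data: everyone states 70% confidence, a question that is "harder than it looks" gives a 50% true chance of being right, one that is "easier than it looks" gives 90%, and the hypothetical helper `hit_rate` treats each person's answer as an independent draw (which real respondents are not).

```python
import random

STATED = 0.70  # confidence every respondent reports (illustrative assumption)

def hit_rate(true_chances, n_people, rng):
    """Share of correct answers when each person answers every question.

    true_chances holds, per question, the real chance that a respondent
    is right; it may differ from the confidence they state.
    """
    hits = 0
    for chance in true_chances:
        hits += sum(rng.random() < chance for _ in range(n_people))
    return hits / (n_people * len(true_chances))

rng = random.Random(1)
hard_only = hit_rate([0.50], 5000, rng)          # single question, harder than it looks
easy_only = hit_rate([0.90], 5000, rng)          # single question, easier than it looks
pooled = hit_rate([0.50, 0.90] * 10, 5000, rng)  # mixed pool of 20 questions

print(f"stated {STATED:.2f} | hard only {hard_only:.2f} | "
      f"easy only {easy_only:.2f} | pooled {pooled:.2f}")
```

Under these assumptions the single hard question makes the group look overconfident, the single easy one makes it look underconfident, and only the pooled figure lands near the stated 70%.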
Yep. (Or as Yvain suggests, give a question which is likely to be answered with a bias in a particular direction.)
It’s not clear what you can conclude from the fact that 17% of all people who answered a single question at 50% confidence got it right, but you can’t conclude from it that if you asked one of these people a hundred binary questions and they answered “yes” at 50% confidence, that person would only get 17% right. The latter is what would deserve to be called “atrocious”; I don’t believe the adjective applies to the results observed in the survey.
I’m not even sure that you can draw the conclusion “not everyone in the sample is perfectly calibrated” from these results. Well, the people who were 100% sure they were wrong, and happened to be correct, are definitely not perfectly calibrated; but I’m not sure what we can say of the rest.
I have often pondered this problem with respect to some of the traditional heuristics and biases studies, e.g. the “above-average driver” effect. If people consult their experiences of subjective difficulty at doing a task, and then guess they are above average for the ones that feel easy, and below average for the ones that feel hard, this will to some degree track their actual particular strengths and weaknesses. Plausibly a heuristic along these lines gives overall better predictions than guessing “I am average” about everything.
However, if we focus in on activities that happen to be unusually easy-feeling or hard-feeling in general, then we can make the heuristics look bad by showing only their failures and not their successes. That said, the name “heuristics and biases” does reflect this notion: we have heuristics because they usually work, but they produce biases in some cases as an acceptable loss.
I would agree that this explains the apparently atrocious calibration. It’s worth an edit to the main post. No reason to beat ourselves up needlessly.
People were answering different questions in the sense that they each had an interval of their own choosing to assign a probability to, but obviously different people’s performance here was going to be strongly correlated. Bayes just happens to be the kind of guy who was born surprisingly early. If everyone had literally been asked to assign a probability to the exact same proposition, like “Bayes was born before 1750” or “this coin will come up heads”, that would have been a more extreme case. We’d have found that events that people predicted with probability x% actually happened either 0% or 100% of the time, and it wouldn’t mean people were infinitely badly calibrated.
All of that also applies to the year calibration questions in previous surveys and yet people did much better in those.
Because they weren’t about events that occurred surprisingly early.
Yes, and this is probably worth an edit to the original post. For a more extreme example, consider what would happen if you asked a large group of people to assess the probability that the same coin would come up heads. You’d find that events that people said would happen 50% of the time happened either 0% or 100% of the time, but it would be wrong to conclude they were atrociously calibrated.
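The coin example can be sketched in a few lines. This is only an illustration under stated assumptions: a fair coin, everyone stating exactly 50% for heads, and two hypothetical helper functions (`shared_flip_hit_rate`, `separate_flips_hit_rate`) invented here for the comparison.

```python
import random

def shared_flip_hit_rate(n_people, rng):
    """Everyone states 50% for heads on ONE shared flip."""
    heads = rng.random() < 0.5   # a single outcome shared by the whole group
    hits = [heads] * n_people    # every forecast is scored against that one flip
    return sum(hits) / n_people

def separate_flips_hit_rate(n_flips, rng):
    """One person states 50% for heads on many independent flips."""
    hits = [rng.random() < 0.5 for _ in range(n_flips)]
    return sum(hits) / n_flips

rng = random.Random(0)
print(shared_flip_hit_rate(1000, rng))     # always exactly 0.0 or 1.0
print(separate_flips_hit_rate(10000, rng)) # close to 0.5
```

The 50% forecast is equally reasonable in both cases; only the second setup gives the forecaster a chance to show it.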