Questions about a topic that I don’t know about result in me just putting the max entropy distribution on that question, which is fine if it’s rare, but leads to unhelpful results if they make up a large proportion of all the questions. Most calibration tests I found pulled from generic trivia categories such as sports, politics, celebrities, science, and geography. I didn’t find many that were domain-specific, so that might be a good area to focus on.
Some of them don’t tell me what the right answers are at the end, or even which questions I got wrong, which I found unsatisfying. If there’s a question that I marked as 95% and got wrong, I’d like to know what it was so that I can look into that topic further.
It’s easiest to get people to answer small numbers of questions (<50), but that leads to a lot of noise in the results. A perfectly calibrated human answering 25 questions at 70% confidence could easily get 80% or 60% of them right and show up as miscalibrated. Incorporating statistical techniques to account for that would be good. (For example, model the number of correct answers at a given confidence level as a binomial variable, calculate its standard deviation for that number of questions, and only tell the user that they’re over- or under-confident if their score falls outside that band.) The fifth one in my list above does something neat where they say “Your chance of being well calibrated, relative to the null hypothesis, is X percent”. I’m not sure how that’s calculated though.
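The standard-deviation idea could be sketched roughly like this. The function name, the two-standard-deviation threshold, and the return labels are all my own choices here, not taken from any of the tests mentioned above:

```python
import math

def calibration_check(n_questions, confidence, n_correct, z=2.0):
    """Check whether a score at one confidence level is consistent with
    perfect calibration, treating the number of correct answers as a
    binomial(n, p=confidence) random variable."""
    expected = n_questions * confidence
    # Standard deviation of the number correct under the binomial model.
    sd = math.sqrt(n_questions * confidence * (1 - confidence))
    # z=2 keeps roughly 95% of well-calibrated users inside the band,
    # so they aren't falsely flagged a third of the time (as z=1 would).
    if n_correct > expected + z * sd:
        return "underconfident"  # scored better than claimed
    if n_correct < expected - z * sd:
        return "overconfident"   # scored worse than claimed
    return "consistent with good calibration"

# 25 questions at 70% confidence: expected 17.5 correct, sd ~= 2.29.
# Getting 20/25 (80%) right stays inside the two-sd band, so a
# well-calibrated person who got a bit lucky isn't flagged.
print(calibration_check(25, 0.70, 20))
```

With 25 questions at 70%, both the 60% and 80% outcomes from the example above land within two standard deviations, which is exactly why small tests can’t distinguish luck from miscalibration.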