I think the problem here is that with many trivia questions you either know the answer or you don’t; the dominant factor in my results so far is that I either have no answer in mind, assign 0 probability to being right, and am correctly calibrated there, while all of my answers at other levels of certainty have turned out right so far, so my calibration curve looks almost rectangular.
I might just be getting accurate information that I’m drastically underconfident, but I think this might be one of the worst types of questions to calibrate on. I mean, even if the problem is just that I’m drastically underconfident on trivia questions and shouldn’t be assigning less than 50% probability to any answer I actually have in mind, that seems sufficiently unrepresentative of most areas where you need calibration, and of how most people perform on other calibration tests, for this to be a pretty bad measure of calibration.
Perhaps it would be better as a multiple-choice test, so that possible answers, which may or may not be right, are raised to attention and one can assign probabilities to them?
My favorite calibration tools have been ones where there was a numerical answer and you had to express a 50% confidence interval and a 90% confidence interval.
Like, a question would be: how many stairs are there in the Statue of Liberty? My 50% interval would be 400-1000, and my 90% interval would be 200-5000.
Looking up the answer (354), I would mark my 50% interval as wrong and my 90% interval as right.
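As a sketch of how such an interval quiz can be scored (the records below are made-up examples, not from any real tool): tally how often the true answer lands inside each interval, and check whether the hit rates come out near 50% and 90%.

```python
# Scoring sketch for a confidence-interval calibration quiz.
# Each record is ((50% interval), (90% interval), true answer);
# the numbers are invented examples, not real quiz data.

records = [
    ((400, 1000), (200, 5000), 354),   # the Statue of Liberty example
    ((50, 150), (20, 400), 93),
    ((1000, 3000), (500, 8000), 6371),
]

# Count how often the truth falls inside each interval.
hits50 = sum(lo <= truth <= hi for (lo, hi), _, truth in records)
hits90 = sum(lo <= truth <= hi for _, (lo, hi), truth in records)

n = len(records)
print(f"50% intervals caught the answer {hits50}/{n} times ({hits50 / n:.0%})")
print(f"90% intervals caught the answer {hits90}/{n} times ({hits90 / n:.0%})")
# Well calibrated means those rates hover near 50% and 90% respectively.
```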
0% probability is my most common answer as well, but I’m using it less often than I was choosing 50% on the CFAR calibration app (which forces a binary choice rather than an open-ended answer). The CFAR app has lots of questions like “Which of these two teams won the Super Bowl in 1978?” where I just have no idea. The trivia database Nanashi is using has, for me, a greater proportion of questions on which my credence is something more interesting than an ignorance prior.
That’s a fair criticism, but if we’re going down this road we’ve also got to recognize the limitations of a multiple-choice calibration test. Both styles suffer from the “you know it or you don’t” dichotomy. If these questions were all multiple choice, you’d still have the same rectangular-shaped graph; it would just start at 50% (assuming a binary choice) instead of 0%.
The big difference is the kind of solution set each style represents. There are plenty of situations in life where there are a few specific courses of action to choose from, but there are also plenty where that’s not the case.
But I will say that a multiple-choice test definitely yields a “pretty” calibration curve much faster than an open-ended test. You’ve got a smaller range of values, and the format lets you more confidently rule out one answer or the other, so the curve will smooth out faster, whereas the open-ended version will be pretty bottom-heavy for a while.
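For concreteness, here’s one common way such a curve gets computed (the (confidence, correct) pairs are invented stand-ins for real answers): bin the answers by stated confidence, then compare each bin’s stated confidence to its actual hit rate. A know-it-or-don’t answer set piles almost everything into the bottom bin, which is the bottom-heavy shape described above.

```python
# Sketch of computing a calibration curve: bin answers by stated
# confidence, then compare stated confidence to hit rate per bin.
# The (confidence, correct) pairs are invented for illustration.

from collections import defaultdict

answers = [(0.0, False), (0.0, False), (0.0, False), (0.0, True),
           (0.7, True), (0.9, True), (1.0, True)]

bins = defaultdict(list)
for confidence, correct in answers:
    bins[round(confidence, 1)].append(correct)   # 0.1-wide bins

for stated in sorted(bins):
    outcomes = bins[stated]
    hit_rate = sum(outcomes) / len(outcomes)
    print(f"stated ~{stated:.0%}: actual {hit_rate:.0%} ({len(outcomes)} answers)")
```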
“I think the problem here is that with many trivia questions you either know the answer or you don’t”
That means that for those questions most probabilities are either close to 0 or close to 1, which suggests that, given this set of questions, it would be a good idea to increase “resolution” near those two points. For that purpose, perhaps instead of asking for confidence levels expressed as percentages you could ask for confidence levels expressed as odds or log odds. For example, users could express their confidence as odds of 2^n:1, for n = k, …, 0, …, -k.
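Here’s a minimal sketch of that input scheme as I read it (the function name and the range of n are mine, not part of the proposal): the user picks an integer n, states odds of 2^n:1 in favor of their answer, and the tool converts to a probability via p = odds / (odds + 1).

```python
# Sketch of the proposed odds-based confidence input: the user picks an
# integer n and states odds of 2^n : 1 in favor of their answer.
# Probability relates to odds by p = odds / (odds + 1).

def odds_to_prob(n: int) -> float:
    odds = 2.0 ** n
    return odds / (odds + 1.0)

for n in range(-5, 6):
    print(f"n = {n:+d}  odds 2^{n}:1  ->  p = {odds_to_prob(n):.4f}")
```

Spacing the steps evenly in log odds is exactly what buys resolution near 0 and 1: each step of n moves the estimate by a constant amount of evidence rather than a constant number of percentage points.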
That’s an interesting thought, but I suspect you’d have to answer a lot of questions to see any difference whatsoever. If you’re perfectly calibrated and answer 100 questions at either 99.99% or 99.9% confidence, there’s a very good chance you’ll get all 100 right, regardless of which confidence level you pick.
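To put numbers on that: assuming independent questions, a perfectly calibrated answerer gets all 100 right with probability 0.999^100 ≈ 0.905 at the 99.9% level and 0.9999^100 ≈ 0.990 at the 99.99% level, so the two levels are close to indistinguishable over 100 questions. A quick check:

```python
# Quick check: at very high confidence levels, 100 questions rarely
# produce even one miss, so nearby levels can't be told apart.

for p in (0.999, 0.9999):
    prob_all_right = p ** 100          # chance of zero misses in 100
    expected_misses = (1 - p) * 100    # average number of misses
    print(f"confidence {p:.2%}: P(100/100 right) = {prob_all_right:.3f}, "
          f"expected misses = {expected_misses:.2f}")
```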