Calibration Test with database of 150,000+ questions
Hi all,
I put this calibration test together this morning. It pulls from a trivia API of over 150,000 questions so you should be able to take this many, many times before you start seeing repeats.
http://www.2pih.com/caltest.php
A few notes:
1. The questions are “Jeopardy” style questions so the wording may be strange, and some of them might be impossible to answer without further context. On these just assign 0% confidence.
2. As the questions are open-ended, there is no answer-checking mechanism. You have to be honest with yourself as to whether or not you got the right answer. Because what would be the point of cheating at a calibration test?
I can’t think of anything else. Please let me know if there are any features you would want to see added, or if there are any bugs, issues, etc.
**EDIT**
As per suggestion I have moved this to the main section. Here are the changes I’ll be making soon:
Label the axes and include an explanation of calibration curves.
Make it so you can reverse your last selection in the event of a misclick.
Here are changes I’ll make eventually:
Create an account system so you can store your results online.
Move trivia DB over to my own server to allow for flagging of bad/unanswerable questions.
Here are the changes that are done:
Change 0% to 0.1% and 99% to 99.9%
Added second graph which shows the frequency of your confidence selections.
Color code the “right” and “wrong” buttons and make them farther apart to prevent misclicks.
Store your results locally so that you can see your calibration over time.
Check to see if a question is blank and skip if so.
I think the problem here is that with many trivia questions you either know the answer or you don’t. The dominant factor in my results so far is that I either have no answer in mind, assign 0 probability to my being right, and am correctly calibrated there, or I have an answer, and all of my answers at other levels of certainty have turned out right so far. As a result, my calibration curve looks almost rectangular.
I might just be getting accurate information that I’m drastically underconfident, but I think this might be one of the worse types of questions to calibrate on. I mean, even if the problem is just that I’m drastically underconfident on trivia questions and shouldn’t be assigning less than 50% probability to any of my answers when I have an answer, that sounds sufficiently unrepresentative of most areas where you need calibration, and how most people perform on other calibration tests, for this to be a pretty bad measure of calibration.
Perhaps it would be better as a multiple choice test, so one can have possible answers raised to attention that may or may not be right, and assign probabilities to those?
My favorite calibration tools have been ones where there was a numerical answer and you had to express a 50% confidence interval or a 90% confidence interval.
For example, a question might be “How many stairs are there in the Statue of Liberty?” My 50% interval would be 400-1000, and my 90% interval would be 200-5000.
Looking up the answer it was 354, and I would mark my 50% as wrong and my 90% as right.
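A minimal sketch of how that interval scoring could be automated, using the Statue of Liberty numbers above. The function name `interval_hit` is hypothetical and not taken from any existing calibration tool.

```python
def interval_hit(low, high, truth):
    """Return True if the true value falls inside the stated interval."""
    return low <= truth <= high

truth = 354                      # the looked-up answer from the example
fifty_percent = (400, 1000)      # stated 50% interval
ninety_percent = (200, 5000)     # stated 90% interval

print(interval_hit(*fifty_percent, truth))   # False: 354 is below 400
print(interval_hit(*ninety_percent, truth))  # True: 354 is within 200-5000
```

Over many questions, a well-calibrated answerer would expect roughly half of the 50% intervals and roughly nine in ten of the 90% intervals to score as hits.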
0% probability is my most common answer as well, but I’m using it less often than I was choosing 50% on the CFAR calibration app (which forces a binary answer choice rather than an open-ended answer choice). The CFAR app has lots of questions like “Which of these two teams won the Superbowl in 1978” where I just have no idea. The trivia database Nanashi is using has, for me, a greater proportion of questions on which my credence is something more interesting than an ignorance prior.
That’s a fair criticism, but if we’re going down this road we’ve also gotta recognize the limitations of a multiple choice calibration test. Both styles suffer from the “You know it or you don’t” dichotomy. If these questions were all multiple choice, you’d still have the same rectangular shaped graph, it would just start at 50% (assuming a binary choice) instead of 0%.
The big difference is the solution sets that the different styles represent. There are plenty of situations in life where there are a few specific courses of action to choose from. But there are also plenty of situations where that’s not the case.
But, I will say that a multiple choice test definitely yields a “pretty” calibration curve much faster than an open-ended test. You’ve got a smaller range of values, and the nature of it lets you more confidently rule out one answer or the other. So the curve will be smoother faster. Whereas this will be pretty bottom heavy for a while.
That means that for those questions most probabilities are either close to 0 or close to 1. This suggests that given this set of questions it would probably be a good idea to increase “resolution” near those two points. For that purpose, perhaps instead of asking for confidence levels expressed as percentages you could ask for confidence levels expressed as odds or log odds. For example, users could express their confidence levels using odds expressed as ratios 2^n:1, for n=k,...,0,...,-k.
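To illustrate the 2^n:1 suggestion, here is a small sketch that converts each odds level into the probability it represents. The choice of k = 7 is an arbitrary assumption for illustration, and `odds_to_probability` is a hypothetical helper name.

```python
def odds_to_probability(n):
    """Convert odds of 2**n : 1 in favor into a probability."""
    odds = 2.0 ** n
    return odds / (1.0 + odds)

k = 7  # assumed range; the comment leaves k unspecified
for n in range(k, -k - 1, -1):
    print(f"2^{n}:1 -> {odds_to_probability(n):.4f}")
```

With this scale the steps are evenly spaced in log odds, so the available choices bunch up near 0 and 1 (for example, 2^7:1 is 128/129, a little over 99%) and are sparse around 50%.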
That’s an interesting thought but I do suspect that you’d have to answer a lot of questions to see any difference whatsoever. If you’re perfectly calibrated and answer 100 questions that you are either 99.99% confident or 99.9% confident, there’s a very good chance that you’ll get all 100 questions right, regardless of which confidence level you pick.
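The arithmetic behind that claim can be checked directly. This sketch assumes the 100 questions are independent and that the stated confidence is the true hit rate; `p_all_correct` is a hypothetical name.

```python
def p_all_correct(confidence, n_questions):
    """Chance of getting every question right, assuming independence."""
    return confidence ** n_questions

print(round(p_all_correct(0.999, 100), 3))   # about 0.905
print(round(p_all_correct(0.9999, 100), 3))  # about 0.990
```

Either way a perfect run is the most likely outcome, so 100 questions cannot reliably distinguish the two confidence levels.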
Awesome!
I’ve been dying for something like this after I zoomed through all the questions in the CFAR calibration app.
Notes so far:
The highest-available confidence is 99%, so the lowest-available confidence should be 1% rather than 0%. Or even better, you could add 99.9% and 0.1% as additional options.
So far I’ve come across one question that was blank. It just said Category: jewelry and then had no other text. Somehow the answer was Ernest Hemingway.
Would be great to be able to sign up for an account so I could track my calibration across multiple sessions.
Re: 0%, that’s fair. Originally I included 0% because certain questions are unanswerable (due to being blank, contextless, or whatnot), but even then there’s still a non-zero possibility of guessing the right answer out of a near-infinite number of choices.
Re: Calibration across multiple sessions. Good idea. I’ll start with a local-based solution because that would be easiest and then eventually do an account-based thing.
Re: Blank questions. Yeah, I should probably include some kind of check to see if the question is blank and skip it if so.
Thanks! BTW, I’d prefer to have 1% and 0.1% and 99% and 99.9% as options, rather than skipping over the 1% and 99% options as you have it now.
I considered that, but I think at least for now it may just overcomplicate things for not a ton of benefit. Subjectively it seems that out of 100 questions, there are maybe 10 to which I would assign the highest possible confidence. Of those, I’d say only 1 would be a question where I’d pick 99% confidence if it were available instead of, say, 99.9%.
So assuming (incorrectly) that I’m perfectly calibrated it would take about 7000 questions in order to stand a >50% chance of seeing a meaningful difference between the two confidence levels.
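One plausible reconstruction of that estimate, offered as an assumption rather than the exact calculation used: if about 1 question in 100 is a true 99% question, a difference only shows up once one of them is missed, and the chance of at least one miss first exceeds 50% after 69 such questions.

```python
import math

def questions_needed(p_correct=0.99, one_in=100, target=0.5):
    """Estimate how many total questions are needed before a miss is likely.

    p_correct: true hit rate on the questions in the bin (0.99 here).
    one_in:    assumed spacing, one such question per `one_in` questions.
    target:    desired chance of seeing at least one miss.
    """
    # Smallest n with 1 - p_correct**n > target.
    n_in_bin = math.ceil(math.log(1 - target) / math.log(p_correct))
    return n_in_bin, n_in_bin * one_in

n_in_bin, n_total = questions_needed()
print(n_in_bin, n_total)  # 69 6900
```

That gives 6,900 questions in total, in line with the figure of about 7000.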
It’s possible to be, to some extent, certain that you haven’t thought of a correct answer (if not certain you don’t know the answer), because you don’t have any answer in mind and yet are not considering the answer “this is a trick question” or “there is no correct answer”. Is this something that should be represented, making “0%” correct to include, or am I confused?
I got one blank question, which I think was an error with loading since the answer came up the same as the previous question, and the one after it took a couple seconds to appear on-screen.
I’d prefer not to allow 0 and 1 as available credences. But if 0 remained as an option I would just interpret it as “very close to 0” and then keep using the app, though if a future version of the app showed me my Bayes score then the difference between what the app allows me to choose (0%) and what I’m interpreting 0 to mean (“very close to 0”) could matter.
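A short sketch of why the distinction could matter, assuming “Bayes score” means the logarithmic scoring rule (the log of the probability assigned to what actually happened). The function `log_score` is a hypothetical helper, not part of the app.

```python
import math

def log_score(confidence, was_correct):
    """Log (base 2) of the probability assigned to the actual outcome."""
    p = confidence if was_correct else 1.0 - confidence
    if p == 0.0:
        return float("-inf")  # a literal 0% that turns out right is unrecoverable
    return math.log2(p)

print(log_score(0.0, True))              # -inf
print(round(log_score(0.001, True), 2))  # about -9.97
```

Under this rule, a single lucky guess at a literal 0% yields an infinitely bad score, while the same guess at 0.1% costs about 10 bits.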
I think it’s misleading to just drop in the statement that 0 and 1 are not probabilities.
There is a reasonable and arguably better definition of probabilities which excludes them, but it’s not the standard one, and it also has costs—for example probabilities are a useful tool in building models, and it is sometimes useful to use probabilities 0 and 1 in models.
(aside: it works as a kind of ‘clickbait’ in the original article title, and Eliezer doesn’t actually make such a controversial statement in the post, so I’m not complaining about that)
Fair enough. I’ve edited my original comment.
(For posterity: the text for my original comment’s first hyperlink originally read “0 and 1 are not probabilities”.)
Perfect, thanks!
Thanks for providing this!
I have a worry about using trivia questions for calibration: there’s a substantial selection effect in the construction of trivia questions, so you’re much more likely to get an obscure question pointing to a well-known answer than happens by chance. The effect may be to calibrate people for trivia questions in a way that transfers poorly to other questions.
I think this should be copied/moved to Main. A calibration tool certainly deserves wider circulation.
A nice feature would be the ability to mark a question as unanswerable; once a question gets enough flags, you could review and delete it. I just recently came across a question which asked what can be seen in “this image”. Without any image attached, of course.
It would probably be best to just remove all questions that contain certain key phrases like “this image” or “seen here”. You’ll get a few false positives but with such a big database that’s no great loss.
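A minimal sketch of such a filter. The phrase list contains only the two examples mentioned above, the blank-question check mirrors the one already added to the test, and the names `BANNED_PHRASES` and `is_answerable` are hypothetical.

```python
BANNED_PHRASES = ("this image", "seen here")  # extend as more patterns turn up

def is_answerable(question_text):
    """Reject blank questions and ones that refer to missing media."""
    text = question_text.strip().lower()
    if not text:
        return False
    return not any(phrase in text for phrase in BANNED_PHRASES)

questions = [
    "What can be seen in this image?",
    "",
    "This author wrote 'The Old Man and the Sea'",
]
print([q for q in questions if is_answerable(q)])
```

Plain substring matching will produce some false positives, which is the trade-off accepted above.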
Interesting idea, thanks for doing it, but sadly many questions are very US-centric. It would be nice to have some “tags” on the questions, and let users select which kinds of questions they want (for example, non-US people could remove the US-specific ones).
Should it be specified (or should the answerer tell) whether we are from the USA or not? A lot of questions seem to be very USA-centric, so the confidence can heavily depend on whether we live in the USA or not.
Note, website seems broken now. Still loads, but the questions don’t, and there is only 1 question without an answer.
I made a major update to the interface to make it look prettier. I’ve tested this in Chrome but please let me know if it doesn’t work in any other OS or browser.
I also added Google Analytics so I can see where people are accessing this from.
A link or button to flip your last right/wrong would be nice. I had assigned 0% confidence for one question and accidentally said I got it right. Misclicks aren’t the same as poor calibration.
Also, a little more on how to use it would make sense—the first one or two I did, I thought it was, ‘how confident are you that this assertion is true’ and I thought it was very oddly phrased. Then I realized.
Got it. I’ll make them color coded and farther apart.
I’ll write some better instructions as well.
What would help most is: “Pick an answer. How confident are you that your answer is correct?”
Then, when the user clicks the ‘show answer’ button, make sure that neither of the two new buttons is in the same place.
ALSO, it would be nice if the calibration curve showed the credible interval for each bin, so I can tell at a glance that my getting 1⁄1 right at 30% and 0⁄1 right at 60% isn’t actually that big a hit to my calibration.
And if the second graph was stacked so that I don’t have this giant red bar at 100%, which just looks odd. If it was red behind/on-top-of green, that would make the most sense (if stacked on top, you will obviously need to take the difference to maintain the sense of the graph).
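As a rough illustration of the per-bin interval idea, the sketch below uses a Wilson score interval. Note the substitution: this is a frequentist stand-in chosen because it needs only the standard library, not a Bayesian credible interval, and `wilson_interval` is a hypothetical helper name.

```python
import math

def wilson_interval(right, total, z=1.96):
    """Approximate 95% Wilson score interval for a bin's hit rate."""
    if total == 0:
        return (0.0, 1.0)  # no data: any hit rate is possible
    p_hat = right / total
    denom = 1 + z * z / total
    center = (p_hat + z * z / (2 * total)) / denom
    half = z * math.sqrt(
        p_hat * (1 - p_hat) / total + z * z / (4 * total * total)
    ) / denom
    return (max(0.0, center - half), min(1.0, center + half))

print(wilson_interval(1, 1))  # 1/1 right in the 30% bin
print(wilson_interval(0, 1))  # 0/1 right in the 60% bin
```

With a single question per bin, both intervals are wide enough to contain the stated confidence (30% and 60% respectively), so neither result is strong evidence of miscalibration.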
Do you intend to curate out questions that are impossible/require additional clarifications like Alex would have given in advance or people would have worked out from the easy ones?
Great tool. Does the API you’re using allow unanswerable questions to be flagged at all though? Just got one question that depended on an image that wasn’t there, and another with no question body. Also, labeled axes on the graph might be nice for people who don’t already know how calibration curves work and/or don’t like unlabeled axes.
It does but only if you’re hosting the API database on your own server. Which I will probably do sooner rather than later. I might implement a “skip” option for unanswerable questions. But selecting 0% does the same thing pretty much.
I noticed that if I gave the previous question 60% confidence and the next one should be 70%, I sometimes leave it at 60% out of laziness.
Well, I’m getting a reasonably exciting calibration curve with lots of ups and downs. Cool!
Bug: when I click “Display Calibration Curve” for a second time, the graph is displayed in a larger size. (Doing this sufficiently many times crashed Chrome.) Refreshing the page fixes this behavior.
Feature request: I would like to be able to see if my 50% correctness for 30% confidence is getting 1 out of 2 questions right or 5 out of 10. (Error bars of some sort would also work.)
Good idea. I don’t think the charts API I’m using will let me do error bars but a good alternative would be a secondary chart that’s a bar graph of right vs total questions for each bucket. This would also give a good visual representation of the frequency with which you use various confidence levels.
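A sketch of the tally such a secondary chart would need, assuming results are stored as (confidence, was_correct) pairs. That storage format and the name `tally_by_confidence` are assumptions for illustration, not the app’s actual data model.

```python
from collections import defaultdict

def tally_by_confidence(results):
    """Map each confidence level to a (right, total) pair.

    results: iterable of (confidence_percent, was_correct) pairs.
    """
    buckets = defaultdict(lambda: [0, 0])
    for confidence, was_correct in results:
        buckets[confidence][1] += 1          # every answer counts toward total
        if was_correct:
            buckets[confidence][0] += 1      # only hits count toward right
    return {c: tuple(counts) for c, counts in sorted(buckets.items())}

results = [(30, True), (30, False), (60, False), (99.9, True)]
print(tally_by_confidence(results))
# {30: (1, 2), 60: (0, 1), 99.9: (1, 1)}
```

Dividing right by total in each bucket gives the points of the calibration curve, and the totals show how often each confidence level is used.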
You rock!