I’m going to post it to its own thread when it’s totally complete (the analysis itself is done; I just don’t like the way the sections are organized). You can ask me any questions you have. I’d be very interested in any suggestions to make it better or less confusing.
First, on the Hamilton question, it’s not at all clear that the official answer is the correct one given the question. A very reasonable reading of the question would restrict the scope to items currently in circulation. Looking at the question, even knowing you meant ‘ever’, it feels like an unnatural way of asking. If you clarified, then I at least would have answered higher and with less confidence; and I suspect that almost everyone else would have as well.
Second, there are several questions which had a numeric range—like, say, the Dairy Queen question. Reducing to binary right-wrong seems needlessly lossy. If you convert the confidence of being within 10 years into an expected error on a Gaussian distribution, you can plot the actual deviations vs the expected deviations.
Third, I am trying to figure out if one can do something with the automatic probabilities-to-odds generation scheme discussed… somewhere here. I can’t find it. Basically, you’d pair up people who were right on the same questions and see who would win, betting against each other based on their probabilities in such a way that they each expect to win the same amount. Only works between people who shared answers, though. Should be generalizable to people with similar levels of accuracy, and may be generalizable to people with different levels of accuracy.
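A sketch of how such a bet could be set up (this is my own formalization of the idea, not something from the scheme I half-remember): assuming the more confident of the two takes the “for” side, the stakes can be chosen so that each party’s expected gain, judged by their own stated probability, is equal.

```python
def equal_expectation_bet(p, q, stake_a=1.0):
    """A assigns probability p to the answer, B assigns q, with p > q.
    A bets 'for', B bets 'against'. Returns B's stake such that both
    parties compute the same expected gain under their own beliefs:
        p * stake_b - (1 - p) * stake_a == (1 - q) * stake_a - q * stake_b
    """
    assert p > q, "the more confident person takes the 'for' side"
    return stake_a * (2 - p - q) / (p + q)


def settle(p, q, stake_a, answer_was_correct):
    """Resolve the bet; returns (gain_for_A, gain_for_B)."""
    stake_b = equal_expectation_bet(p, q, stake_a)
    gain_a = stake_b if answer_was_correct else -stake_a
    return gain_a, -gain_a
```

For example, with p = 0.9 against q = 0.6 and A staking 1 unit, B stakes (2 − 1.5)/1.5 = 1/3, and both parties expect to come out 0.2 units ahead by their own lights. As stated, this only works when the two people actually disagree; generalizing it past that is the open part.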
I didn’t design the questions, and those are the official answers. And it does seem correct to me that it should include all bills ever printed, not just those currently being printed.
I’m really not sure how to do your second point. I could fit all the answers to a normal distribution, sure, but what information does that give me about any specific individual? It doesn’t really tell me what their true probability of getting the question correct was, which I can already get from the percentage of people who answered each question correctly.
The third idea is interesting, comparing people who got the same number of answers right. But it still does reward luck and prior knowledge. As I showed, people have indistinguishable probabilities of getting each question right; all that differs is how overconfident or underconfident they are. That model seems to produce the best correlations as well.
Agree with Luke about the Hamilton question. I read it as being about current ones. If it meant “has appeared on”, it should say “has appeared on”, not “appears on”. While the latter can certainly be read as covering all the ones he’s ever appeared on, the more natural interpretation to me means those currently being printed.
You can probably get some idea of the extent to which it was interpreted this way by looking at the sizes of the answers. I’d say that, if we assume people have some idea how American currency works, then 0–1 probably indicates a “present” interpretation, 3 or more will almost always indicate an “ever” interpretation, and 2 could go either way. But that is assuming people have some idea of how American currency works.
1) Really?

“Alexander Hamilton appears on how many distinct denominations of US Currency?”

It’s the present tense that throws me. I would expect the question to be ‘has appeared’. Whatever.
2) That’s not what I meant. I mean that you can turn each individual person’s prediction/probability pair into a Gaussian curve, centered on their answer, with width chosen so that a 10-year window contains that much probability. You can then use that to get the probability this distribution—and thus, by proxy, the respondent—assigns to the actual year.
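Concretely, the construction above could look something like this (a minimal sketch, assuming the “within 10 years” confidence is symmetric about the guess):

```python
from statistics import NormalDist


def implied_sigma(confidence, window=10.0):
    """Width of the Gaussian such that +/-window around the guess
    contains `confidence` of the probability mass."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # half-width in sd units
    return window / z


def prob_assigned_to_year(guess, confidence, actual, window=10.0):
    """Probability mass the respondent's implied Gaussian puts on the
    one-year bin around the actual year."""
    d = NormalDist(mu=guess, sigma=implied_sigma(confidence, window))
    return d.cdf(actual + 0.5) - d.cdf(actual - 0.5)


def standardized_error(guess, confidence, actual, window=10.0):
    """Actual deviation in units of the respondent's own expected error;
    these are the values to plot against the expected distribution."""
    return (actual - guess) / implied_sigma(confidence, window)
```

For instance, a 50% confidence of being within 10 years implies a sigma of about 14.8 years, so someone who is then off by 15 years missed by roughly one of their own standard deviations.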
3) On such a small data set you can’t get rid of luck, let alone differences in knowledge. I think that picking out people who got the same number correct does a pretty good job of de-confounding that. It cuts sideways across the bins in the ‘mean % correct’ vs ‘mean % confidence’ graph, which showed flat performance across confidence, in a way you can’t do straightforwardly otherwise.
Is there a thread for the calibration question analysis? I have some questions and comments about that, more than this.