I decided to take a look at overconfidence (rather than calibration) on the 10 calibration questions.
For each person, I added up the probabilities that they assigned to getting each of the 10 questions correct, and then subtracted the number of correct answers. Positive numbers indicate overconfidence (fewer correct answers than they predicted they’d get), negative numbers indicate underconfidence (more correct answers than they predicted). Note that this is somewhat different from calibration: you could get a good score on this if you put 40% on each question and get 40% of them right (showing no ability to distinguish between what you know and what you don’t), or if you put 99% on the ones you get wrong and 1% on the ones you get right. But this overconfidence score is easy to calculate, has a nice distribution, and is informative about the general tendency to be overconfident.
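As a rough sketch of the scoring rule just described (the function and variable names are made up for illustration, and probabilities are written as fractions rather than percentages):

```python
def overconfidence_score(probabilities, correct):
    """Sum of stated probabilities minus the number of correct answers.

    probabilities: one value per question, as fractions (0.4 means 40%).
    correct: one boolean per question.
    Positive means overconfident, negative means underconfident.
    """
    expected_correct = sum(probabilities)
    actual_correct = sum(1 for c in correct if c)
    return expected_correct - actual_correct


# 40% on every question and 4 of 10 right: a score of 0 with no
# ability to tell known answers from unknown ones.
flat = overconfidence_score([0.4] * 10, [True] * 4 + [False] * 6)

# 99% on the five misses and 1% on the five hits: also a score of 0.
inverted = overconfidence_score(
    [0.99] * 5 + [0.01] * 5,
    [False] * 5 + [True] * 5,
)

# 90% on every question but only 6 right: overconfident by 3 questions.
overconfident = overconfidence_score([0.9] * 10, [True] * 6 + [False] * 4)
```

Both of the first two cases come out at zero (up to floating point rounding), which is why this score measures overall over- or underconfidence rather than calibration.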
After cleaning up the data set in a few ways (which I’ll describe in a reply to this comment), the average overconfidence score was 0.39. On average, people expected to get 4.79 of the 10 questions correct, but only got 4.40 correct. My impression is that this gap (about 4 percentage points per question) is smallish compared to what overconfidence research tends to find, but I don’t have any numbers at hand to make direct comparisons with the numbers in the published literature.
People were most overconfident on questions 6 (densest planet: 18% correct, 35% average estimate) and 10 (bestselling video game: 7% correct, 22% average estimate), and most underconfident on questions 4 (Norse god: 87% correct, 75% average estimate) and 2 (Obama’s state: 82% correct, 71% average estimate).
Overconfidence was correlated with a few other variables at p<.01 (correlation coefficient, with n in parentheses):
SATscoresoutof2400 -.185 (242)
SATscoresoutof1600 -.160 (329)
IQ -.157 (368)
PCryonics .116 (1112)
MinimumWage .086 (1055)
That is, people who were more overconfident had lower test scores, assigned a higher probability to cryonics working, and were more in favor of raising the minimum wage. On PCryonics, I think my comments about the cryonics questions on the 2011 Survey are related to what’s going on.
Overconfidence had no significant relationships with any of the other numerical variables, including the various other probability estimates and political views, age, finger ratio, or charitable donations. It was also uncorrelated with scales measuring growth mindset and self-efficacy.
When I turned them into numbers, various measures of ties to the LW community were correlated with overconfidence in the expected direction (closer ties to LW --> less overconfident), but not at p < .01 (perhaps in part because they weren’t really intended to be continuous variables). So I combined several questions about ties to LW into a simple composite variable where a person gets one point each for: having read the sequences, having joined the community before 2010, having at least 1000 karma, having read all of HPMOR, having attended a full CFAR workshop, having posted in main, regularly attending meetups, regularly interacting with LWers in person, and having a romantic partner that they met through LW. This composite variable (which ranged from 0 to 8) correlated with overconfidence at r = -.085, p < .01. In other words: people with closer ties to LW were less overconfident.
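A minimal sketch of how a composite like this could be built and correlated with the overconfidence score. The field names and the four toy respondents are invented for illustration; they are not the survey's column names or data.

```python
import math

# One point per item; these keys are hypothetical stand-ins for the
# nine survey questions listed above.
TIE_ITEMS = [
    "read_sequences", "joined_before_2010", "karma_1000_plus",
    "read_all_hpmor", "attended_cfar", "posted_in_main",
    "attends_meetups", "in_person_interaction", "lw_romantic_partner",
]


def lw_ties(respondent):
    """Count how many of the tie items are true for this respondent."""
    return sum(1 for item in TIE_ITEMS if respondent.get(item, False))


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Toy rows only, to show the shape of the calculation.
toy = [
    {"read_sequences": True, "karma_1000_plus": True,
     "attended_cfar": True, "overconfidence": 0.1},
    {"read_all_hpmor": True, "overconfidence": 0.5},
    {"overconfidence": 0.9},
    {"read_sequences": True, "read_all_hpmor": True, "overconfidence": 0.2},
]

ties = [lw_ties(r) for r in toy]
scores = [r["overconfidence"] for r in toy]
r = pearson_r(ties, scores)  # negative here: more ties, less overconfidence
```

The toy rows are arranged so that r comes out negative. The real data gave a much weaker relationship (r = -.085).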
But it’s probably more informative to compare means on these variables, instead of turning them into an ad hoc continuous variable. Here is the average overconfidence score among various subgroups (where the full sample was overconfident by 0.39 questions out of ten, etc.):
Everyone 0.39 (4.79 pred − 4.40 actual) (n=1141)
Read HPMOR 0.35 (4.70 pred − 4.35 actual) (n=753)
Active in-person 0.26 (4.55 pred − 4.29 actual) (n=171)
Read the sequences 0.23 (4.69 pred − 4.46 actual) (n=357)
Attended CFAR 0.15 (4.42 pred − 4.27 actual) (n=91)
High test scores 0.15 (5.14 pred − 4.99 actual) (n=260)
1000 karma 0.14 (4.98 pred − 4.83 actual) (n=127)
(The active in-person group includes everyone who answered yes/regularly/all the time to any of the 3 questions: in-person interaction, attending meetups, or LW romantic partner. The high test scores group includes anyone who was in the top 25% of reported scores on any one of the 4 test score questions: IQ (146+), SAT out of 1600 (1560+), SAT out of 2400 (2330+), or ACT (35+).)
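The two group definitions in that parenthetical could be written roughly as follows. The cutoffs and the accepted answers come from the text; the dictionary keys are hypothetical, and case-insensitive matching of the answers is my assumption.

```python
# Top-25% cutoffs quoted above; a respondent qualifies on any one of them.
HIGH_SCORE_CUTOFFS = {
    "iq": 146,
    "sat_1600": 1560,
    "sat_2400": 2330,
    "act": 35,
}

# Answers that count as "active" on any of the three in-person questions.
ACTIVE_ANSWERS = {"yes", "regularly", "all the time"}
IN_PERSON_FIELDS = (
    "in_person_interaction", "attends_meetups", "lw_romantic_partner",
)


def has_high_test_score(respondent):
    """True if any reported score meets its cutoff; blanks are ignored."""
    for field, cutoff in HIGH_SCORE_CUTOFFS.items():
        value = respondent.get(field)
        if value is not None and value >= cutoff:
            return True
    return False


def is_active_in_person(respondent):
    """True if any of the three in-person questions got an 'active' answer."""
    return any(
        str(respondent.get(field, "")).strip().lower() in ACTIVE_ANSWERS
        for field in IN_PERSON_FIELDS
    )
```

Because membership only requires clearing one cutoff, someone who reported several test scores has more chances to land in the high test scores group than someone who reported one.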
Compared to the full sample (which was overconfident by 0.39 questions), there was less than half as much overconfidence among people who attended CFAR, have 1000 karma, or have high test scores. Other indicators of LW involvement were also associated with less overconfidence, though with smaller effect sizes.
Note that being overconfident by 0.14 questions is a small enough gap to be accounted for entirely by a single one of the 10 questions. If we remove the video game question, for example, then the people with 1000+ karma are within 0.01 question of being neither overconfident nor underconfident. So these results are consistent with the 1000+ karma group being perfectly calibrated (although they still count as some evidence in favor of that group being a bit overconfident).
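To get a feel for how much a single question can move the score, here is a small sketch using the full-sample per-question figures quoted earlier (the per-question figures for the 1000+ karma group aren't given here, so this uses the full sample only; the function name is made up):

```python
def question_contribution(mean_estimate_pct, pct_correct):
    """One question's contribution to the average overconfidence score,
    in units of questions (average estimate minus fraction correct)."""
    return (mean_estimate_pct - pct_correct) / 100


# (average estimate %, % correct) for the four questions quoted above.
quoted = {
    "Q6 densest planet": (35, 18),
    "Q10 bestselling video game": (22, 7),
    "Q4 Norse god": (75, 87),
    "Q2 Obama's state": (71, 82),
}

contributions = {
    name: question_contribution(estimate, correct)
    for name, (estimate, correct) in quoted.items()
}

for name, value in contributions.items():
    print(f"{name}: {value:+.2f}")
```

In the full sample, the video game question alone contributes about 0.15 questions of overconfidence, which is roughly the size of the entire gap for the 1000+ karma group.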
In summary: LWers show some overconfidence, probably less overconfidence than in the published literature, and there’s less overconfidence among those with close ties to LW (e.g., high karma or CFAR alumni) or with high test scores. Pretty similar to what I found for other biases on the 2012 LW Survey.
Details on data cleanup:
In the publicly available data set, I restricted my analysis to people who:
entered a number on each of the 10 calibration probability estimates
did not enter any estimates larger than 100
entered at least one estimate larger than 1
entered something on each of the 10 calibration guesses
did not enter a number for any of the 10 calibration guesses
Failure to meet any of these criteria generally indicated either a failure to understand the format of the calibration questions, or a decision to skip one or more of the questions. Each of these criteria eliminated at least 1 person, leaving a sample of 1141 people.
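Those five inclusion criteria amount to something like the filter below. This is a sketch, not the code I actually ran: it assumes each respondent's raw entries arrive as two lists of 10 strings (with blanks as empty strings or None), and the function names are invented.

```python
def is_number(text):
    """True if the raw entry parses as a number."""
    try:
        float(text)
        return True
    except (TypeError, ValueError):
        return False


def keep_respondent(estimates, guesses):
    """Apply the five inclusion criteria to one respondent.

    estimates: 10 raw probability entries (strings or None).
    guesses: 10 raw answer entries (strings or None).
    """
    if len(estimates) != 10 or len(guesses) != 10:
        return False
    # 1. A number on every probability estimate.
    if not all(is_number(e) for e in estimates):
        return False
    values = [float(e) for e in estimates]
    # 2. No estimate larger than 100.
    if any(v > 100 for v in values):
        return False
    # 3. At least one estimate larger than 1.
    if not any(v > 1 for v in values):
        return False
    # 4. Something entered for every guess.
    if any(g is None or not str(g).strip() for g in guesses):
        return False
    # 5. No guess that is just a number.
    if any(is_number(g) for g in guesses):
        return False
    return True
```

The third check presumably catches people who answered on a 0 to 1 scale instead of 0 to 100, though that reading is my guess at the intent.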
I counted as “correct”:
any answer which Scott/Ozy counted as correct
any answer to question 1 (largest bone) which began with “fem” (e.g., “femer”)
any answer to question 2 (Obama’s state) which began with “haw” (e.g., “Hawii”)
any answer to question 4 (Norse god) which began with “od” or “wo” (e.g., “Wotan”)
any answer to question 8 (cell) which began with “mito” (e.g., “Mitochondira”)
These seem to cover the most common misspellings (or alternate names, e.g. “Wotan” is the German name for Odin), while counting very few obviously wrong answers as correct, and without having to go through every answer one by one. Counting these answers gave the average participant another 0.15 correct answers, and I suspect we could add another 0.05 or so by going through answer by answer with lenient standards. The mitochondria leniency made the largest difference, adding 97 correct answers.
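The lenient scoring rule is essentially a prefix match layered on top of the official marking. A minimal sketch, assuming the match ignores case and leading whitespace (the examples above are capitalized, so that seems to be the intent) and using made-up names:

```python
# Accepted prefixes per question number, as listed above.
LENIENT_PREFIXES = {
    1: ("fem",),        # largest bone
    2: ("haw",),        # Obama's state
    4: ("od", "wo"),    # Norse god
    8: ("mito",),       # cell
}


def is_correct(question, answer, officially_correct=False):
    """Count an answer as correct if Scott/Ozy marked it correct,
    or if it starts with one of the accepted prefixes."""
    if officially_correct:
        return True
    prefixes = LENIENT_PREFIXES.get(question, ())
    # startswith accepts a tuple; an empty tuple never matches.
    return answer.strip().lower().startswith(prefixes)
```

For example, `is_correct(8, "Mitochondira")` and `is_correct(4, "Wotan")` both pass, while `is_correct(4, "Thor")` does not. A prefix rule this loose would also accept a wrong answer that happens to start with the same letters, which is the trade-off against checking every answer by hand.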
Without counting these additional correct answers, the average overconfidence score would have been 0.54 among the full sample, 0.40 among sequence readers, 0.32 among CFAR alumni, 0.40 among active-in-person LWers, 0.30 among those with 1000 karma, and 0.23 among those with high test scores. Counting these additional correct answers helped non-US LWers more than US LWers (by 0.21 questions vs. 0.11); I suspect that part of that is due to spelling difficulties for non-native speakers and part is due to the Odin vs. Wotan thing.
And here’s an analysis of calibration.
If a person were perfectly calibrated, then each 10 percentage point increase in their probability estimate would translate into a 10 percentage point higher likelihood of getting the answer correct. If you plot probability estimates on the x axis and whether or not the event happened on the y axis, then you should get a slope of 1 (the line y=x). But people tend to be miscalibrated: out of the questions where they say “90%”, they might only get 70% correct. This results in a shallower slope, less than 1 (in this example, the line would go through the point (90,70) instead of (90,90)).
I took the 1141 people’s answers to the 10 calibration questions as 11410 data points, plotted them on an x-y graph (with the probability estimate as the x value and a y value of 100 if it’s correct and 0 if it’s incorrect), and ran an ordinary linear regression to find the slope of the line fit to all 11410 data points.
That line had a slope of 0.91. In other words, if an LWer gave a probability estimate that was 10 percentage points higher, then on average the claim was 9.1 percentage points more likely to be true. Not perfect calibration, but not bad.
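Here is a small sketch of that regression on toy data, with the slope computed by hand from the usual least-squares formula. The function names and the two toy answer sets are invented; only the setup (x = estimate in percent, y = 100 if correct, 0 if not) follows the description above.

```python
def calibration_points(estimates_pct, correct):
    """Turn (estimate, right/wrong) pairs into regression points."""
    xs = list(estimates_pct)
    ys = [100 if c else 0 for c in correct]
    return xs, ys


def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Toy case 1: ten answers at 90% (9 right) and ten at 10% (1 right).
well_calibrated = calibration_points(
    [90] * 10 + [10] * 10,
    [True] * 9 + [False] + [True] + [False] * 9,
)

# Toy case 2: ten answers at 90% (only 7 right) and ten at 10% (1 right),
# so the line passes through (90, 70) instead of (90, 90).
overconfident = calibration_points(
    [90] * 10 + [10] * 10,
    [True] * 7 + [False] * 3 + [True] + [False] * 9,
)

slope_good = ols_slope(*well_calibrated)   # 1.0
slope_over = ols_slope(*overconfident)     # 0.75
```

With only two distinct x values the fitted line passes through the average outcome at each one, so the second case gives (70 - 10) / (90 - 10) = 0.75. The real data has many distinct estimates, but the slope is read the same way.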
If we look at various subsets of LWers on the survey, here are the slopes that we get:
0.91 Everyone
0.92 Read HPMOR
0.92 1000 karma
0.93 High test scores
0.93 Read the sequences
0.96 Active in-person
0.96 Attended CFAR
I haven’t done any tests of statistical significance, but all of these more LWy subgroups do have slopes that are higher (and closer to the well-calibrated slope of 1) than the slope for the full sample (as do the people with high scores on SAT/ACT/IQ tests).