You’re probably overestimating how well you understand Dunning-Kruger
Content Note: Trolling
I
The popular conception of Dunning-Kruger is something along the lines of “some people are too dumb to know they’re dumb, and end up thinking they’re smarter than smart people”. This version is popularized in endless articles and videos, as well as in graphs like the one below.
Except that’s wrong.
II
The canonical Dunning-Kruger graph looks like this:
Notice that all the dots are in the right order: being bad at something doesn’t make you think you’re good at it, and at worst damages your ability to notice exactly how incompetent you are. The actual findings of Professors Dunning and Kruger are more consistent with “people are biased to think they’re moderately above-average, and update away from that bias based on their competence or lack thereof, but they don’t update hard enough”. This results in people in the bottom decile thinking “I might actually be slightly below-average”, and people in the top percentile thinking “I might actually be in the top 10%”, but there’s no point where the slope inverts.
Except that’s wrong.
III
I didn’t technically lie to you, for what it’s worth. I said it’s what the canonical Dunning-Kruger graph looks like, and it is.
However, the graph in the previous section was the result of a simulation I coded in a dozen lines of Python, using the following ruleset:
Elves generate M + 1d20 − 1d20 units of mana in a day.
M varies between elves.
If you ask an elf how much mana they’ll generate, they’ll consistently say M+5; a slight overestimate, the size of which is the same across all values of M.
I asked my elves what they expected to output, grouped them by decile of actual output, and plotted their predictions vs their actual output: the result was a perfect D-K graph.
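For concreteness, here’s a minimal sketch of that kind of simulation (not the post’s original code; the number of elves and the distribution of M are my own assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_elves = 100_000                        # assumption: enough elves for smooth deciles
M = rng.normal(50, 10, n_elves)          # assumption: baseline ability varies between elves
actual = M + rng.integers(1, 21, n_elves) - rng.integers(1, 21, n_elves)  # M + 1d20 - 1d20
predicted = M + 5                        # every elf consistently overestimates by 5

# Group elves by decile of *actual* output, then average within each decile
decile = np.argsort(np.argsort(actual)) * 10 // n_elves
actual_means = [actual[decile == d].mean() for d in range(10)]
predicted_means = [predicted[decile == d].mean() for d in range(10)]

plt.plot(range(1, 11), actual_means, marker="o", label="actual output")
plt.plot(range(1, 11), predicted_means, marker="o", label="predicted output")
plt.xlabel("decile of actual output")
plt.ylabel("mana")
plt.legend()
plt.show()
```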
If you don’t already know how this happened, I invite you to pause and consider for five minutes before revealing the answer.
The quantiles are ranked by performance post-hoc, so elves who got lucky on this test will be overrepresented in the higher deciles, and elves who got unlucky will be overrepresented in the lower deciles. (Yes, this is another Leakage thing.)
You can see the same effect even more simply with Christmas Elves, who don’t systematically overestimate themselves: when collected in quantiles, it looks like the competent ones are underconfident and the incompetent ones are overconfident, even though we can see from the code that they all perfectly predict their own average performance.
And, just to hammer the point home, you can also see it in a simulated study of some perfectly-calibrated people’s perceived vs actual guessing-whether-a-fair-coin-will-land-heads ability.
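A hedged sketch of that coin-flip version (the number of guessers and flips per guesser are my assumptions, not the post’s): everyone predicts exactly 50%, yet bucketing by realized score still manufactures the familiar pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_flips = 100_000, 20                           # assumptions: population size and test length
actual = rng.binomial(n_flips, 0.5, n_people) / n_flips   # fraction of correct guesses
predicted = np.full(n_people, 0.5)                        # perfectly calibrated: everyone expects 50%

# Bucket by quartile of realized score (post-hoc, as in the D-K studies)
quartile = np.argsort(np.argsort(actual)) * 4 // n_people
for q in range(4):
    mask = quartile == q
    print(f"quartile {q + 1}: predicted {predicted[mask].mean():.2f}, actual {actual[mask].mean():.2f}")
```

The bottom quartile comes out looking “overconfident” and the top quartile “underconfident”, despite every simulated guesser being perfectly calibrated.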
The original Dunning-Kruger paper doesn’t correct for this, and neither do most of its replications. Conversely, a recent and heavily-cited study which does correct for this finds no statistically significant residual Dunning-Kruger effects post-correction. So the thing that’s actually going on is “people are slightly overconfident; distinctly, there’s a statistical mirage that causes psychologists to incorrectly believe incompetence causes overconfidence; there’s no such thing as Dunning-Kruger”.
Except that’s wrong.
IV
. . . or, at least, incomplete. To start with, the specific study I linked you to has some pretty egregious errors, which are pulled apart here.
But even if it were planned and executed perfectly, “no statistically significant residual effects” is more a fact about sample size than reality: everything in a field as impure as Psychology correlates except for the things which anti-correlate, so you’ll eventually get p<0.05 (or p<0.0005, or whatever threshold you like) from any two variables you care to measure if you just study a large enough group of people.
But even if the study were a knockdown proof that D-K effects had a negligible or negative aggregate impact . . . “some people are too dumb to know they’re dumb, and end up thinking they’re smarter than smart people” is just obviously true. It’s obviously true because it’s an assertion which
A) begins with “some people”,
B) describes human minds, and
C) doesn’t break the laws of physics, biology, causality, or information theory.
(There are over eight billion of us now. It’s pretty hard to come up with possible things some people’s minds don’t do.)
The relevant question – especially for someone aiming to become more rational – isn’t “is this real?”, but “what’s the effect size?”, “how does it vary across populations?”, “how can I tell if it affects me?” and “is there another effect pushing in the opposite direction?”.
This all applies to the non-pop-sci version too. “I don’t know much about this, so I’ll falsely assume I understand a larger fraction of it than I do” is something you can probably recall having personal experience of, but so is “I don’t know much about this, so I’ll falsely assume the parts I don’t understand are incredibly impressive witchcraft to which it would be hubris for me to aspire”, and so is “I don’t know much about this, so I’ll falsely assume the parts I don’t understand are coherent and useful”; I can testify that I for one have been on both sides of all three of these at some point in my life.
Anyway, to sum up, my actual opinion: “there may or may not be a Dunning-Kruger effect in aggregate over any given group, but the original Dunning-Kruger paper and most of its replications make systematic statistical errors which render them useless; the original and pop-sci D-K effects are obviously true for some of the population some of the time but the same is true of any coherent psychology hypothesis including their exact opposites; miscalibration about competence still seems worth trying to fix but you’d need to check which mistake is being made.”
Except that’s wrong.
. . . probably, somehow, at least a little. I don’t know what specific mistake(s) I made, and look forward to finding out in the comments. I’m very confident in my statistical and epistemic arguments, but I’m painfully aware the non-simulated object-level sources for this post were a handful of internet articles I read plus two papers I skimmed. Caveat lector.
. . . unless I’m wrong about being wrong?
Spencer Greenberg (@spencerg) & Belen Cobeta at ClearerThinking.org have a more thorough and well-researched discussion at: Study Report: Is the Dunning-Kruger Effect real? (Also, their slightly-shorter blog post summary.)
This OP would mostly correspond to what ClearerThinking calls “noisy test of skill”. But ClearerThinking also goes through various other statistical artifacts impacting Dunning-Kruger studies, plus some of their own data analysis. Here’s (part of) their upshot:
It’s really counterproductive to do things like present a graph and then say “Except that’s wrong.” + “I didn’t technically lie to you, for what it’s worth. I said it’s what the canonical Dunning-Kruger graph looks like, and it is.”
I just don’t want to further read a post using these sort of tricks.
I have the opposite experience. It delights me and I enjoy digging in deeper.
People are different!
Me too, but that’s because I appreciate being “caught red-handed” believing what I’m reading. I see it as a favor done me by the author.
If you weren’t already in the mindset that you need to practice weighing the information you’re given, and calibrating by guessing at the answer before reading the real one, I suppose it could be annoying.
Scott Alexander uses this style sometimes, and I like it. However, he tends to do it once per essay. I think that can work very well. Here, though, after I hit the “that’s wrong” multiple times, it started to feel like nothing in the essay was worth trying to understand, since I expected what I was reading to later be proclaimed wrong. (Just my own feeling.)
Yes. The fact that this post is precisely about trying to deconfuse a pre-existing misconception makes it even more important to be crystal clear. It’s known to be hard to overwrite pre-existing misconceptions with the correct understanding, and I’m pretty sure this doesn’t help.
This seems like the sort of thing best addressed by me adding a warning / attention-conservation-notice at the start of the article, though I’m not sure what would be appropriate. “Content Note: Trolling”?
ETA: This comment has been up for 24 hours and it has positive agreement karma and no-one’s suggested a better warning to use, so I’m doing the thing. Hopefully this helps?
I fully understand how this format could be frustrating to some people. I personally loved it because each new step/graph made sense and taught me something that helped me understand the next one. There was such a feeling of invested discovery in reading this post that it led to me reading it a second time.
Some writing styles don’t work for some people, but this one really worked for me.
Agreed. One should state the main finding in a TL;DR/abstract, or else I’ll ask ChatGPT to write one for me.
Critique about essay structure and communication style:
Walking through statistical issues with papers on the D-K effect is an interesting topic. However, most readers (me included) aren’t intimately familiar with the details of this literature. Because of this, the format of this essay, which relies on being tricky over and over again, makes it really hard for me to parse or learn about the D-K effect. This is an important issue, because you’re writing a blog post to critique peer-reviewed literature, so you have to do extra work to build credibility by convincingly walking us through the flawed reasoning. Because of the “tricky” style of the essay, it simultaneously provokes me to want to understand the issue better and yet makes me not want to use this essay as a resource for building that understanding.
It seems like the essay is possibly meant less to be an explainer and more to string together links and reactions to the literature on this topic. However, it’s not obvious that that’s the case, and the essay contains a lot of very bold, definitive judgments on what’s right and wrong in the literature. These judgments are easy to grasp and remember, but the oblique reasoning process is not. Over time, I’ve come to really put my guard up against essays like this, because I don’t want my brain to get polluted by an aura of confident judgmental rhetoric with no true understanding of the underlying issue.
There’s no particular reason I’m flagging the issue on this essay in particular, BTW, but this is my primary reaction to it.
The original D-K papers also found different curves for different subject matter. And they made the unusual choice of dividing their populations into quartiles, throwing away quite a bit of resolution. What’s up with that?
I’ve no idea, but I think you should collaborate with someone named Duenning to find out.
I can think of several explanations for this, all of which ~~might be true~~ are definitely at least a little true:
Some subjects have higher variance in performance, resulting in steeper D-K curves.
Some subjects have higher variance in test-ability-to-measure-performance, again resulting in steeper D-K curves.
An actual D-K effect does exist, sometimes, superposed over the statistical mirage; and it’s stronger for some subjects than others.
An anti-D-K effect exists, and it’s stronger for some subjects than others.
Something else is happening I don’t know about.
Doesn’t seem unusual to me ( . . . or suspicious, if that’s what you’re getting at). I get away with using deciles at my day job because I work on large datasets with low-variance data, and I get away with it here because I can just add zeroes to the number of elves simulated until my plots look as smooth as I want; Dunning & Kruger had a much smaller sample since they were studying college classes full of real round-eared human beings, and sensibly chose to bucket them into fewer buckets.
I interpret imprecise colloquial statements like these not as weak claims about the existence of such people, but as the bolder claim that the effect is of practical importance in the challenges faced by many people.
Sample size and population effect size both factor into the likelihood of obtaining a statistically significant result. So this is a fact about both the experiment and the population, not just the experiment alone.
I’d say sample size is more important, since any experiment can reach statistical significance with the right sample size, but not every sample size can reach significance even with the right experiment. But you’re right, I overstated my case; amended; thank you.
Thanks for listening. I still think that this is a misleading statement.
If we are considering empirical experiments, then our approximation of samples as being iid and sampled with replacement may break down at relatively small sample sizes, invalidating fundamental assumptions of common statistical significance tests.
If we are considering a mathematical model of a random experiment, then when the null hypothesis is true, the probability of a Type I error remains fixed at the chosen level of alpha no matter the sample size.
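For what it’s worth, here’s a quick sketch of that claim, assuming independent normal samples and a two-sample t-test (the sample sizes and simulation count are arbitrary choices of mine):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_sims = 0.05, 5000

# With a true null (both groups drawn from the same distribution),
# the rejection rate stays near alpha regardless of sample size.
for n in (20, 200, 2000):
    rejections = sum(
        stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue < alpha
        for _ in range(n_sims)
    )
    print(f"n = {n}: Type I error rate ≈ {rejections / n_sims:.3f}")
```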
Really interesting post, I learned a lot. I always assumed D-K came from a comparison of top experts and the general population on a super long test.
Short description of the studies in the paper
To those who haven’t seen it[1]: In the first 3 studies the subjects were 65, 45 and 84 undergraduate students from Cornell answering 30 questions on evaluating humor, 20 questions on logic, or 20 questions on grammar, respectively. They were also asked to estimate their performance among their peers, with the mentioned results. The third study had a phase two, where bottom- and top-quartile students would come back to grade a sample of 5 tests from their peers (with the same mean and SD as the whole group, which they were informed of) and then give a new estimate of their own performance.
I agree with your demonstration for the first studies. They were small questionnaires and could be explained by chance (not going to calculate the odds). After doing the third test (20 questions on grammar), the bottom quartile predicted they would get 12.9 questions correct and got 9.2. The top quartile predicted 16.9 and got 16.4. This increases the likelihood that it’s more than chance, but it still could be that the bottom quartile just realized they didn’t know the answers, and that in another sample they would get a better result. Maybe the top quartile included guesses in their estimate, and in another sample they would trade places with people in the third quartile who predicted a similar result but had worse guesses.
However, I believe the results of the second phase of study 3 can be extrapolated even if the first-phase result is by chance, and it has more important consequences. When grading the tests of 5 representative peers, the bottom quartile did a worse job, consistent with having a worse answer key (their own answers). And, not being able to correctly evaluate their peers, they increased the estimate of their own score percentile from p60.5 to p65.4, when it was actually p10.1, even after seeing other people’s answers. The top quartile did a better evaluation of their peers’ answers and increased their self-evaluation from p69.5 to p79.7 (actual was p88.7).
So, I wouldn’t say the limitations of the test render it useless. It appears to overestimate the error in assessment of absolute performance, or of performance in the topic in general, as you showed. But it independently demonstrated how, using only their flawed answers, people with wrong conceptions overestimate their relative performance, even after seeing the performance of others.
An abstraction of the experiment is to give people random answers to 20 questions, with different confidence values assigned to each answer. Then give them another 5 random test answer sheets (without confidence values on their answers) and ask them to rank the tests. Without knowing a priori which tests are the best performing, I assume it’s not possible to rank them well starting from a bad first test, and that the results will be similar to the paper’s, given a similar distribution of answers. If true, this is a natural consequence of the setup, bound to always happen.
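A rough sketch of that abstraction (not the paper’s actual procedure; binary questions, a fixed peer accuracy of 0.6, and the grading-error metric are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_questions, n_peers, n_trials = 20, 5, 2000
truth = rng.integers(0, 2, n_questions)             # the real answer key (assumed binary items)

def grading_error(own_accuracy):
    """Grade peers' sheets using your own answers as the key; return mean score error."""
    own = np.where(rng.random(n_questions) < own_accuracy, truth, 1 - truth)
    peers = np.where(rng.random((n_peers, n_questions)) < 0.6, truth, 1 - truth)
    true_scores = (peers == truth).sum(axis=1)       # what the peers actually scored
    graded_scores = (peers == own).sum(axis=1)       # what this grader would give them
    return np.abs(graded_scores - true_scores).mean()

for acc in (0.9, 0.5):
    errors = [grading_error(acc) for _ in range(n_trials)]
    print(f"grader accuracy {acc}: mean grading error {np.mean(errors):.2f} questions")
```

The worse a grader’s own answer key, the worse their grading of peers, which seems to be the mechanism the second phase of study 3 is picking up.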
Mandatory comment when talking about overconfidence: This seems true to me. Please show my mistakes so I can improve.
[1] “Unskilled and Unaware of It” (Kruger and Dunning, 1999)
I didn’t know about the Dunning-Kruger effect. It’s interesting.
It may be considered a bias, but in some sense it is not strictly irrational to be overconfident when you lack knowledge and have no way to measure the immensity of your ignorance. You have no frame of reference; you don’t realize that the space of possibilities is vast. I can imagine a Cro-Magnon very confident about his understanding of the world. He listened to the Old Man when he was young; he knows his classics.
It’s only when you accumulate knowledge that you begin to realize how ignorant you were before without even noticing, like most people. This experience provides strong evidence that you should update your beliefs in favor of greater caution and humility. This is the essence of Socrates’ and Montaigne’s famous teachings.
The topic is cool but the argumentation is confusing. Here’s an AI version:
...
This paper examines common misconceptions about the Dunning-Kruger effect and reveals statistical flaws in the original research.
The popular understanding of Dunning-Kruger is that incompetent people are so incompetent they think they’re actually better than experts—creating a curve where confidence peaks at low skill levels then drops before rising again. However, the actual Dunning-Kruger research shows something different: people at all skill levels tend to think they’re slightly above average. Those at the bottom still recognize they’re below average (just not how far below), and those at the top recognize they’re above average (just not how far above). The curve never actually inverts.
But even this finding appears to be a statistical artifact. When you group people by their performance on a test after the fact, random variation means the “low performers” group contains people who got unlucky, while the “high performers” group contains people who got lucky. This creates an illusion where low performers appear overconfident (they predicted their average ability but happened to underperform) and high performers appear underconfident (they predicted their average ability but happened to overperform). The author demonstrates this by simulating various scenarios—including one with perfectly calibrated people guessing coin flips—that all produce the classic Dunning-Kruger graph pattern despite having no actual overconfidence bias.
Studies that correct for this statistical issue find little to no Dunning-Kruger effect. However, the author argues that while the aggregate effect may be a mirage, individual instances of “being too incompetent to recognize your incompetence” obviously do occur sometimes, as do the opposite cases of underestimating your abilities. The key insight is that miscalibration about competence can go in either direction depending on the person and context, and the original research’s statistical methods don’t actually tell us which direction is more common or by how much.