So after an enjoyable evening of coding up some heuristics and then having very little clue how to combine them and translate them into probabilities, I realized that my only chance to win was to hope that the data set was in some way easy, meaning that most participants would get almost everything “right” and the winner might be determined by who was overconfident enough.
Don’t get me wrong, my heuristics didn’t perform all that well, but I do wonder how much of the “overconfidence” we see is actual miscalibration versus deliberate strategy. If you think your discrimination is working really well, you probably want to gamble that it’s working better than everyone else’s; but if you think it’s not working so well, it does seem like the only chance you have of winning is overconfidence plus luck.
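To make the gamble concrete, here’s a quick Monte Carlo sketch with entirely made-up numbers (Brier-style scoring where the lowest total penalty wins, which is not necessarily how this contest was scored): one contestant reports 0.99 on every string while an equally accurate field honestly reports 0.9, and we look at how the overconfident contestant’s win rate changes with the number of strings.

```python
import numpy as np

rng = np.random.default_rng(0)

def win_rates(n_strings, n_contestants=10, acc=0.9, q_honest=0.9, q_over=0.99,
              n_trials=10_000):
    """Win rates under Brier scoring (lowest total penalty wins, ties split evenly).

    Contestant 0 reports q_over on every string; everyone else reports q_honest.
    Every contestant is actually right on each string with probability acc,
    independently.  All of these numbers are made up for illustration.
    Returns (overconfident contestant's win rate, one honest contestant's win rate).
    """
    reported = np.full(n_contestants, q_honest)
    reported[0] = q_over
    over_wins = honest_wins = 0.0
    for _ in range(n_trials):
        # how many strings each contestant got right this trial
        n_correct = (rng.random((n_contestants, n_strings)) < acc).sum(axis=1)
        # Brier penalty: (1 - q)^2 per correct answer, q^2 per wrong answer
        penalty = n_correct * (1 - reported) ** 2 + (n_strings - n_correct) * reported ** 2
        winners = np.flatnonzero(penalty == penalty.min())
        share = 1.0 / len(winners)
        if 0 in winners:
            over_wins += share
        if 1 in winners:
            honest_wins += share
    return over_wins / n_trials, honest_wins / n_trials

for n in (10, 50, 200):
    print(n, win_rates(n))
```

With only a handful of strings, the exaggeration pays off whenever you happen to get everything right; with hundreds of strings, it mostly just compounds your misses and an honest contestant wins more often.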
For what it’s worth, the top three finishers were three of the four most calibrated contestants! With this many strings, I think being intentionally overconfident is a bad strategy. (I agree it would make sense if there were like 10 or 20 strings.)
I think it depends a lot more on the number of strings you get wrong than on the total number of strings, so I think GuySrinivasan has a good point that deliberate overconfidence would be viable if the dataset were easy. I was thinking the same thing at the start, but gave it up when it became clear my heuristics weren’t giving enough information.
My own theory, though, was that most overconfidence wasn’t deliberate but simply came from people not thinking through how much information they were actually getting from apparent non-randomness (i.e. the way I compared my results to what would be expected by chance).
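To illustrate the kind of calculation I mean (with a completely made-up model of how humans fake randomness, and treating the strings as bit strings): if you think people alternate characters more often than a fair coin would, the amount of alternation in a string gives you a likelihood ratio that you can turn directly into a probability instead of eyeballing it.

```python
def p_fake(s, alt_rate=0.6, prior=0.5):
    """Posterior probability that a bit string is human-generated ("fake"),
    based only on how often adjacent characters differ.

    Made-up model: a truly random string flips a fair coin for each character,
    so adjacent characters differ with probability 0.5; a human-generated string
    alternates with probability alt_rate instead.  Both alt_rate and the prior
    are placeholders, not fitted to anything.
    """
    n_pairs = len(s) - 1
    k = sum(a != b for a, b in zip(s, s[1:]))      # number of alternations
    # Likelihood ratio P(string | fake) / P(string | random); the combinatorial
    # factors cancel because both models treat adjacent pairs as independent.
    lr = (alt_rate ** k * (1 - alt_rate) ** (n_pairs - k)) / 0.5 ** n_pairs
    odds = lr * prior / (1 - prior)
    return odds / (1 + odds)

print(p_fake("0101101011010110"))  # heavy alternation -> leans fake (~0.82)
print(p_fake("0011000111001000"))  # more clumping -> leans random (~0.29)
```

The 0.6 alternation rate and the 50/50 prior are placeholders; the point is just that “looks non-random” can be cashed out as a specific amount of evidence rather than a vague hunch.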
Yep, this is indeed a reason proper scoring rules don’t remain proper if 1) you only have a small sample of questions, and 2) the utility of winning is not linear in the points you obtain (for example, if you care much more about being in the top 3 than about any particular number of points).
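For concreteness, here’s the per-question properness that breaks down (a minimal sketch assuming Brier scoring): your expected penalty on a single question is minimized by reporting exactly your true belief, but the probability that your total score lands you in the top 3 generally isn’t.

```python
import numpy as np

p = 0.7                              # your true belief that the answer is "yes"
q = np.linspace(0.01, 0.99, 99)      # the probability you could report

# Expected Brier penalty for reporting q when the outcome is "yes" with probability p:
# with probability p the penalty is (1 - q)^2, otherwise it is q^2.
expected_penalty = p * (1 - q) ** 2 + (1 - p) * q ** 2

print(q[np.argmin(expected_penalty)])  # 0.7: honest reporting minimizes expected penalty
```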
Some people have debated whether this kind of strategic overconfidence was happening in the Good Judgment tournaments. If so, that might explain why extremizing algorithms improved performance. (Though I recall not being convinced that it was actually happening there.) When Metaculus ran its crypto competition a few years ago, they also did some analysis to check whether this phenomenon was present, but they couldn’t detect it.
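(For anyone unfamiliar: extremizing just pushes the pooled forecast away from 0.5. One common form raises the odds to a fixed power, roughly like the sketch below, where in practice the exponent is fitted to data rather than the made-up value used here.)

```python
def extremize(p, a=2.0):
    """Push an aggregate probability p away from 0.5 by raising its odds to the power a.

    a > 1 extremizes, a = 1 is a no-op.  The value a=2.0 is made up for illustration.
    """
    return p ** a / (p ** a + (1 - p) ** a)

print(extremize(0.7))  # ~0.84
```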
I enjoyed this! I performed really poorly!