I can see the need for anonymity to avoid spoilers, but I think making mistake reports publicly has benefits too—that way there’s a risk on the other side: having publicly denounced the Great Teacher when he was speaking truthfully.
You could have points subtracted privately, which gives you the same incentive not to make uncertain accusations. Attach confidence levels and take the Bayes-score.
Since the Bayes-score is always negative, I don’t see what incentive one would have to submit a mistake report at all. It would be better to test for, say, better than 90% confidence by awarding 1 point for a correct report and deducting 9 points for an incorrect one—submitting then has positive expected value exactly when your confidence exceeds 90%. This achieves the goal of detecting the ability to detect bad arguments. Measuring calibration would have to be a separate test.
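A minimal sketch of the two scoring rules being compared (the function names and the 1/9 point values are just the example figures from above, not anything canonical): the logarithmic Bayes-score is never positive, so silence always weakly beats reporting, while the asymmetric reward/penalty rule makes reporting worthwhile precisely above the 90% threshold.

```python
import math

def log_score(p: float, correct: bool) -> float:
    """Bayes (logarithmic) score: log of the probability you assigned
    to the outcome that actually occurred. Always <= 0."""
    return math.log(p if correct else 1 - p)

def expected_threshold_score(p: float, reward: float = 1.0,
                             penalty: float = 9.0) -> float:
    """Expected score of submitting a report you hold with probability p,
    under the +reward / -penalty rule."""
    return p * reward - (1 - p) * penalty

# Break-even point: p * 1 - (1 - p) * 9 = 0  =>  p = 9/10.
print(expected_threshold_score(0.95))  # positive: worth reporting
print(expected_threshold_score(0.80))  # negative: stay silent
print(log_score(0.99, True))           # still below zero, like all log-scores
```

So under the log rule the best achievable score for a report is zero (certainty that turns out correct), which is no better than not reporting; the +1/−9 rule fixes the incentive at the cost of no longer measuring calibration.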
Treat not submitting a mistake report as the “I have no idea” claim: that you’ve assigned a probability of “mistakes/total emails” to this particular email being a mistake.