A school that reports to each student their class ranking easily games this test. The test could even favor schools that don’t teach students enough to question an arbitrary class rank.
Also, this doesn’t consider the possibility that students may be good rationalists but not interact with enough of the other students to assess their relative strengths accurately.
Since the Bayes-score is always negative, I don’t see what incentive one would have to submit a mistake report. It would be better to test for better than, say, 90% confidence by awarding 1 point for a correct report and deducting 9 points for an incorrect one. This achieves the goal of detecting the ability to spot bad arguments. Measuring calibration would have to be a separate test.
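A quick expected-value check of the 1-point / 9-point scheme (a sketch using only the numbers proposed above; the function names are mine) shows why it sets the confidence bar at 90%: reporting has positive expected score exactly when your probability of being right exceeds penalty / (reward + penalty), while the Bayes-score, log(p), is negative for any p below 1.

```python
import math

def bayes_score(p):
    """Log score for assigning probability p to the true outcome; negative for p < 1."""
    return math.log(p)

def expected_report_score(p, reward=1, penalty=9):
    """Expected points for submitting a mistake report you hold with probability p."""
    return p * reward - (1 - p) * penalty

# Break-even confidence: penalty / (reward + penalty) = 9 / 10 = 0.9
assert abs(expected_report_score(0.9)) < 1e-9   # exactly break-even at 90%
assert expected_report_score(0.95) > 0          # worth reporting above 90%
assert expected_report_score(0.80) < 0          # not worth reporting below 90%
assert bayes_score(0.95) < 0                    # log score never rewards a report
```

So under the log score every report costs points, whereas the asymmetric scheme makes reporting profitable precisely when confidence exceeds 90%.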