Thomas Kwa comments on Thomas Kwa’s Shortform

Thomas Kwa 19 Jun 2025 5:05 UTC
LW: 10 AF: 2
New graph with better data, though the formatting is still wonky. Colleagues say it reminds them of a subway map.
Using individual question data from Epoch and adjusting for human success rate (adjusted task length = avg human time / human success rate), AIME looks closer to the others, and it’s clear that GPQA Diamond has saturated.
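For concreteness, here is a minimal sketch of that adjustment in Python. The function name, the choice of minutes as the unit, and the example numbers are all hypothetical and only illustrate the formula; they are not taken from the Epoch data or from the code behind the graph.

```python
def adjusted_task_length(avg_human_minutes: float, human_success_rate: float) -> float:
    """Adjusted task length = avg human time / human success rate.

    A question that humans solve only some of the time is treated as
    longer than its raw average human time suggests.
    """
    # Assumption: success rate is a fraction in (0, 1]; zero would make
    # the adjusted length undefined.
    if not 0 < human_success_rate <= 1:
        raise ValueError("human_success_rate must be in (0, 1]")
    return avg_human_minutes / human_success_rate


# Made-up example: a 10-minute question that humans get right half the time
# counts as a 20-minute task after the adjustment.
example = adjusted_task_length(10, 0.5)
```

With a success rate of 1 the adjusted length equals the raw human time, so the adjustment only stretches questions that humans sometimes fail.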