Thomas Kwa comments on METR: Measuring AI Ability to Complete Long Tasks

Thomas Kwa 21 Mar 2025 2:01 UTC
6 points
Regarding points 1 and 2, I basically agree that SWAA doesn’t provide much independent signal. We made SWAA because models before GPT-4 got ~0% on HCAST, so we needed shorter tasks to measure their time horizon. Point 3 is definitely a concern, and we’re currently collecting data on open-source PRs to get a more representative sample of long tasks.