Regarding 1 and 2, I basically agree that SWAA doesn’t provide much independent signal. The reason we made SWAA was that models before GPT-4 got ~0% on HCAST, so we needed shorter tasks to measure their time horizon. Point 3 is definitely a concern, and we’re currently collecting data on open-source PRs to get a more representative sample of long tasks.