Re: the imperfection of benchmarks, there is reason to believe SWE-bench scores have improved due to data contamination rather than pure model improvement (see "The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason"). SWE-bench itself was released in October 2023, well before the knowledge cutoff dates of most current frontier models.
Possibly better alternatives would be SWE-bench-Live (https://swe-bench-live.github.io/, based on GitHub tasks) or LiveCodeBench (https://livecodebench.github.io/leaderboard.html, based on LeetCode/AtCoder/Codeforces problems). Interestingly, these two benchmarks show wildly different results for the top contenders: 19.26% vs. 80.2%. This seems to be mostly because only the latter uses reasoning models, but not only that: "real" in-context SWE tasks appear to be harder to solve than isolated coding problems.
I don't have evidence for this, but it does seem to me that AI companies are aware of the issue and keep citing SWE-bench to sustain their marketing hype. Data contamination is a well-known problem, and it would be a glaring oversight to assume it would not affect such an "old" and public benchmark.