This significantly increases the number of samples in SWE-Bench Verified, and the number of test cases grows exponentially as the AI itself contributes to the benchmark’s expansion.
“We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork...”—https://arxiv.org/abs/2502.12115