Devin can sometimes (13.8% of the time?!) do actual real jobs on Upwork with nothing but a prompt to ‘figure it out.’
You imply in this post that SWE-bench corresponds to jobs on Upwork. This is incorrect: SWE-bench consists of issue and pull request pairs from 12 Python repos.