This seems mostly right to me and I would appreciate such an effort.
One nitpick:
> The reason I think this would be good is because SWE-bench is probably the closest thing we have to a measure of how good LLMs are at software engineering and AI R&D-related tasks.
I expect the benchmark landscape to improve over time and that SWE-bench won’t be our best fixed benchmark in a year or two. (SWE-bench is only about six months old at this point!)
Also, I think if we put aside fixed benchmarks, we have other reasonable measures.
> I expect us to reach a level where at least 40% of the ML research workflow can be automated by the time we saturate (reach 90%) on SWE-bench. I think we’ll be comfortably inside takeoff by that point (software progress at least 2.5x faster than right now). I wonder if you share this impression?
It seems super non-obvious to me when SWE-bench saturates relative to ML automation. I think the SWE-bench task distribution is very different from the ML research workflow in a variety of ways.
Also, I think that human expert performance on SWE-bench is well below 90% if you use the exact rules they use in the paper. I messaged you explaining why I think this. The TL;DR: test cases often seem to be implementation-dependent, and the current rules from the paper don’t allow looking at the test cases.
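To make the implementation-dependence point concrete, here is a made-up minimal sketch in the style of a SWE-bench hidden test (not an actual task from the benchmark; the function names and the test are hypothetical). Two patches can both fix the underlying bug, yet only the one that happens to match the exact exception type and message the hidden test expects gets scored as correct, and under the paper's rules the solver never sees that test:

```python
# Hypothetical illustration (not from SWE-bench): two semantically
# reasonable fixes for the same bug -- rejecting negative buffer sizes --
# that differ only in incidental choices of exception type and message.
import pytest


def make_buffer_fix_a(size: int) -> bytearray:
    """Candidate patch A: reject negative sizes with ValueError."""
    if size < 0:
        raise ValueError("size must be non-negative")
    return bytearray(size)


def make_buffer_fix_b(size: int) -> bytearray:
    """Candidate patch B: same behavioral fix, different exception and message."""
    if size < 0:
        raise TypeError("negative size not allowed")
    return bytearray(size)


def test_negative_size_rejected():
    # A hidden test like this pins down the exception type AND the message
    # wording, so patch A passes while the equally "correct" patch B fails.
    with pytest.raises(ValueError, match="must be non-negative"):
        make_buffer_fix_a(-1)
```

A human expert who can't look at the test has no reliable way to know which of these incidental choices the grader will accept, which is why I think the ceiling under the paper's exact rules sits well below 90%.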