This seems mostly right to me and I would appreciate such an effort.
One nitpick:
> The reason I think this would be good is because SWE-bench is probably the closest thing we have to a measure of how good LLMs are at software engineering and AI R&D-related tasks.
I expect the benchmark landscape to improve over time and that SWE-bench won’t be our best fixed benchmark in a year or two. (SWE-bench is only about six months old at this point!)
Also, I think if we put aside fixed benchmarks, we have other reasonable measures.
> I expect us to reach a level where at least 40% of the ML research workflow can be automated by the time we saturate (reach 90%) on SWE-bench. I think we’ll be comfortably inside takeoff by that point (software progress at least 2.5x faster than right now). I wonder if you share this impression?
It seems super non-obvious to me when SWE-bench saturates relative to ML automation. I think the SWE-bench task distribution is very different from the ML research workflow in a variety of ways.
Also, I think that human expert performance on SWE-bench is well below 90% if you use the exact rules they use in the paper. I messaged you explaining why I think this. The TL;DR: test cases often seem to be implementation-dependent, and the current rules from the paper don’t allow looking at the test cases.
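To make the implementation-dependence point concrete, here is a made-up minimal sketch in the style of a SWE-bench hidden test (not an actual task from the benchmark; the function names and the test are hypothetical). Two patches can both fix the underlying bug, yet only the one that happens to match the exact exception type and message the hidden test expects gets scored as correct, and under the paper's rules the solver never sees that test:

```python
# Hypothetical illustration (not from SWE-bench): two semantically
# reasonable fixes for the same bug -- rejecting negative buffer sizes --
# that differ only in incidental choices of exception type and message.
import pytest


def make_buffer_fix_a(size: int) -> bytearray:
    """Candidate patch A: reject negative sizes with ValueError."""
    if size < 0:
        raise ValueError("size must be non-negative")
    return bytearray(size)


def make_buffer_fix_b(size: int) -> bytearray:
    """Candidate patch B: same behavioral fix, different exception and message."""
    if size < 0:
        raise TypeError("negative size not allowed")
    return bytearray(size)


def test_negative_size_rejected():
    # A hidden test like this pins down the exception type AND the message
    # wording, so patch A passes while the equally "correct" patch B fails.
    with pytest.raises(ValueError, match="must be non-negative"):
        make_buffer_fix_a(-1)
```

A human expert who can't look at the test has no reliable way to know which of these incidental choices the grader will accept, which is why I think the ceiling under the paper's exact rules sits well below 90%.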