METR found that 0/4 of the PRs which passed test cases and which they reviewed were acceptable to merge. This was for Claude 3.7 Sonnet on large open-source repos with default infrastructure.
The rate at which PRs passed test cases was also low, but if you're focused on whether a PR is viable to merge conditional on passing test cases, the "0/4" number is the one you want. (A 0/4 result is still consistent with, say, a 10% or even a 35% chance of PRs being mergeable conditional on passing test cases; the sample size here is small.)
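To make the small-sample point concrete, here's a quick check (my own illustration, not from METR): under a simple binomial model where each reviewed PR is independently mergeable with probability p, the chance of seeing 0 mergeable PRs out of 4 is still substantial even for fairly high p.

```python
# Probability of observing 0 mergeable PRs out of 4 reviewed,
# assuming each PR is independently mergeable with probability p.
def prob_zero_of_four(p: float) -> float:
    return (1 - p) ** 4

for p in (0.10, 0.35):
    print(f"p = {p:.2f}: P(0/4 mergeable) = {prob_zero_of_four(p):.3f}")
```

Even if the true mergeable-given-passing rate were 35%, you'd still see 0/4 about 18% of the time, so the observation doesn't rule that rate out.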
I don't think this is much evidence that AI can't sometimes write acceptable PRs in general, and there are examples of AIs doing this. On small projects I've worked on, even much older AIs have written a big chunk of the code roughly zero-shot. Anecdotally, I've heard of people having success with AIs completing tasks zero-shot. I don't know what you mean by "PR" that doesn't include this.
I'm sure that they can in some weak sense, e.g. on a sufficiently small project with sufficiently low standards, but as far as I remember the METR study found zero acceptable PRs in their context.
I think I already answered this:

> METR found that 0/4 of the PRs which passed test cases and which they reviewed were acceptable to merge. This was for Claude 3.7 Sonnet on large open-source repos with default infrastructure.
>
> The rate at which PRs passed test cases was also low, but if you're focused on whether a PR is viable to merge conditional on passing test cases, the "0/4" number is the one you want. (A 0/4 result is still consistent with, say, a 10% or even a 35% chance of PRs being mergeable conditional on passing test cases; the sample size here is small.)
>
> I don't think this is much evidence that AI can't sometimes write acceptable PRs in general, and there are examples of AIs doing this. On small projects I've worked on, even much older AIs have written a big chunk of the code roughly zero-shot. Anecdotally, I've heard of people having success with AIs completing tasks zero-shot. I don't know what you mean by "PR" that doesn't include this.