I think if this were right, LLMs would already be useful for software engineering and able to make acceptable PRs.
I also guess that the level of agency you need to actually beat Pokémon is probably somewhere around 4 hours.
We’ll see who’s right—bet against me if you haven’t already! Though maybe it’s not a good deal anymore. I can see it going either way.
They are sometimes able to make acceptable PRs, usually when context gathering for the purpose of iteratively building up a model of the relevant code is not a required part of generating said PR.
I think LLMs can be useful for software engineering and can sometimes write acceptable PRs. (I’ve very clearly seen both of these first hand.) Maybe you meant something slightly weaker, like “AIs would be able to write acceptable PRs at a rate of >1/10 on large open source repos”? I think this is already probably true, at least with some scaffolding and inference time compute. Note that METR’s recent results were on 3.7 sonnet.
I’m referring to METR’s recent results. Can you point to any positive results on LLMs writing acceptable PRs? I’m sure that they can in some weak sense e.g. a sufficiently small project with sufficiently low standards, but as far as I remember the METR study concluded zero acceptable PRs in their context.
METR found that 0/4 of the PRs which passed test cases and which they reviewed were also acceptable to merge. This was for 3.7 sonnet on large open source repos with default infrastructure.
The rate at which PRs passed test cases was also low, but if you’re focusing on whether the PR is viable to merge conditional on passing test cases, the “0/4” number is what you want. (And this is consistent with 10%, or with some chance even 35%, of PRs being mergeable conditional on passing test cases; we don’t have a very large sample size here.)
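To make the small-sample point concrete, here is a minimal sketch, assuming each reviewed PR is an independent trial (my simplifying assumption, not anything from METR’s analysis):

```python
# How likely is an observed 0/4 under various hypothetical true merge rates?
for true_rate in (0.10, 0.25, 0.35):
    p_zero_of_four = (1 - true_rate) ** 4
    print(f"true merge rate {true_rate:.0%}: P(0 acceptable out of 4) = {p_zero_of_four:.2f}")

# true merge rate 10%: P(0 acceptable out of 4) = 0.66
# true merge rate 25%: P(0 acceptable out of 4) = 0.32
# true merge rate 35%: P(0 acceptable out of 4) = 0.18
```

Even a true rate of 35% would produce 0/4 almost a fifth of the time, so the observation rules out fairly little.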
I don’t think this is much evidence that AI can’t sometimes write acceptable PRs in general, and there are examples of AIs doing this. On small projects I’ve worked on, AIs from a long time ago have written a big chunk of code ~zero-shot. Anecdotally, I’ve heard of people having success with AIs completing tasks zero-shot. I don’t know what you mean by “PR” that doesn’t include this.
I think I already answered this:

I’m sure that they can in some weak sense e.g. a sufficiently small project with sufficiently low standards, but as far as I remember the METR study concluded zero acceptable PRs in their context.
It seems to me that current LLMs learn hardly anything from the context, since they have trouble fitting it into their attention span. For example, GPT-5 can create fun stuff from just one prompt, and an unpublished LLM solved five out of six problems of IMO 2025, while the six problems together can be expressed in about 3 KB. However, METR found that “on 18 real tasks from two large open-source repositories, early-2025 AI agents often implement functionally correct code that cannot be easily used as-is, because of issues with test coverage, formatting/linting, or general code quality.”
I strongly suspect that this bottleneck will be ameliorated by using neuralese[1] with big internal memory.
Neuralese with big internal memory
The Meta paper which introduced neuralese had GPT-2 trained so that the thought produced at the end is fed back in at the beginning. The number of bits transferred that way equals the number of bits in a floating-point number multiplied by the size of the final layer. An ordinary CoT, by contrast, generates only ~16.6 extra bits of information per token.
At the cost of an absolute loss of interpretability, neuralese on steroids could let an LLM of GPT-3’s scale transfer tens of millions of bits[2] in the latent space. Imagine GPT-3 175B (which had 96 layers and 12288 neurons in each) receiving an augmentation that uses the last layer’s results as a steering vector at the beginning, the second-to-last layer’s results as a steering vector at the second layer, etc. Or passing the steering vectors through a matrix. These augmentations at most double the compute required to run GPT-3, while requiring millions of extra bytes of dynamic memory.
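As a rough sanity check on these numbers, here is a minimal sketch; the fp16 precision (16 bits per activation) and the ~100k-token vocabulary are my illustrative assumptions, not figures from the papers cited:

```python
import math

BITS_PER_FLOAT = 16          # assuming fp16 activations
VOCAB_SIZE = 100_000         # assuming a roughly GPT-4-sized tokenizer

# An ordinary CoT token is one choice out of the vocabulary.
bits_per_text_token = math.log2(VOCAB_SIZE)
print(f"text CoT token:           {bits_per_text_token:.1f} bits")      # ~16.6

# Neuralese as in the Meta paper: only the final hidden state is fed back in (GPT-2, d_model = 768).
print(f"GPT-2 latent thought:     {768 * BITS_PER_FLOAT:,} bits")        # 12,288

# The GPT-3-scale scheme above: one steering vector per layer.
n_layers, d_model = 96, 12_288
print(f"GPT-3 per-layer vectors:  {n_layers * d_model * BITS_PER_FLOAT:,} bits")  # 18,874,368
```

So the per-layer scheme would move roughly a million times more information per step than a single text token, which is where the “tens of millions of bits” figure comes from.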
For comparison, the human brain’s short-term memory alone is described by the activations of around 86 billion neurons. And that’s ignoring medium-term and long-term memory...
However, there is Knight Lee’s proposal to have the AIs generate multiple tokens instead of using versions of neuralese.
For comparison, the longest context window, used by Google Gemini, is 1M tokens long, and 1M tokens are represented by about 16.6M bits.
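Under the same illustrative assumptions as the sketch above, the whole 1M-token context carries raw information on the same order as a single per-layer neuralese pass:

```python
# Same ~16.6-bits-per-token assumption as above (illustrative only).
context_window_tokens = 1_000_000
context_bits = context_window_tokens * 16.6          # ~16.6 million bits of text
neuralese_bits_per_step = 96 * 12_288 * 16           # ~18.9 million bits (GPT-3-scale scheme)
print(f"1M-token context:          {context_bits / 1e6:.1f}M bits")
print(f"neuralese memory per step: {neuralese_bits_per_step / 1e6:.1f}M bits")
```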
People have been talking about neuralese since at least when AI 2027 was published, and I think much earlier, but it doesn’t seem to have materialized.