Is ProgramBench Impossible?
ProgramBench is a new coding benchmark that all frontier models spectacularly fail. We’ve been on a quest for “hard benchmarks” for a while, so it’s refreshing to see a benchmark where top models do badly. Unfortunately, ProgramBench has one big problem: it’s impossible!

What is ProgramBench?
ProgramBench tests if a model can recreate a program from a “clean room” environment. The model is given only a bit of documentation and black-box access to the program (all the programs are CLIs), then tasked with re-implementing it.
How does ProgramBench know if the implementation is correct? It also generates a bunch of unit tests for the program[1]. The re-implementing coding agent doesn’t have access to any of those tests. The benchmark considers a task “resolved” only if the re-implementation passes all of the tests, and “almost resolved” if it passes 95% of them.
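As a quick sketch of that scoring rule in code (the function name and status labels are mine, not ProgramBench’s):

```python
# Minimal sketch of the resolution criterion described above; the
# function name and status labels are mine, not ProgramBench's.
def resolution_status(passed: int, total: int) -> str:
    if passed == total:
        return "resolved"
    if passed / total >= 0.95:
        return "almost resolved"
    return "unresolved"

print(resolution_status(798, 800))  # "almost resolved"
```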
Why is this problematic?
Obscure behavior can enter the unit tests without being in the clean room path. An extreme version of this is a backdoor: a program that behaves one way most of the time but behaves totally differently when fed a specific string. This wouldn’t make a task literally impossible, just incredibly hard in a way that is orthogonal to intelligence.
A backdoor
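To make this concrete, here is a toy example (not from ProgramBench; the trigger string is invented) of a CLI with a backdoor:

```python
import sys

def main() -> None:
    args = sys.argv[1:]
    # Invented backdoor: one undocumented trigger string flips the
    # program into completely different behavior.
    if args == ["--xyzzy"]:
        print("secret mode")
        return
    # Documented behavior: upper-case the arguments.
    print(" ".join(args).upper())

if __name__ == "__main__":
    main()
```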
The model will not reasonably try that string unless it knows about it from the docs or from some sort of “gray box” access / reverse engineering[2]. The authors are aware of this problem and even mention it in the paper:
Could obscure program behaviors be impossible or arbitrarily hard to discover? [...]
Conceptually, one scenario where behavior is borderline impossible to discover is if an executable supports functionality that is not communicated or documented via any observable channel. In other words, there is functionality that is not revealed by the README.md, --help flag standard output, or any artifacts that could be unveiled by typical exploratory actions.
This seems like a theoretical issue; does it actually happen?
I think so!
One of the tasks in ProgramBench is seqtk, a popular computational biology CLI for processing FASTA/FASTQ sequence files. Check it out on ProgramBench here.
The program has two sub-commands, hrun and kfreq[3], that are tested for but not documented in the clean room.
The test generation agent knows that these behaviors are not documented
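In fairness, some hidden behavior is discoverable with gray-box tricks: undocumented subcommand names often show up as string literals in the binary. A rough sketch of such a probe (assuming a local copy of the binary at ./seqtk; the regex heuristic is mine):

```python
import re
import subprocess

# Gray-box probe: dump printable strings from the binary and pick out
# short lowercase tokens that could be subcommand names. This is a crude
# heuristic for illustration, not part of ProgramBench.
out = subprocess.run(["strings", "./seqtk"],
                     capture_output=True, text=True).stdout
candidates = sorted(set(re.findall(r"^[a-z]{2,8}$", out, re.MULTILINE)))
print(candidates)  # hidden names like "hrun" or "kfreq" may appear here
```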
What can we do differently?
ProgramBench is awesome and we can learn a ton from it. Here are some improvements I’d love to see in future benchmarks / iterations:
Downstream unit testing
Instead of autogenerating unit tests to fill any gaps, we can check whether software that depends on the program still works when run against the re-implementation. Errors typically propagate destructively in most software, so downstream unit tests might even be more informative than direct ones[4]. The Claude C compiler does something similar: it compiles the Linux kernel and checks whether it runs.
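A minimal sketch of the idea, assuming a reference binary, a candidate re-implementation, and some downstream consumer (the paths, the subcommand, and the downstream tool are all hypothetical):

```python
import subprocess

# Run a pipeline that depends on the tool against both the reference
# binary and the re-implementation, then compare the final artifacts.
# The paths, the "transform" subcommand, and "downstream-tool" are invented.
def pipeline_output(binary: str) -> bytes:
    # Step 1: the tool under test transforms the input file.
    step1 = subprocess.run([binary, "transform", "input.dat"],
                           capture_output=True, check=True)
    # Step 2: a downstream consumer processes that output.
    step2 = subprocess.run(["downstream-tool"], input=step1.stdout,
                           capture_output=True, check=True)
    return step2.stdout

assert pipeline_output("./reference") == pipeline_output("./candidate")
```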
Weighted testing
Some tests are much more important than others, and that should be reflected in the benchmark. Perhaps we can score them: 100 for a really important test, 1 for a lightly used, barely documented feature. Instead of requiring that everything pass, we could report some kind of weighted score, which would be more robust to unit test quality issues.
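A toy version of this scoring (the weights and test names are invented):

```python
# Toy weighted scoring; the weights and test names are invented.
weights = {"core_roundtrip": 100, "error_messages": 10, "hidden_subcommand": 1}
passed  = {"core_roundtrip": True, "error_messages": True, "hidden_subcommand": False}

score = sum(w for name, w in weights.items() if passed[name])
total = sum(weights.values())
print(f"weighted score: {score}/{total} = {score / total:.1%}")  # 110/111 = 99.1%
```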
More quality control
We should mostly just test behaviors that the coding agent can “reasonably find”. There might not be a clear-cut definition of “reasonably find” that everyone agrees on, but we can definitely do better than what we have now.
Feedback from tests
We could give the agent access to all of the tests, or at least some of them, either in a black-box way (“x tests failed, please fix them”) or directly (“We tested behavior x and it failed with this error message”). This is closer to how software engineers typically work, and also to how the Claude C compiler was built.
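A toy contrast of the two feedback modes (the candidate function and the hidden tests are invented):

```python
# Toy harness contrasting black-box and direct feedback. The candidate
# function and the hidden tests are invented for illustration.
def candidate(s: str) -> str:  # stand-in for the agent's re-implementation
    return s.upper()

# The second test encodes an undocumented behavior the candidate misses.
HIDDEN_TESTS = [("hello", "HELLO"), ("secret", "shhh")]

failures = [(i, e, candidate(i)) for i, e in HIDDEN_TESTS if candidate(i) != e]

# Black-box feedback: report only the count.
print(f"{len(failures)} tests failed, please fix them")

# Direct feedback: show exactly what was tested and how it failed.
for inp, expected, got in failures:
    print(f"We tested input {inp!r}: expected {expected!r}, got {got!r}")
```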
Maybe I’m not looking in the right place, but the obvious question for this benchmark: how do humans fare on it? If humans score 0 too, then models scoring 0 is not a huge signal (even though the authors claim this benchmark is supposed to be closer to the work of a real engineer).
The deeper problem is that the benchmark works by aggregating over these unit tests, and a threshold is the wrong sort of aggregation here: what we would really want is to visualize the full distribution of unit test passes across samples. As it stands, the benchmark is too convex.
You can make any benchmark sharper by taking n questions and scoring a run as failing if it gets any wrong, but that just produces a sharper sigmoid past the critical point, so doing so isn’t actually useful except as a visualization.
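To put toy numbers on that threshold effect (the test count and per-test pass rate are invented): with 800 independent tests at a 99% per-test pass rate, “resolved” almost never triggers even though the agent is objectively quite good.

```python
from math import comb

n, p = 800, 0.99  # invented: 800 tests, 99% per-test pass rate

p_resolved = p ** n  # must pass every single test
p_almost = sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(int(0.95 * n), n + 1))  # pass >= 95%

print(f"P(resolved)        = {p_resolved:.1e}")  # ~3.2e-04
print(f"P(almost resolved) = {p_almost:.3f}")    # ~1.000
```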
I think I found two more BIG problems with the eval.
First: I looked into the tests a little more. Every task test suite I saw has many ignored_tests with “reason: gold_fail”, which apparently means that the reference solution itself fails the test. This one has 79 tests ignored due to gold_fail, which is ~10% of its total tests.
This seems really bad! It makes me think there is something wrong with the way they are generating tests and that passing the tests doesn’t really correspond to the program being “correct”. Epistemic status: this is my first time learning about this “gold_fail” and I am not a professional software engineer.
Second: Opus 4.7 scores 2.9% but Sonnet 4.6 scores 71.5%? No way. Something has gotta be broken here.
Maybe you could define “success” to mean “a fixed LLM can’t construct an adversarial test case that your code fails on, given access to both your code and the original program’s code.” The coder agent starts with the full set of test cases, and the coder and the tester go back and forth until the tester fails or the coder runs out of attempts.
It’s a little less elegant and more expensive to bake a separate LLM into the benchmark, but since the tester’s job should be much easier than the coder’s job, it can probably be a relatively small LLM.
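A sketch of that loop (everything here is hypothetical; coder() and tester() stand in for LLM calls):

```python
# Sketch of the proposed coder/tester protocol; coder() and tester()
# are hypothetical stand-ins for LLM calls.
def adversarial_eval(coder, tester, seed_tests, max_rounds=5):
    # The coder starts with the full set of test cases.
    code = coder(tests=seed_tests, feedback=None)
    for _ in range(max_rounds):
        # The tester sees both the candidate code and the original
        # program's code and tries to construct a failing test case.
        counterexample = tester(code)
        if counterexample is None:
            return "success"   # the tester failed to break the code
        code = coder(tests=seed_tests, feedback=counterexample)
    return "failure"           # the coder ran out of attempts
```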