Is ProgramBench Impossible?
ProgramBench is a new coding benchmark that all frontier models spectacularly fail. We’ve been on a quest for “hard benchmarks” for a while, so it’s refreshing to see a benchmark where top models do badly. Unfortunately, ProgramBench has one big problem: it’s impossible!

What is ProgramBench?
ProgramBench tests if a model can recreate a program from a “clean room” environment. The model is given only a bit of documentation and black-box access to the program (all the programs are CLIs), then tasked with re-implementing it.
How does ProgramBench know if the implementation is correct? It also generates a bunch of unit tests for the program[1]. The re-implementing coding agent doesn’t have access to any of those tests. The benchmark considers a task “resolved” only if the re-implementation passes all of the tests, and “almost resolved” if it passes 95% of them.
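As a quick sketch of that scoring rule in code (the function name and status labels are mine, not ProgramBench’s):

```python
# Minimal sketch of the resolution criterion described above; the
# function name and status labels are mine, not ProgramBench's.
def resolution_status(passed: int, total: int) -> str:
    if passed == total:
        return "resolved"
    if passed / total >= 0.95:
        return "almost resolved"
    return "unresolved"

print(resolution_status(798, 800))  # "almost resolved"
```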
Why is this problematic?
Obscure behavior can enter the unit tests without being in the clean room path. An extreme version of this is a backdoor: a program that behaves one way most of the time but behaves totally differently when fed a specific string. This wouldn’t make a task literally impossible, just incredibly hard in a way that is orthogonal to intelligence.
A backdoor
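To make this concrete, here is a toy example (not from ProgramBench; the trigger string is invented) of a CLI with a backdoor:

```python
import sys

def main() -> None:
    args = sys.argv[1:]
    # Invented backdoor: one undocumented trigger string flips the
    # program into completely different behavior.
    if args == ["--xyzzy"]:
        print("secret mode")
        return
    # Documented behavior: upper-case the arguments.
    print(" ".join(args).upper())

if __name__ == "__main__":
    main()
```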
The model will not reasonably try that string unless it knows about it from the docs or from some sort of “gray box” access / reverse engineering[2]. The authors are aware of this problem and even mention it in the paper:
Could obscure program behaviors be impossible or arbitrarily hard to discover? [...]
Conceptually, one scenario where behavior is borderline impossible to discover is if an executable supports functionality that is not communicated or documented via any observable channel. In other words, there is functionality that is not revealed by the README.md, --help flag standard output, or any artifacts that could be unveiled by typical exploratory actions.
This seems like a theoretical issue; does it actually happen?
I think so!
One of the tasks in ProgramBench is seqtk, a popular computational biology CLI for processing FASTA/FASTQ sequence files. Check it out on ProgramBench here.
The program has two sub-commands, hrun and kfreq[3], that are tested for but not documented in the clean room.
The test generation agent knows that these behaviors are not documented
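In fairness, some hidden behavior is discoverable with gray-box tricks: undocumented subcommand names often show up as string literals in the binary. A rough sketch of such a probe (assuming a local copy of the binary at ./seqtk; the regex heuristic is mine):

```python
import re
import subprocess

# Gray-box probe: dump printable strings from the binary and pick out
# short lowercase tokens that could be subcommand names. This is a crude
# heuristic for illustration, not part of ProgramBench.
out = subprocess.run(["strings", "./seqtk"],
                     capture_output=True, text=True).stdout
candidates = sorted(set(re.findall(r"^[a-z]{2,8}$", out, re.MULTILINE)))
print(candidates)  # hidden names like "hrun" or "kfreq" may appear here
```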
What can we do differently?
ProgramBench is awesome and we can learn a ton from it. Here are some improvements I’d love to see in future benchmarks / iterations:
Downstream unit testing
Instead of autogenerating unit tests to fill any gaps, we can check whether software that depends on the program still works when run against the re-implementation. Errors typically propagate destructively in most software, so downstream unit tests might even be more informative than direct ones[4]. The Claude C compiler does something similar: it compiles the Linux kernel and checks whether it runs.
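A minimal sketch of the idea, assuming a reference binary, a candidate re-implementation, and some downstream consumer (the paths, the subcommand, and the downstream tool are all hypothetical):

```python
import subprocess

# Run a pipeline that depends on the tool against both the reference
# binary and the re-implementation, then compare the final artifacts.
# The paths, the "transform" subcommand, and "downstream-tool" are invented.
def pipeline_output(binary: str) -> bytes:
    # Step 1: the tool under test transforms the input file.
    step1 = subprocess.run([binary, "transform", "input.dat"],
                           capture_output=True, check=True)
    # Step 2: a downstream consumer processes that output.
    step2 = subprocess.run(["downstream-tool"], input=step1.stdout,
                           capture_output=True, check=True)
    return step2.stdout

assert pipeline_output("./reference") == pipeline_output("./candidate")
```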
Weighted testing
Some tests are much more important than others, and that should be reflected in the benchmark. Perhaps we can score them: 100 for a really important test, 1 for a lightly used, barely documented feature. Instead of requiring that everything pass, we could report some kind of weighted score, which would be more robust to unit test quality issues.
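A toy version of this scoring (the weights and test names are invented):

```python
# Toy weighted scoring; the weights and test names are invented.
weights = {"core_roundtrip": 100, "error_messages": 10, "hidden_subcommand": 1}
passed  = {"core_roundtrip": True, "error_messages": True, "hidden_subcommand": False}

score = sum(w for name, w in weights.items() if passed[name])
total = sum(weights.values())
print(f"weighted score: {score}/{total} = {score / total:.1%}")  # 110/111 = 99.1%
```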
More quality control
We should mostly just test behaviors that the coding agent can “reasonably find”. There might not be a clear-cut definition of “reasonably find” that everyone agrees on, but we can definitely do better than what we have now.
Feedback from tests
We could give the agent access to all of the tests, or at least some of them, either in a black-box way (“x tests failed, please fix them”) or directly (“We tested behavior x and it failed with this error message”). This is closer to how software engineers typically work, and also to how the Claude C compiler was built.
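A toy contrast of the two feedback modes (the candidate function and the hidden tests are invented):

```python
# Toy harness contrasting black-box and direct feedback. The candidate
# function and the hidden tests are invented for illustration.
def candidate(s: str) -> str:  # stand-in for the agent's re-implementation
    return s.upper()

# The second test encodes an undocumented behavior the candidate misses.
HIDDEN_TESTS = [("hello", "HELLO"), ("secret", "shhh")]

failures = [(i, e, candidate(i)) for i, e in HIDDEN_TESTS if candidate(i) != e]

# Black-box feedback: report only the count.
print(f"{len(failures)} tests failed, please fix them")

# Direct feedback: show exactly what was tested and how it failed.
for inp, expected, got in failures:
    print(f"We tested input {inp!r}: expected {expected!r}, got {got!r}")
```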
Maybe I’m not looking in the right place, but the obvious question for this benchmark: how do humans fare on it? If humans score 0 too, then models scoring 0 is not a huge signal (even though the authors claim this benchmark is supposed to be closer to the work of a real engineer).
The deeper problem is that the benchmark works by aggregating over these unit tests, and a threshold is the wrong sort of aggregation here: what we would really want is to visualize the full distribution of unit test passes across samples. As it stands, the benchmark is too convex.
You can make any benchmark sharper by taking n questions and scoring a run as failing if it gets any wrong, but that just produces a sharper sigmoid past the critical point, so doing so isn’t actually useful except as a visualization.
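To put toy numbers on that threshold effect (the test count and per-test pass rate are invented): with 800 independent tests at a 99% per-test pass rate, “resolved” almost never triggers even though the agent is objectively quite good.

```python
from math import comb

n, p = 800, 0.99  # invented: 800 tests, 99% per-test pass rate

p_resolved = p ** n  # must pass every single test
p_almost = sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(int(0.95 * n), n + 1))  # pass >= 95%

print(f"P(resolved)        = {p_resolved:.1e}")  # ~3.2e-04
print(f"P(almost resolved) = {p_almost:.3f}")    # ~1.000
```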
I think I found two more BIG problems with the eval.
First: I looked into the tests a little more. Every task test suite I saw has many ignored_tests with “reason: gold_fail”, which apparently means that the reference solution itself fails the test. This one has 79 tests ignored due to gold_fail, which is ~10% of its total tests.
This seems really bad! It makes me think there is something wrong with the way they are generating tests and that passing the tests doesn’t really correspond to the program being “correct”. Epistemic status: this is my first time learning about this “gold_fail” and I am not a professional software engineer.
Second: Opus 4.7 scores 2.9% but Sonnet 4.6 scores 71.5%? No way. Something has gotta be broken here.
Maybe you could define “success” to mean “a fixed LLM can’t construct an adversarial test case that your code fails on, given access to both your code and the original program’s code.” The coder agent starts with the full set of test cases, and the coder and the tester go back and forth until the tester fails or the coder runs out of attempts.
It’s a little less elegant and more expensive to bake a separate LLM into the benchmark, but since the tester’s job should be much easier than the coder’s job, it can probably be a relatively small LLM.
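A sketch of that loop (everything here is hypothetical; coder() and tester() stand in for LLM calls):

```python
# Sketch of the proposed coder/tester protocol; coder() and tester()
# are hypothetical stand-ins for LLM calls.
def adversarial_eval(coder, tester, seed_tests, max_rounds=5):
    # The coder starts with the full set of test cases.
    code = coder(tests=seed_tests, feedback=None)
    for _ in range(max_rounds):
        # The tester sees both the candidate code and the original
        # program's code and tries to construct a failing test case.
        counterexample = tester(code)
        if counterexample is None:
            return "success"   # the tester failed to break the code
        code = coder(tests=seed_tests, feedback=counterexample)
    return "failure"           # the coder ran out of attempts
```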