catawampless comments on Is ProgramBench Impossible?

catawampless 17 May 2026 20:25 UTC
1 point
Downstream unit testing
Instead of autogenerating unit tests to fill any gaps, we can check whether software that depends on the implementation under test still works.
Simon Willison discusses an early version of this employed by the StrongDM team. For context, they build Digital Twins of all the software their system depends on and have agents run QA testing continually against those twins. To verify that the Digital Twins are correct, one technique they use is checking that client libraries which make use of those dependencies still work against the twins. Direct quote from the article:
I did have an initial key insight which led to a repeatable strategy to ensure a high level of fidelity between DTU vs. the official canonical SaaS services:
Use the top popular publicly available reference SDK client libraries as compatibility targets, with the goal always being 100% compatibility.
I could see a version of this where a benchmark like ProgramBench is built, but the bar is that libraries that make use of those dependencies still pass their integration test suites with a swapped-out implementation.