Yair Halberstadt comments on Is ProgramBench Impossible?

Yair Halberstadt 10 May 2026 9:39 UTC
2 points
0

Feedback from tests We can give the agent access to all of the tests or at least some of the tests either in a blackbox way (“x tests failed, please fix them”) or directly (“We tested behavior x and it failed with this error message”). This is more similar to how software engineers typically work and also how the Claude C compiler was built.

The risk of that is that the LLM can then trivially get 100% by hardcoding the response for each test case, rather than creating a generic solution.