Maybe you could define “success” to mean “a fixed LLM can’t construct an adversarial test case that your code fails on, given access to both your code and the original program’s code.” The coder agent starts with the full set of test cases, and the coder and the tester go back and forth until the tester fails to find a failing case or the coder runs out of attempts.
It’s a little less elegant and more expensive to bake a separate LLM into the benchmark, but since the tester’s job should be much easier than the coder’s job, it can probably be a relatively small LLM.
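To make the back-and-forth concrete, here is a rough Python sketch of how that loop might look. Everything in it is my own stand-in: `run_benchmark`, `toy_coder`, and `brute_force_tester` are hypothetical names, programs are simplified to integer functions, and plain callables take the place of the two LLMs. It also assumes that a tester which returns a bogus counterexample counts as having failed.

```python
from typing import Callable, List, Optional

Program = Callable[[int], int]
# Hypothetical tester: sees candidate and reference, returns a failing input or None.
Tester = Callable[[Program, Program], Optional[int]]
# Hypothetical coder: sees every test input known so far, returns a new candidate.
Coder = Callable[[List[int]], Program]


def run_benchmark(coder: Coder, tester: Tester, reference: Program,
                  initial_tests: List[int], max_attempts: int) -> bool:
    """Return True if the coder produces code the tester cannot break."""
    tests = list(initial_tests)
    for _ in range(max_attempts):
        candidate = coder(tests)
        counterexample = tester(candidate, reference)
        # Assumption: an unverifiable counterexample counts as a tester failure.
        if counterexample is None or candidate(counterexample) == reference(counterexample):
            return True
        tests.append(counterexample)  # the coder sees it on the next attempt
    return False  # coder ran out of attempts


def toy_coder(tests: List[int]) -> Program:
    # Toy stand-in for the coder LLM: only handles negatives once it has seen one.
    if any(t < 0 for t in tests):
        return lambda x: abs(x)
    return lambda x: x


def brute_force_tester(candidate: Program, reference: Program) -> Optional[int]:
    # Toy stand-in for the tester LLM: scan a small input range for a mismatch.
    for x in range(-5, 6):
        if candidate(x) != reference(x):
            return x
    return None
```

With `abs` as the original program and `[0, 1, 2]` as the starting tests, `run_benchmark(toy_coder, brute_force_tester, abs, [0, 1, 2], max_attempts=3)` succeeds on the second attempt, while `max_attempts=1` fails because the tester finds a negative input right away.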