Perhaps an intermediate setup between the standard method of creating benchmarks and the one you describe is a benchmark with a “manager LLM” that the agent can query. Each task in the benchmark would be designed with two sets of instructions: 1. normal high-level instructions given to the agent at the start of the task, akin to the instructions a SWE would get, and 2. additional, more specific instructions describing the desired behavior, which are given to the manager as context. Whenever the agent queries the manager, the manager either answers the question if it has relevant additional information, or stays silent if it doesn’t. You could also grade the model on the number of questions it needs to ask to arrive at the correct final code, although whether a given prompt counts as one question, two, or more can be hard to adjudicate.
This would be trickier to set up and has some challenges: for example, you’d want to make sure the manager doesn’t overshare. But it could more accurately simulate the settings the model would actually encounter in practice.
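A minimal sketch of what that harness might look like, with all names hypothetical: a real setup would prompt an actual LLM (with the private context plus an instruction not to overshare) wherever the crude keyword-overlap stub appears below.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    agent_instructions: str  # high-level brief the agent sees at the start
    manager_context: str     # more specific spec only the manager sees

@dataclass
class Manager:
    task: Task
    questions_asked: int = 0  # tracked so the grader can penalize chattiness

    def ask(self, question: str) -> Optional[str]:
        """Answer only if the private context is relevant to the question;
        otherwise stay silent (return None). The relevance check here is a
        toy word-overlap stub standing in for a real LLM call."""
        self.questions_asked += 1
        q_words = set(re.findall(r"\w+", question.lower()))
        ctx_words = set(re.findall(r"\w+", self.task.manager_context.lower()))
        if q_words & ctx_words:
            return self.task.manager_context  # a real manager would answer narrowly
        return None

task = Task(
    agent_instructions="Add retry logic to the upload client.",
    manager_context="Use exponential backoff with a maximum of 5 attempts.",
)
manager = Manager(task)

print(manager.ask("How many retry attempts are allowed?"))  # relevant: answers
print(manager.ask("What color should the logo be?"))        # irrelevant: None
print(manager.questions_asked)                              # 2
```

The key design point is that the manager only responds when its extra context bears on the question, so the agent can't just dump the whole task at the manager and extract the hidden spec for free.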