What we’ve currently published is ‘number of agents that completed each task’, which similarly makes comparisons between models harder. Does that seem like it addresses the downside sufficiently?