A bit of thought on this approach: in the past we’ve always designed agentic benchmarks with the goal that humans should be able to complete them alone (as a proof that the tasks are reliably possible and doable without external hints). However, requiring humans to solve a benchmark alone as part of the benchmark design process won’t scale to AIs with long enough task-length horizons, since you’d have to get a human to work for that long just to verify each benchmark task.
I’d like to see other possibility proofs, such as the performance of humans when solving this benchmark with access to the latest frontier AI models.
Yeah, I totally agree, this is probably the right approach. I recently wrote about this type of benchmarking.