2001zhaozhao comments on Is ProgramBench Impossible?

2001zhaozhao 11 May 2026 18:36 UTC
4 points
0
A thought on this approach: in the past we’ve always designed agentic benchmarks with the goal that humans should be able to complete them alone (as proof that the tasks are reliably doable without external hints). However, requiring humans to solve a benchmark alone as part of the design process won’t scale to AIs with long enough task-length horizons, since you’d have to get a human to work that long to verify each benchmark task.
I’d like to see other possibility proofs, such as the performance of humans when solving this benchmark with access to the latest frontier AI models.
- frmsaul 11 May 2026 20:06 UTC
  1 point
  0
  Yeah, I totally agree, this is probably the right approach. I recently wrote about this type of benchmarking.