Fwiw, human underperformance on vending-bench is an elicitation problem. The human was only given a fixed amount of time (iirc 5 hours), whereas the LLMs run until they lose coherence. The human maintained coherence throughout but types more slowly, and therefore saw fewer days of simulation.
The human is also the only case with n=1.