Fwiw, human underperformance on Vending-Bench is an elicitation problem. The human was only given a fixed amount of time (iirc 5 hours), whereas the LLMs run until they lose coherence. The human maintained coherence throughout but typed more slowly, and therefore got through fewer days of simulation.
The human is also the only case with n=1.
2 cents, not based on source code or these specific claims: It’s a terminal app written in JavaScript, which is often slow/flickery because it redraws the entire screen whenever something changes. Codex, by contrast, fast-followed and was very quickly rewritten in Rust, and its UX is noticeably snappier.
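
To make the redraw point concrete, here's a rough sketch of the difference between repainting everything and repainting only what changed. This is purely illustrative and not taken from either tool's source; `fullRedraw` and `diffRedraw` are hypothetical names, and the only real things in it are the standard ANSI escape sequences (clear screen, cursor position, erase line).

```typescript
// Hypothetical sketch: not based on any real terminal app's source code.
const ESC = "\x1b";

// Full redraw: clear the screen, move the cursor home, rewrite every line.
function fullRedraw(next: string[]): string {
  return `${ESC}[2J${ESC}[H` + next.join("\n");
}

// Diff redraw: rewrite only the rows whose content changed.
function diffRedraw(prev: string[], next: string[]): string {
  let out = "";
  const rows = Math.max(prev.length, next.length);
  for (let row = 0; row < rows; row++) {
    if (prev[row] === next[row]) continue;
    // Move to (row + 1, column 1), erase that line, write the new content.
    out += `${ESC}[${row + 1};1H${ESC}[2K${next[row] ?? ""}`;
  }
  return out;
}

const before = ["header", "line a", "status: idle"];
const after = ["header", "line a", "status: busy"];

// Compare how many characters each strategy sends to the terminal.
console.log(fullRedraw(after).length, diffRedraw(before, after).length);
```

With a full redraw the terminal gets a clear plus every line on every change, which is where flicker tends to come from; the diff version only touches the one row that changed. Whether either tool actually works this way is an assumption on my part.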