Aaron Staley comments on Cole Wyeth’s Shortform

Aaron Staley 19 Jul 2025 0:06 UTC
12 points
3
If I understand correctly, Claude’s pass@X benchmarks mean sampling multiple times and taking the best result. That is valid as long as the compute cost doesn’t exceed the equivalent cost of an engineer.
Codex’s pass@8 score seems to be saying “the correct solution was present among the 8 attempts, but the model doesn’t actually know which one it is”. That shouldn’t count.
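
To make the distinction concrete, here is a rough sketch in Python. It is only my illustration, not how either lab computes its published numbers. `pass_at_k` is the usual unbiased pass@k estimator (the chance that at least one of k sampled attempts is correct), while `oracle_vs_selected` is a hypothetical helper comparing "a correct answer was somewhere in the batch" against "the attempt the model actually picked was correct". The `results` and `picks` data are made up.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k attempts, drawn
    without replacement from n attempts of which c are correct, is correct."""
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any draw of k has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def oracle_vs_selected(results, picks):
    """results[i][j] is True if attempt j on problem i was correct.
    picks[i] is the index the model itself chose to submit (hypothetical).
    Returns (oracle rate, selected rate) over all problems."""
    total = len(results)
    # Oracle: credit the problem if ANY attempt was correct.
    oracle = sum(any(attempts) for attempts in results) / total
    # Selected: credit the problem only if the chosen attempt was correct.
    selected = sum(attempts[i] for attempts, i in zip(results, picks)) / total
    return oracle, selected


# Made-up data: 3 problems, 3 attempts each.
results = [
    [False, True, False],
    [False, False, False],
    [True, True, False],
]
picks = [0, 0, 1]  # the attempt the model would submit for each problem

oracle, selected = oracle_vs_selected(results, picks)
print(f"oracle: {oracle:.2f}, selected: {selected:.2f}")
print(f"pass@1 with 1 of 8 correct: {pass_at_k(8, 1, 1):.3f}")
print(f"pass@8 with 1 of 8 correct: {pass_at_k(8, 1, 8):.3f}")
```

With one correct attempt out of eight, pass@8 is 1.0 while pass@1 is only 0.125. The gap between the oracle rate and the selected rate is the part that depends on the model knowing which attempt to submit.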