More interesting than the score is the implication that these were pass@1 results, i.e., the model produced a single final “best shot” for each question, which at the end of 4.5 hours was handed off to human graders, rather than pass@1000 with literally thousands of automated attempts. If true, this suggests that test-time scaling is moving away from the “spray and pray” paradigm. It feels closer to “actually doing thinking”. This is kinda scary.
Eh. Scaffolds that involve agents privately iterating on ideas and then outputting a single result are a known approach; see, e.g., this, or Deep Research, or possibly o1 pro/o3 pro. I expect it’s something along the same lines, except with some trick that makes it work better than ever before… Oh, come to think of it, Noam Brown did have that interview I was meaning to watch, about “scaling test-time compute to multi-agent civilizations”. That sounds relevant.
I mean, it can be scary, for sure; no way to be certain until we see the details.
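For readers less familiar with the pass@1 vs. pass@1000 distinction above: pass@k asks whether at least one of k sampled attempts is correct, so large k rewards “spray and pray” while pass@1 requires the model to commit to one answer. A minimal sketch of the standard unbiased pass@k estimator (the one popularized by OpenAI’s Codex paper; we don’t know what evaluation harness the lab in question actually used):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k, given n total samples of which
    c were correct. Returns P(at least one of k draws is correct)."""
    if n - c < k:
        # Fewer wrong samples than draws: some draw must be correct.
        return 1.0
    # 1 - P(all k draws come from the n - c incorrect samples)
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 with a single attempt is just whether that attempt succeeded:
print(pass_at_k(1, 1, 1))  # 1.0
print(pass_at_k(1, 0, 1))  # 0.0

# "Spray and pray": 1000 attempts with 3 correct gives pass@1000 = 1.0,
# but the same run's estimated pass@1 is only 0.003.
print(pass_at_k(1000, 3, 1000))  # 1.0
print(pass_at_k(1000, 3, 1))     # 0.003
```

The gap between those last two numbers is exactly why a genuine single-shot pass@1 result would be a much stronger claim than a pass@1000 one.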