More interesting than the score is the implication that these were pass@1 results, i.e., the model produced a single final “best shot” for each question, which at the end of 4.5 hours was handed off to human graders, rather than pass@1000 with literally thousands of automated attempts. If true, this suggests that test-time scaling is moving away from the “spray and pray” paradigm. It feels closer to “actually doing thinking”. This is kinda scary.
Eh. Scaffolds that involve agents privately iterating on ideas and then outputting a single result are a known approach; see, e.g., this, or Deep Research, or possibly o1 pro/o3 pro. I expect it’s something along the same lines, except with some trick that makes it work better than ever before… Oh, come to think of it, Noam Brown did have that interview I was meaning to watch, about “scaling test-time compute to multi-agent civilizations”. That sounds relevant.
I mean, it can be scary, for sure; no way to be certain until we see the details.
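For readers less familiar with the pass@1 vs. pass@1000 distinction above: pass@k asks whether at least one of k sampled attempts is correct, so large k rewards “spray and pray” while pass@1 requires the model to commit to one answer. A minimal sketch of the standard unbiased pass@k estimator (the one popularized by OpenAI’s Codex paper; we don’t know what evaluation harness the lab in question actually used):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k, given n total samples of which
    c were correct. Returns P(at least one of k draws is correct)."""
    if n - c < k:
        # Fewer wrong samples than draws: some draw must be correct.
        return 1.0
    # 1 - P(all k draws come from the n - c incorrect samples)
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 with a single attempt is just whether that attempt succeeded:
print(pass_at_k(1, 1, 1))  # 1.0
print(pass_at_k(1, 0, 1))  # 0.0

# "Spray and pray": 1000 attempts with 3 correct gives pass@1000 = 1.0,
# but the same run's estimated pass@1 is only 0.003.
print(pass_at_k(1000, 3, 1000))  # 1.0
print(pass_at_k(1000, 3, 1))     # 0.003
```

The gap between those last two numbers is exactly why a genuine single-shot pass@1 result would be a much stronger claim than a pass@1000 one.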