I think the extra effort required to go from algorithmically qualifying to holistically qualifying scales linearly with task difficulty. Dense reward-model scaling on hard-to-verify tasks seems to have cracked this: DeepMind's polished, holistically passing IMO solutions probably required the same order of magnitude of compute/effort as OpenAI's technically correct but less polished IMO solutions. (They used similar levels of models, compute, and time to get their respective results.)
So while this will shift timelines, it's something that will fall to scale, and thus shouldn't shift them too much.
I predict that once these methods make their way into commercial models, this gap will go away, in roughly a year. I'll check back in 2026 to see if I'm wrong.
Wondering why you think this.