I think the extra effort required to go from algorithmically qualifying to holistically qualifying scales linearly with task difficulty. Dense reward-model scaling on hard-to-verify tasks seems to have cracked this: DeepMind's polished, holistically passing IMO solutions probably required the same order of magnitude of compute/effort as OpenAI's technically correct but less polished IMO solutions. (They used similar levels of models, compute, and time to get their respective results.)
So while this will shift timelines, it's something that will fall to scale, and thus shouldn't shift them too much.
I predict that once these methods make their way into commercial models, this gap will go away, in roughly a year. I'll check back in 2026 to see if I'm wrong.
Wondering why you think this.