I think the extra effort required to go from algorithmically correct to holistically polished scales linearly with task difficulty. Dense reward model scaling on hard-to-verify tasks seems to have cracked this. DeepMind's polished, holistically passing IMO solutions probably required the same order of magnitude of compute/effort as OpenAI's technically correct but less polished IMO solutions. (They used similar levels of models, compute, and time to get their respective results.)
Wondering why you think this?
I think basically every time someone has a story like this, it's wrong. I don't understand why people seem so eager to blame cultural forces for ubiquitous behavior in this fashion. I guess it makes humans seem more interesting.