I definitely expect that problems at this level of difficulty are within reach for present frontier models. That being said, as I understand it, most labs are still soliciting expert data and doing human-in-the-loop process reward modelling, and those that aren’t (mostly because they think RLVR is better and they have the spare compute) are still using the data they solicited in the past, or are distilling from models that used that data, and so on. For basically the past two years, any math problem known to stump LLMs even occasionally has been worth ~$75 to any contractor, anywhere in the world, working as a data generator for companies like Scale AI. You should expect that any math problem which has been posted publicly, seen by more than ~50 people, and stated to be hard for LLMs in that time period has been trained on, detached from the canary string.
I don’t think it’s been trained on and all present frontier models one-shot it.