So once an AI system trained end-to-end produces roughly as much research value per token as a human researcher produces per second of thinking, AI research will be more than fully automated (since frontier inference can emit vastly more tokens per day than human researchers can produce seconds of thinking). This means that, at the point when AI first contributes more to AI research than humans do, the average research progress produced by one token of output will still be significantly less than what an average human AI researcher produces in a second of thinking.
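To see why per-unit parity would imply far more than full automation, here is a crude back-of-envelope sketch; the headcount, hours, and token-throughput numbers are placeholder assumptions for illustration, not estimates I'm defending:

```python
# Back-of-envelope: if value-per-token matched value-per-human-second,
# how lopsided would total research output be?
# All numbers below are hypothetical placeholders, not estimates.

researchers = 300                      # hypothetical lab headcount
thinking_seconds_per_day = 8 * 3600    # seconds of research thinking per researcher per day
human_seconds_per_day = researchers * thinking_seconds_per_day

tokens_per_day = 1e12                  # hypothetical research-relevant tokens the inference fleet emits per day

# At parity (1 token worth as much as 1 human-second of thinking), AI output
# would exceed human output by this factor:
ratio = tokens_per_day / human_seconds_per_day
print(f"AI : human research output at per-unit parity = {ratio:,.0f} : 1")
# With these placeholders, roughly 115,741 : 1, i.e. far beyond "fully automated",
# which is why the crossover point must happen at a much lower value per token.
```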
Here’s one piece of (weak) evidence from the current SOTA on SWE-bench:
‘Median token usage per patch: 2.6 million tokens
90th percentile token usage: 11.82 million tokens’
Some additional evidence: on ARC-AGI’s 100-task semi-private set, o3 used 5.7B tokens in total (roughly 57M per task) to achieve its 87.5% score, and scored 75.7% in low-compute mode using 33M tokens in total (roughly 330K per task):
https://arcprize.org/blog/oai-o3-pub-breakthrough
For the SOTA on SWE-bench Verified as of 16 December 2024: ‘it was around $5k for a total run.. around 8M tokens for a single swebench-problem.’
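Plugging these figures into the same kind of back-of-envelope calculation; the human time-per-task numbers are guesses I'm making up for illustration, not measurements:

```python
# Rough implied tokens-per-human-second from the figures quoted above.
# Token counts come from the cited sources; the human time estimates are guesses.

cases = {
    # name: (tokens per task, guessed human seconds to do the same task)
    "SWE-bench median patch": (2.6e6, 2 * 3600),      # guess: ~2 hours per issue
    "SWE-bench Verified SOTA": (8e6, 2 * 3600),       # same guess
    "ARC-AGI, o3 low compute": (33e6 / 100, 5 * 60),  # 33M tokens / 100 tasks; guess: ~5 minutes per puzzle
}

for name, (tokens, human_seconds) in cases.items():
    print(f"{name}: ~{tokens / human_seconds:,.0f} tokens per human-second of equivalent work")
# With these guesses, current systems spend on the order of hundreds to thousands
# of tokens for every second a human would plausibly need, so value per token is
# correspondingly far below value per human-second.
```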