o3 is reported to be fully parallelizable, capable of scaling up to thousands of dollars' worth of compute per task
If you’re referring to the ARC-AGI results, it was just pass@1024, for a nontrivial but not startling jump (75.7% to 87.5%). That’s in about the same ballpark as the paper, and we don’t actually know how much better its pass@1024 was than its pass@256.
The costs aren’t due to an astronomical k, but due to the model writing a 55k-token novel for each attempt at a high $X/million output-token price. (Apparently the revised estimate is $600/million??? It was $60/million initially.)
(FrontierMath was pass@1. Though maybe they used consensus@k instead (outputting the most frequent answer out of k, with only one “final answer” passed to the task-specific verifier) or something.)
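To make the pass@k vs. consensus@k distinction concrete, here’s a minimal sketch (hypothetical toy data, not anything from the actual eval) of how the same k sampled answers can score very differently under the two schemes:

```python
from collections import Counter

def pass_at_k(answers, correct):
    # pass@k: credit if ANY of the k sampled answers is correct.
    return any(a == correct for a in answers)

def consensus_at_k(answers, correct):
    # consensus@k: output only the most frequent answer, then pass that
    # single "final answer" to the verifier.
    majority, _ = Counter(answers).most_common(1)[0]
    return majority == correct

# Hypothetical samples: the right answer appears once, a wrong one dominates.
samples = ["7", "12", "12", "12", "9"]
print(pass_at_k(samples, "7"))       # True  — at least one sample hit
print(consensus_at_k(samples, "7"))  # False — the mode is "12"
```

So pass@k is strictly more generous: it rewards a single lucky sample, while consensus@k requires the model to be right more often than it is wrong in any particular way.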
If you’re referring to the ARC-AGI results, it was just pass@1024
It was sampling 1024 reasoning traces, but I don’t expect it was being scored as pass@1024. An appropriate thing to happen there would be best-of-k or majority voting, not running reasoning 1024 times and seeing if at least one result happens to be correct.
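A best-of-k selection, the other alternative named above, would look roughly like this: sample k traces, score each with some scorer (a reward model, a verifier — the scorer here is a pure assumption), and submit only the top-scoring answer, so only that one submission is ever graded:

```python
def best_of_k(traces, score):
    # best-of-k: rank the k candidate traces by a scorer and submit
    # only the top one; the grader never sees the other k-1 attempts.
    return max(traces, key=score)

# Hypothetical traces paired with toy scorer outputs.
traces = [("answer: 9", 0.2), ("answer: 7", 0.9), ("answer: 12", 0.5)]
best = best_of_k(traces, score=lambda t: t[1])
print(best[0])  # "answer: 7" — the single answer actually graded
```

Unlike pass@k, this never gets credit for an answer the selection mechanism failed to pick, which is why it (or majority voting) is the fairer way to report a 1024-sample run.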