If you’re referring to the ARC-AGI results, it was just pass@1024.
It sampled 1024 reasoning traces, but I don’t expect it was scored as pass@1024. The appropriate approach there would be best-of-k or majority voting over the samples, not running the reasoning 1024 times and checking whether at least one result happens to be correct.
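To make the distinction concrete, here is a minimal sketch of how the two scoring rules can disagree on the same set of samples. The function names and the toy answer distribution are made up for illustration; this is not how any particular ARC-AGI evaluation was actually scored.

```python
from collections import Counter

def pass_at_k(samples, correct):
    # pass@k: counts as solved if ANY of the k samples is correct.
    # Needs the ground truth to decide, so it is an upper bound, not a submission rule.
    return any(s == correct for s in samples)

def majority_vote(samples):
    # Majority voting: submit the single most common answer among the samples.
    # No ground truth is needed to pick the answer.
    return Counter(samples).most_common(1)[0][0]

def majority_vote_correct(samples, correct):
    # The task counts as solved only if the voted answer is the correct one.
    return majority_vote(samples) == correct

# Hypothetical toy data: 1024 sampled answers, only one of which is right.
samples = ["A"] * 600 + ["B"] * 423 + ["C"]
correct = "C"

print(pass_at_k(samples, correct))              # at least one sample matches
print(majority_vote(samples))                   # the answer that would be submitted
print(majority_vote_correct(samples, correct))  # whether that submission is right
```

Here pass@1024 would count the task as solved, while majority voting would submit "A" and get it wrong. Best-of-k sits on the same side as majority voting: it also commits to one answer, but picks it with some scorer or verifier instead of a vote.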