Curated. Like your earlier post on filler tokens, my guess is that this is a pretty meaningful measurement of something plausibly serving as an input to actually dangerous capabilities. It’s a very obvious experiment to run in hindsight, and it’s not even like the question hasn’t been discussed before, but that doesn’t count for anything until someone actually runs the experiment. The experiment design seems reasonable to me, and if I had to guess I’d say the results are pretty suggestive of model size (though Opus 4.5 doing better than 4 is perhaps a little surprising there).
I do wish that the estimated completion times didn’t themselves come from LLM estimates. I more or less trust Opus 4.5 to generate an unbiased ordering of problem difficulty, but I’m not sure about relative (let alone absolute) differences in completion time. This isn’t that much of a defeater; even the ground-truth measurement of actual human completion times would still just be a lossy proxy for the actual thing we care about.
Curated. Like your earlier post on filler tokens, my guess is that this is a pretty meaningful measurement of something plausibly serving as an input to actually dangerous capabilities. It’s a very obvious experiment to run in hindsight, and it’s not even like the question hasn’t been discussed before, but that doesn’t count for anything until someone actually runs the experiment. The experiment design seems reasonable to me, and if I had to guess I’d say the results are pretty suggestive of model size (though Opus 4.5 doing better than 4 is perhaps a little surprising there).
I do wish that the estimated completion times didn’t themselves come from LLM estimates. I more or less trust Opus 4.5 to generate an unbiased ordering of problem difficulty, but I’m not sure about relative (let alone absolute) differences in completion time. This isn’t that much of a defeater; even the ground-truth measurement of actual human completion times would still just be a lossy proxy for the actual thing we care about.
Anyways, great work.