I looked at how this paper actually measures the relationship between reasoning length and performance, and there’s a potential confounding issue worth noting:
The authors prompt models to use specific reasoning budgets (like “think for 4,096 tokens”) and then measure performance against the actual tokens used. Within each budget, some responses end up longer than others. The problem: if a model gets confused on a question, it may naturally reason longer AND get the answer wrong, even under the same token-budget constraint.
So we might be seeing “confusion causes both length and errors” rather than “length causes errors.” The paper doesn’t really address this: it controls for question difficulty but not for whether the model got confused or stuck on a particular instance.
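To make the worry concrete, here’s a toy simulation (hypothetical numbers, not data from the paper) where a latent per-instance “confusion” variable drives both reasoning length and errors, while length itself has zero causal effect on accuracy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
budget = 4096  # one fixed budget, as in the paper's prompting setup

# Latent per-instance "confusion". In this toy world, reasoning length has
# NO causal effect on accuracy; confusion alone drives both variables.
confusion = rng.uniform(0, 1, n)

# Confused instances reason longer (within the budget) and err more often.
tokens = np.clip(rng.normal(1500 + 2000 * confusion, 300, n), 0, budget)
correct = (rng.uniform(0, 1, n) > confusion).astype(float)

# The observed within-budget length/accuracy correlation is strongly
# negative even though, by construction, length causes nothing.
print(np.corrcoef(tokens, correct)[0, 1])  # ~ -0.5
```

Note that in this sketch a difficulty control wouldn’t help: confusion varies per instance, not per question, so the negative length-accuracy slope would survive exactly the kind of control the paper applies.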