One takeaway from this would be “CoT is more accurate at the limit of the model’s capabilities”. Given this, you could adopt a policy of always using the least capable model that can still handle each task, so that the CoT is more influential. Of course, this means you always get barely passable performance, which is unfortunate. Also, people often use CoT in cases where it intuitively wouldn’t help, such as common-sense NLP questions, where I’d expect influential CoT to be pretty unnatural.