Our counterfactual/resampling approach is quite similar to the main forking-paths analysis. We, however, examined sentences instead of tokens and specifically targeted reasoning models. These differences lead to some patterns that diverge from the forking-paths paper, which we think are interesting.
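To make the sentence-level contrast concrete, here is a minimal sketch of one way such a resampling loop can look. It is not our exact procedure: `generate(prefix)` and `final_answer(text)` are hypothetical stand-ins for the model call and the answer parser, and the importance measure (fraction of resamples that flip the final answer) is just one simple variant.

```python
from collections import Counter

def sentence_importance(question, cot_sentences, base_answer, generate, final_answer,
                        n_resamples=20):
    """Rough sentence-level resampling sketch (not our exact procedure).

    For each sentence position i, truncate the CoT just before sentence i,
    let the model regenerate from there, and record how often the final
    answer differs from the original run's answer. `generate` and
    `final_answer` are assumed helpers, not part of any real codebase.
    """
    importance = []
    for i in range(len(cot_sentences)):
        # Keep everything before sentence i, then regenerate freely from there.
        prefix = question + " " + " ".join(cot_sentences[:i])
        answers = Counter(final_answer(generate(prefix)) for _ in range(n_resamples))
        # Importance proxy: fraction of resamples whose final answer changes.
        changed = sum(n for ans, n in answers.items() if ans != base_answer)
        importance.append(changed / n_resamples)
    return importance
```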
Reasoning models produce much longer outputs, even compared to base models told to “Think step by step”; on challenging MMLU questions, R1-Llama-3.3-70B produces CoTs with roughly 10x more sentences than base Llama-3.3-70B prompted to think step by step. These longer outputs involve mechanisms like error correction. I suspect that a base model that makes a mistake is less likely to subsequently backtrack than a reasoning model, which backtracks very often (and, for an obvious mistake, almost always does). This kind of error correction presumably has a large effect on which tokens/sentences actually influence the final answer. In addition, I suspect reasoning models might be more likely to structure their CoT with plans/hierarchies. I do not have evidence for this yet, but it is worth testing directly by comparing reasoning and base models.
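A very cheap first pass at the backtracking comparison could look like the sketch below. It assumes that surface markers such as “wait” or “let me reconsider” are a reasonable proxy for backtracking; the marker list is illustrative and not from our paper, and a real comparison would want a proper classifier.

```python
import re

# Illustrative surface markers for backtracking; only a quick proxy, not a
# validated taxonomy.
BACKTRACK_MARKERS = re.compile(
    r"\b(wait|hold on|actually|let me reconsider|that's wrong|i made a mistake)\b",
    re.IGNORECASE,
)

def backtrack_rate(cot_sentences):
    """Fraction of sentences in one CoT containing an explicit backtracking marker."""
    if not cot_sentences:
        return 0.0
    hits = sum(bool(BACKTRACK_MARKERS.search(s)) for s in cot_sentences)
    return hits / len(cot_sentences)

# Usage idea: compare the mean backtrack_rate over a matched question set for
# the reasoning model vs. the base model prompted to think step by step.
```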
Our analyses, focusing on sentences and reasoning CoT, also allow for categorization and clustering approaches. For the work here, we labeled sentences with categories such as “plan generation,” and we were able to show that these kinds of sentences tend to have a large causal effect on the final outcome. Our paper discusses how plan-generation sentences often lead to active-computation (e.g., arithmetic) sentences, and our appendix includes some preliminary results on this in the form of a transition matrix between categories. For future work, we are exploring clustering approaches rather than predefined categories. This seems much more tractable with sentences than with tokens.
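The transition-matrix computation itself is straightforward; here is a minimal sketch, assuming each CoT has already been converted to a list of per-sentence category labels. The category names in the example are illustrative, not our full taxonomy.

```python
import numpy as np

def category_transition_matrix(labeled_cots, categories):
    """Row-normalised matrix of P(next category | current category), pooled over CoTs.

    `labeled_cots` is a list of label sequences (one per CoT);
    `categories` fixes the row/column order.
    """
    idx = {c: k for k, c in enumerate(categories)}
    counts = np.zeros((len(categories), len(categories)))
    for labels in labeled_cots:
        for cur, nxt in zip(labels, labels[1:]):
            counts[idx[cur], idx[nxt]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Avoid division by zero for categories that never appear as a source.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Example with illustrative category labels:
cats = ["plan_generation", "active_computation", "final_answer"]
cots = [["plan_generation", "active_computation", "active_computation", "final_answer"]]
print(category_transition_matrix(cots, cats))
```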