Thanks for the shoutout to my post about vestigial reasoning!
What I originally meant by reasoning tokens being “vestigial” is that they are not at all useful to the trained model. Maybe the tokens were useful earlier in training and not anymore, or maybe they were never useful and got reinforced due to incidental correlations. My concept of “vestigial reasoning” is different from what you write about here—the idea of “vestigially useful” tokens doesn’t really make sense in my ontology, since “vestigial” implies “useless.”
Well, on second thought… there is a very loose sense in which reasoning tokens could be “vestigial” while still being “useful.” See the part of my vestigial reasoning post where I got Andy to reprompt the biased loan-application model to write its answer immediately. There was a relatively small drop in performance (from 99.5% to 81.3%). I think this drop was simply because the context was OOD, which confused the model. Similarly, if the model is “used to” seeing certain utterly meaningless reasoning tokens, removing those tokens could confuse it. In that sense, the tokens are useful for “avoiding confusion.”
Okay, so maybe this doesn’t sound so different from what you said about vestigial reasoning tokens “triggering forward passes where useful reasoning happens.” Maybe they’re sort of “triggering” the model to use the information in its context to write the correct answer. But I think it would be more accurate to say that the absence of vestigial reasoning tokens causes the model to abort its normal correct-answer-writing behavior and get confused. I guess the difference is that your “trigger” phrasing seems to imply that the vestigial reasoning tokens convey some meaningful information to the model that it “uses” to write the correct answer, which is not really how I conceive of it.
Based on your results, you’ve already shown that the illegible reasoning isn’t vestigial in the very strong sense, where removing the tokens has no effect on the model’s performance. However, it’s still possible that the illegible parts of the CoT are vestigial in the sense that no meaningful cognition happens on those tokens and they don’t provide information that “triggers” future cognition.
I think a pretty good way to test whether the CoT is vestigial would be to fine-tune the model like this:
1. Sample some CoTs + answers from the original model.
2. Remove any illegible tokens from the CoTs.
    - Or maybe replace them with dots, to account for the possibility that the illegible CoT was used for “filler tokens.”
3. Fine-tune the model on the modified CoTs + the original answers.
Hypothesis: this training process will help the model “get used to” the illegible CoTs being gone. If the illegible CoT didn’t convey any meaningful information, then the model should learn to perform just about as well as before.
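To make the setup a bit more concrete, here is a rough Python sketch of the data-preparation step. Everything here is illustrative rather than a tested implementation: the `Sample` container, `is_legible_span`, `strip_illegible`, and `build_finetune_dataset` are hypothetical names I’m introducing, the legibility judge is left unimplemented (in practice it might be a classifier or an LLM grader), and the actual fine-tuning would go through whatever SFT harness you already use.

```python
# Sketch: build fine-tuning data with illegible CoT spans removed or replaced by dots.
# Assumptions (not from the original post): samples are (prompt, cot, answer) triples,
# and is_legible_span is some external judge of legibility (e.g. a classifier/LLM call).

from dataclasses import dataclass


@dataclass
class Sample:
    prompt: str
    cot: str      # chain-of-thought sampled from the original model
    answer: str   # the original model's final answer


def is_legible_span(span: str) -> bool:
    """Placeholder legibility judge; in practice this might be a classifier or LLM grader."""
    raise NotImplementedError


def strip_illegible(cot: str, replace_with_dots: bool = False) -> str:
    """Remove illegible spans, or replace each with dots to roughly preserve length
    (to control for the 'filler token' possibility)."""
    out = []
    for span in cot.split("\n"):  # crude span unit; sentences would also work
        if is_legible_span(span):
            out.append(span)
        elif replace_with_dots:
            out.append("." * max(1, len(span.split())))  # roughly one dot per word
    return "\n".join(out)


def build_finetune_dataset(samples: list[Sample], replace_with_dots: bool = False):
    """Pair each prompt with the modified CoT and the *original* answer, so the model
    is trained to reach its old answer without the illegible tokens present."""
    return [
        {
            "prompt": s.prompt,
            "completion": strip_illegible(s.cot, replace_with_dots) + "\n" + s.answer,
        }
        for s in samples
    ]


# The resulting dataset would then go into a standard SFT run; after fine-tuning,
# compare accuracy against the original model to see whether performance recovers.
```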