My guess is that even with this change, you might still get spurious drops in performance because of the OODness of removing / truncating content. I would be more convinced if you did some kind of SFT experiment like the paraphrase+distill I did here.
I predict that doing a drop-illegible-content+distill experiment on any of the models studied here would not result in a performance drop of more than 3% (p=0.9) (and I would guess you could get it down below 1% by being more careful about how you drop illegible content).
This also means that I think the paper's claim that “the illegible portions of the model’s reasoning are useful to the model” is incorrect under what I think is the most natural interpretation of “useful” (it seems natural not to count “useful for remaining IID” as actually useful).
I really wanted to try the paraphrase+distill idea on this, but it sat in my drafts for months because I never got around to it (along with other ideas), so I decided against it. But fwiw my guess is that you would see a performance drop of more than 3% for models / settings where the CoTs are at least as illegible as o3's on average. Certainly less than the drop I show here, though. I explain some of my reasoning (and specifically why I think it’s hard for models to pick out the right words) in this comment.