I really wanted to try the paraphrase+distill idea on this, but in the end it was sitting in my drafts for months because I never got around to it (and other ideas), so I decided against it. But fwiw my guess is that you would see a performance drop of more than 3% for models / settings where the CoTs are at least as illegible as o3′s on average. Certainly less than the drop I show here though. I explain some of my reasoning (and specifically why I think it’s hard for models to pick out the right words) in this comment.
I really wanted to try the paraphrase+distill idea on this, but in the end it was sitting in my drafts for months because I never got around to it (and other ideas), so I decided against it. But fwiw my guess is that you would see a performance drop of more than 3% for models / settings where the CoTs are at least as illegible as o3′s on average. Certainly less than the drop I show here though. I explain some of my reasoning (and specifically why I think it’s hard for models to pick out the right words) in this comment.