Thanks @Bronson Schoen, this is a great point. The results section (6.1) does validate your concern: on some tasks (like GPQA with hints, where a ‘hint’ in the prompt gives away the answer), the chain of thought shortened and monitorability dropped over training. The paper emphasizes the average, which seems to mask degradation on tasks where reasoning isn’t strictly necessary. I’ve updated the summary to be more measured and added a footnote. It now reads:
Second, the paper tracked two frontier runs (GPT-5.1 and o3) and concluded that the training process doesn’t materially degrade monitorability, at least not at current scales. But it also found that on some tasks, where the models could get away with less reasoning, the chain of thought got shorter over the course of training and monitorability dropped. This is what you’d expect to see first: degradation on tasks where reasoning is optional, a set that only grows.[5]
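To make the averaging point concrete, here’s a toy sketch (all numbers are made up for illustration, not the paper’s data) of how a near-flat mean can hide a steep drop on the reasoning-optional tasks:

```python
# Toy illustration (hypothetical numbers, not data from the paper): an
# average monitorability score can stay near-flat while a subset of
# tasks degrades across training checkpoints.

checkpoints = [0, 1, 2, 3, 4]  # successive training checkpoints

# Hypothetical per-checkpoint monitorability scores in [0, 1].
# "gpqa_hinted" stands in for tasks where reasoning is optional
# (the prompt's hint gives away the answer), so the CoT can shrink.
monitorability = {
    "math_hard":   [0.90, 0.91, 0.90, 0.92, 0.91],  # reasoning required: stable
    "code_hard":   [0.88, 0.89, 0.88, 0.87, 0.88],  # reasoning required: stable
    "gpqa_hinted": [0.85, 0.78, 0.70, 0.61, 0.52],  # reasoning optional: degrades
}

# The headline average moves only ~0.11 over training...
for i, ckpt in enumerate(checkpoints):
    avg = sum(scores[i] for scores in monitorability.values()) / len(monitorability)
    print(f"checkpoint {ckpt}: average monitorability = {avg:.2f}")

# ...while the reasoning-optional task alone loses ~0.33.
for task, scores in monitorability.items():
    print(f"{task}: {scores[0]:.2f} -> {scores[-1]:.2f} (delta {scores[-1] - scores[0]:+.2f})")
```

The mechanism is just aggregation: as long as most evaluated tasks still require long reasoning, the mean stays high even if every reasoning-optional task is sliding.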
I’m not sure, but want to find out! I find these model introspection papers fascinating, and want to write about this next.