That proposal seems like it’d mitigate the problem somewhat, but even then I think there are nonobvious paths to end up with significant pressure on the CoT. For example, I frequently will ask a question, then watch the CoT summaries to see how the model is approaching the problem, and stop the response and edit my prompt and retry if it’s going down a path I don’t like. If someone has the bright idea to penalize trajectories which resulted in the user cancelling the request partway through, you’ve now got pressure on the CoT.
I think there are a bunch of weird side channels like this, to the point where I’m not sure that companies know how to pay the associated safety tax even if they want to.
Having trouble getting Opus 4.7 to guess who I am from a few paragraphs of writing, even to the point of my name being in a top-10-guesses list. But I was able to get GPT 4.5 to do this a year ago so that capability might vary author-to-author and model-to-model.