Thank you for publishing this, it is interesting! Especially the distinction between internal and external CoTs. Could you elaborate on how you extracted the internal CoTs? I checked your code here, and it looks like you took everything before </think> as the CoT. As I understand it, for the Claude series the “internal” CoTs might be summarized by a smaller model. From their system card (https://www.anthropic.com/news/claude-sonnet-4-5):
As with Claude Sonnet 4 and Claude Opus 4, thought processes from Claude Sonnet 4.5 are summarized by an additional, smaller model if they extend beyond a certain point (that is, after this point the “raw” thought process is no longer shown to the user).
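So your Fig. 2, with its 56.8 gap for Sonnet 4.5, might be misleading: part of what was scored as the internal CoT may be a summary rather than the raw thought process.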
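To check that I read the extraction correctly, here is a minimal sketch of what I assume it does. The function name `split_cot` and the literal `<think>` / `</think>` tags are my assumptions for illustration, not taken from your code:

```python
def split_cot(raw_output: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Split a raw completion into (internal CoT, external answer).

    Hypothetical helper: assumes the CoT is everything before `close_tag`.
    """
    cot, sep, answer = raw_output.partition(close_tag)
    if not sep:
        # No closing tag found: treat the whole output as the answer.
        return "", raw_output.strip()
    # Drop a leading opening tag if the model emitted one.
    cot = cot.strip().removeprefix("<think>").strip()
    return cot, answer.strip()


cot, answer = split_cot("<think>2+2=4</think>The answer is 4.")
```

If this matches what you did, then for Claude models the `cot` part would be whatever the API returned as the thought process, which may already be summarized past a certain length.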