One interesting piece of potential evidence in favor of this is that METR's time-horizon measurements didn't differ significantly for ChatGPT and Claude models when run with a basic scaffold rather than the dedicated Claude Code and Codex harnesses.
https://metr.org/notes/2026-02-13-measuring-time-horizon-using-claude-code-and-codex/