Apparently, similar work on prompt repetition came out shortly before I published this post (but after I ran my experiments) and reproduces the effect. It looks like that paper doesn’t test the “no output / CoT prior to answer” setting; instead, it tests the setting where you use non-reasoning LLMs (e.g. Opus 4.5 without extended thinking enabled).