I believe this is mistaken (I’m the first author of the paper). For Claude 4 Sonnet, we used the reasoning tokens provided by the API. The other tested models do not expose reasoning tokens, so we instead had them output a chain of thought directly.
I think this could be better communicated in the paper: we only performed this step of analyzing the RL-trained reasoning in the Chain of Thought faithfulness section, since most models used in the paper do not provide reasoning tokens.
Thanks for the clarification. It seems I got a bit confused.