Thanks, this is an interesting result. If you’re planning on doing additional experiments here, I think the ones I’d find most valuable are:
Studying how this varies with model scale.
Studying whether this effect occurs more/less when the model is told (e.g. in a system prompt) that it might be penalized for admitting information in its final output but not in its CoT.
The reason I’d find these follow-ups informative is that I expect spill-over to be weaker for models that more strongly distinguish between the text in the CoT vs. final response.
In other words, if a model doesn’t understand these two parts of its response as different in any way, then we should strongly expect spillover (since it’s all just “the response”). I suspect this is the case for the model you’re working with here (Qwen3-4B) since it seems quite small.
Thanks, this is an interesting result. If you’re planning on doing additional experiments here, I think the ones I’d find most valuable are:
Studying how this varies with model scale.
Studying whether this effect occurs more/less when the model is told (e.g. in a system prompt) that it might be penalized for admitting information in its final output but not in its CoT.
The reason I’d find these follow-ups informative is that I expect spill-over to be weaker for models that more strongly distinguish between the text in the CoT vs. final response.
In other words, if a model doesn’t understand these two parts of its response as different in any way, then we should strongly expect spillover (since it’s all just “the response”). I suspect this is the case for the model you’re working with here (Qwen3-4B) since it seems quite small.