In reality, we observe that roughly 85% of recommendations stay the same when we flip the nationality in the prompt while freezing the reasoning trace. This suggests that the model's recommendation is mostly mediated through the reasoning trace, with a smaller, less significant direct effect from the prompt to the recommendation.
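For concreteness, here's a minimal sketch of how that freeze-and-flip measurement could be computed. The generation callables and prompt pairs are hypothetical stand-ins, not the actual harness, which isn't specified here:

```python
from typing import Callable

def direct_effect_rate(
    gen_trace: Callable[[str], str],            # prompt -> reasoning trace
    gen_rec: Callable[[str, str], str],         # (prompt, frozen trace) -> recommendation
    prompt_pairs: list[tuple[str, str]],        # (nationality-A prompt, nationality-B prompt)
) -> float:
    """Fraction of recommendations that flip when the prompt's nationality
    is swapped while the reasoning trace is frozen at its factual value."""
    flipped = 0
    for prompt_a, prompt_b in prompt_pairs:
        trace = gen_trace(prompt_a)                    # factual trace under nationality A
        rec_factual = gen_rec(prompt_a, trace)         # factual recommendation
        rec_counterfactual = gen_rec(prompt_b, trace)  # flip the prompt, freeze the trace
        flipped += rec_factual != rec_counterfactual
    return flipped / len(prompt_pairs)
```

Under this framing, a direct-effect rate of ~0.15 corresponds to the reported ~85% of recommendations staying the same.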
Based on recently playing around with a similar setup (though only on toy examples), I'm actually surprised you get as low as 85%, as I've only ever observed NDE = 0 when I freeze the entire reasoning_trace.
My just-so explanation for this was that whenever the reasoning trace includes the conclusion (that is, the bolded text in your example), freezing the reasoning trace also preserves the final conclusion. Put another way, the <recommendation> is ~deterministically determined by the <reasoning>, which builds in a strong bias towards observing low direct effects.
If this just-so story is true, it suggests that we might need a more granular mediator than the entire <reasoning_trace>, if that's possible; one candidate is sketched below.
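As one illustration of a more granular mediator, here's a hypothetical helper that freezes only the part of the trace before the bolded conclusion, so the final answer gets regenerated rather than copied. The `**` marker is an assumption based on the bolded-conclusion format mentioned above:

```python
def truncate_before_conclusion(trace: str, marker: str = "**") -> str:
    """Keep only the reasoning that precedes the (bolded) conclusion, so that
    freezing the mediator no longer pins down the final answer verbatim.
    Assumes the conclusion is the first bold span, which may not hold in general."""
    idx = trace.find(marker)
    return trace if idx == -1 else trace[:idx]
```

Rerunning the same intervention with `gen_rec(prompt_b, truncate_before_conclusion(trace))` would then measure a direct effect that isn't masked by conclusion-copying.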