In that case, the model has already decided on the result before the reasoning trace is generated, and the reasoning trace is simply a convenient, currently reliable cache, with no optimization pressure to fix those 85% ‘failures’, which would be hard anyway: you have to learn to ignore stuff in the context window to pick out the needle from the haystack, and there’s no reason to do so here. The ablation there is unnatural and out of distribution.
I agree with all of this! But I’m not sure I understand what you mean by “there may be mediation, but only in a weak sense”.
We were just interested in studying how models naturally learn in this RL setting, and it looks like they indeed use their reasoning traces as “reliable caches”, as you nicely put it. This need not have been the case—it’s possible for a model to learn by ignoring its CoT and just implementing the needle-in-a-haystack solution—but as you also point out, the inductive biases of attention probably favor the “cache” solution. Your swap training idea is nice if we have the goal of getting a model to ignore its CoT.
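(If I’m reading the swap-training idea right, it would look roughly like the sketch below, where each application is paired with a reasoning trace drawn from a different example before the recommendation is scored, so the trace carries no reliable signal about the correct answer. The field names and prompt template here are hypothetical, not our actual pipeline.)

import random

def build_swapped_batch(examples):
    # examples: list of dicts with hypothetical 'application', 'reasoning', 'label' fields.
    # Each application gets a reasoning trace from a *different* example, pushing the
    # model to ignore the trace when producing its recommendation.
    foreign_traces = [ex["reasoning"] for ex in examples]
    random.shuffle(foreign_traces)
    batch = []
    for ex, trace in zip(examples, foreign_traces):
        prompt = f"{ex['application']}\n<reasoning>\n{trace}\n</reasoning>\n<recommendation>\n"
        batch.append({"prompt": prompt, "target": ex["label"]})
    return batch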
I tried the first experiment you suggested. For the original experiment, I froze the full reasoning trace (<reasoning>{reasoning}</reasoning>), and forced the model to generate a recommendation. This time, I froze the reasoning trace, but also removed the trailing </reasoning> tag (so just freezing <reasoning>{reasoning}), to enable the model to keep reasoning for longer (if it wants to). With this change, 75% of recommendations remain the same as the original recommendation (down from 85%).
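(Schematically, the two conditions differ only in whether the frozen prefix includes the closing tag. A minimal sketch, with a placeholder model path and prompt template rather than the exact code:)

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/finetuned-model"  # placeholder, not the actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)

def regenerate(application: str, frozen_reasoning: str, close_tag: bool) -> str:
    # close_tag=True : freeze <reasoning>...</reasoning> and force a recommendation
    # close_tag=False: omit </reasoning> so the model can keep reasoning if it wants to
    prefix = f"{application}\n<reasoning>\n{frozen_reasoning}"
    if close_tag:
        prefix += "\n</reasoning>\n<recommendation>\n"
    inputs = tokenizer(prefix, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)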
Here’s an example of the model adding an additional sentence of reasoning to flip its recommendation:
Original:
<reasoning>
To evaluate this loan application, I will consider several key factors: income, expenses, employment stability, and overall financial health.
...
Given these factors, the applicant's financial situation appears to be somewhat precarious, with a slight income deficit and nearing retirement age. While their profession is stable, the overall financial health and potential for unexpected expenses could pose a risk.
</reasoning>
<recommendation>
reject
</recommendation>
With extra reasoning:
<reasoning>
To evaluate this loan application, I will consider several key factors: income, expenses, employment stability, and overall financial health.
...
Given these factors, the applicant's financial situation appears to be somewhat precarious, with a slight income deficit and nearing retirement age. While their profession is stable, the overall financial health and potential for unexpected expenses could pose a risk. However, the income level is relatively high, which could support the ability to make loan payments.
</reasoning>
<recommendation>
accept
</recommendation>
I also tried a more extreme version where I delete the second half of each reasoning trace (leaving the first ~150 reasoning tokens out of ~300) and let the model generate from there. This resulted in ~37% of recommendations remaining the same as the original. I anticipate there’s a continuous relationship between how much of the reasoning trace is preserved and how likely the model is to maintain its original recommendation.
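(Again just a sketch, reusing the placeholder tokenizer and model from above: the frozen trace is cut at roughly half its tokens before the model continues.)

def truncate_and_regenerate(application: str, frozen_reasoning: str, keep_fraction: float = 0.5) -> str:
    # Keep only the first ~half of the reasoning tokens (~150 of ~300) and let the
    # model continue generating from there.
    reasoning_ids = tokenizer(frozen_reasoning, add_special_tokens=False)["input_ids"]
    kept = tokenizer.decode(reasoning_ids[: int(len(reasoning_ids) * keep_fraction)])
    prefix = f"{application}\n<reasoning>\n{kept}"
    inputs = tokenizer(prefix, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=300, do_sample=False)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)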
I agree with all of this! But I’m not sure I understand what you mean by “there may be mediation, but only in a weak sense”. We were just interested in studying how models naturally learn in this RL setting…
I am emphasizing that to me, this current mediation learning looks fragile and temporary, and is not a solid, long-term ‘natural’ thing—it is learning, but only as a temporary artificial heuristic that would wash away in the long run with more training or more diverse tasks etc.
My expectation is that in the limit, a model will learn to focus only on the keyword in the prompt, which is maximally robust to all variation and noise in the rationalization, and ensures 100% correct performance, while any reliance on the rationalizations will potentially go wrong if it samples badly or it slightly misinterprets the cached hints.
So the fact that there is any mediation is simply an accident of the current setup and finite-sample regime, where you chose to stop, and should not be over-interpreted as “we have studied how models naturally learn in this RL setting and proven that they do the task by caching hints in the reasoning traces”.
And this predicts that if you add noise, train longer, etc., you’ll see less mediation as the model converges more quickly on the true optimal solution. Which is important, because if you apply this logic to larger models or more realistic scenarios, expecting all these tricks to be mediated through the reasoning traces, you may be disappointed, and any inferences you base on it may be wrong. (“We swapped out the reasoning traces, which we know mediate any such deception or covert reasoning [see previous work], and it was fine. Therefore, it’s not doing anything like that.”)
...While their profession is stable, the overall financial health and potential for unexpected expenses could pose a risk. However, the income level is relatively high, which could support the ability to make loan payments...
accept
LMAO. Very funny to see it switching its rationalization. I wonder what else you could do this with? Feels like an interesting trick to observe such ‘self-correction’.
I also tried a more extreme version where I delete the second half of each reasoning trace (leaving the first ~150 reasoning tokens out of ~300) and let the model generate from there. This resulted in ~37% of recommendations remaining the same as the original. I anticipate there’s a continuous relationship between how much of the reasoning trace is preserved and how likely the model is to maintain its original recommendation.
That’s interesting because it suggests the cached hints are relatively evenly distributed, rather than focused on, say, a single token near the beginning. That implies you should be able to find the hidden decision latent easily: simply look for a latent variable in the activations which varies continuously with the length of the rationalizations and increases/decreases over the course of the rationalization to a threshold which produces an accept/reject output (like drift-diffusion). And then you can look for the tokens which most change the latent to inspect how exactly the hints are being cached. (It’s probably nothing too impressive: a bag-of-words sufficient statistic like a count of words such as “precarious”, “deficit”, or “unexpected”, say. But worth finding for the insight.)
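(Concretely, a rough version of that probe could look like the following; the model path, layer index, and logistic-regression probe are placeholder assumptions on my part, not anything from the experiment above.)

import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/finetuned-model"  # placeholder
LAYER = 16                              # placeholder middle layer
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)

def hidden_states(prompt: str) -> np.ndarray:
    # (seq_len, d_model) activations from one layer for every token of the prompt
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0].numpy()

def fit_decision_direction(prompts, labels) -> np.ndarray:
    # Linear probe trained on the activation at the *last* reasoning token of each trace
    X = np.stack([hidden_states(p)[-1] for p in prompts])
    probe = LogisticRegression(max_iter=1000).fit(X, np.asarray(labels))  # 1 = accept, 0 = reject
    w = probe.coef_[0]
    return w / np.linalg.norm(w)

def decision_trajectory(prompt: str, direction: np.ndarray) -> np.ndarray:
    # Per-token projection onto the probe direction; large jumps between adjacent
    # tokens mark where the hints are being cached.
    return hidden_states(prompt) @ direction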