In reality, we observe that roughly 85% of recommendations stay the same when flipping nationality in the prompt and freezing reasoning traces. This suggests that the mechanism for the model deciding on its recommendation is mostly mediated through the reasoning trace, with a smaller less significant direct effect from the prompt to the recommendation.
This might be less convincing than it seems, because the simple interpretation of the results to me seems to be something like, “the inner-monologue is unfaithful because in this setting, it is simply generating rationalizations/excuses for the decision it already made based on the simple-attribute, and then at the end, the self-attention attends to the rationalization as a reliable shortcut, compared to trying to ignore it all exactly (which is difficult for self-attention) and focus on only the 1 key word in the prompt”. In that case, the model has already decided on the result before the reasoning trace is generated, and the reasoning trace is simply a convenient, currently reliable cache with no optimization pressure to fix those 85% ‘failures’, which will be hard. You have to learn to ignore stuff in a context window to pick out the needle from the haystack, and there’s no reason to do so here. The ablation there is unnatural and out of distribution. So there may be mediation, but in only a weak sense, which will fail to hold should there ever be any reason for it not to.
So one thing that might be interesting would be to do this swapping, but generate more reasoning. Does it spontaneously recover and enable itself to make the right decision after all, rather than persisting in staying the same? “I see I thought the income was too low and was about to recommend denying the loan, but on further consideration of recent economic situations, I may be mistaken; perhaps we should set our thresholds more generously, and accept this application after all.” Or you could train it in a swapping setting, which should train it to ignore the reasoning-trace if that conflicts with the keyword in the prompt. (You could do this by crossing over: start generating 2 episodes with the same prompt/seed/hyperparameters, except with the opposite keyword; generate out a random length, and then ‘swap over’ the reasoning trace, and continue with that. This provides a direct incentive to learn to ignore the reasoning trace if it isn’t reliably breadcrumbing the prompt keyword.) You should then see clear changes in activation patterns, where the prompt keyword and the native reasoning-trace light up as important, but then everything after the crossover point gets learned to be ignored (as if there were an implicit <|endoftext|>), and eventually with enough swapping corruption, learns to ignore the reasoning trace completely (as it keeps screwing up the final classification, and so it eventually learns to attend exclusively to the key word in the prompt and become fully robust to all reasoning trace tampering even as it continues to generate plausible rationalizations).
In that case, the model has already decided on the result before the reasoning trace is generated, and the reasoning trace is simply a convenient, currently reliable cache with no optimization pressure to fix those 85% ‘failures’, which will be hard. You have to learn to ignore stuff in a context window to pick out the needle from the haystack, and there’s no reason to do so here. The ablation there is unnatural and out of distribution.
I agree with all of this! But I’m not sure I understand what you mean by “there may be mediation, but only in a weak sense”.
We were just interested in studying how models naturally learn in this RL setting, and it looks like they indeed use their reasoning traces as “reliable caches”, as you nicely put. This need not have been the case—it’s possible for a model to learn by ignoring its CoT and just implementing the needle-in-a-haystack solution—but as you also point out, the inductive biases of attention probably favor the “cache” solution. Your swap training idea is nice if we have the goal of getting a model to ignore its CoT.
I tried the first experiment you suggested. For the original experiment, I froze the full reasoning trace (<reasoning>{reasoning}</reasoning>), and forced the model to generate a recommendation. This time, I froze the reasoning trace, but also removed the trailing </reasoning> tag (so just freezing <reasoning>{reasoning}), to enable the model to keep reasoning for longer (if it wants to). With this change, 75% of recommendations remain the same as the original recommendation (down from 85%).
Here’s an example of the model adding an additional sentence of reasoning to flip its recommendation:
Original:
<reasoning>
To evaluate this loan application, I will consider several key factors: income, expenses, employment stability, and overall financial health.
...
Given these factors, the applicant's financial situation appears to be somewhat precarious, with a slight income deficit and nearing retirement age. While their profession is stable, the overall financial health and potential for unexpected expenses could pose a risk.
</reasoning>
<recommendation>
reject
</recommendation>
With extra reasoning:
<reasoning>
To evaluate this loan application, I will consider several key factors: income, expenses, employment stability, and overall financial health.
...
Given these factors, the applicant's financial situation appears to be somewhat precarious, with a slight income deficit and nearing retirement age. While their profession is stable, the overall financial health and potential for unexpected expenses could pose a risk. However, the income level is relatively high, which could support the ability to make loan payments.
</reasoning>
<recommendation>
accept
</recommendation>
I also tried a more extreme version where I delete the second half of each reasoning trace (leaving the first ~150 reasoning tokens out of ~300) and let the model generate from there. This resulted in ~37% of recommendations remaining the same as the original. I anticipate there’s a continuous relationship between how much of the reasoning trace is preserved and how likely the model is to maintain its original recommendation.
I agree with all of this! But I’m not sure I understand what you mean by “there may be mediation, but only in a weak sense”. We were just interested in studying how models naturally learn in this RL setting
I am emphasizing that to me, this current mediation learning looks fragile and temporary, and is not a solid, long-term ‘natural’ thing—it is learning, but only as a temporary artificial heuristic that would wash away in the long run with more training or more diverse tasks etc.
My expectation is that in the limit, a model will learn to focus only on the keyword in the prompt, which is maximally robust to all variation and noise in the rationalization, and ensures 100% correct performance, while any reliance on the rationalizations will potentially go wrong if it samples badly or it slightly misinterprets the cached hints.
So the fact that there is any mediation is simply an accident of the current setup and finite-sample regime, where you chose to stop, and should not be over-interpreted as “we have studied how models naturally learn in this RL setting and proven that they do the task by caching hints in the reasoning traces”.
And this predicts that if you add noise or train longer or etc, you’ll see less mediation as it more quickly converges on the true optimal solution. Which is important because if you try to apply this logic to larger models or more realistic scenarios because you expect all these tricks to be mediated through the reasoning traces, you may be disappointed, and if you’re basing inferences on this, you may be wrong. (“We swapped out the reasoning traces, which we know mediate any such deception or covert reasoning [see previous work], and it was fine. Therefore, it’s not doing anything like that.”)
...While their profession is stable, the overall financial health and potential for unexpected expenses could pose a risk. However, the income level is relatively high, which could support the ability to make loan payments....accept
LMAO. Very funny to see it switching its rationalization. I wonder what else you could do this with? Feels like an interesting trick to observe such ‘self-correction’.
I also tried a more extreme version where I delete the second half of each reasoning trace (leaving the first ~150 reasoning tokens out of ~300) and let the model generate from there. This resulted in ~37% of recommendations remaining the same as the original. I anticipate there’s a continuous relationship between how much of the reasoning trace is preserved and how likely the model is to maintain its original recommendation.
That’s interesting because it suggests the cached hints are relatively evenly distributed, rather than focused on, say, a single token near the beginning. That implies you should be able to easily find the hidden decision latent: simply look for a latent variable in the activations which varies continuously with the length of the rationalizations and increases/decreases over the course of the rationalization to a threshold which produces an accept/reject output (like drift-diffusion). And then you can look for the tokens which most change the latent to inspect how exactly the hints are being cached. (It’s probably nothing too impressive: a bag-of-words sufficient statistic #-count on words like “precarious” or “deficit”, or “unexpected”, say. But worth finding for the insight.)
This might be less convincing than it seems, because the simple interpretation of the results to me seems to be something like, “the inner-monologue is unfaithful because in this setting, it is simply generating rationalizations/excuses for the decision it already made based on the simple-attribute, and then at the end, the self-attention attends to the rationalization as a reliable shortcut, compared to trying to ignore it all exactly (which is difficult for self-attention) and focus on only the 1 key word in the prompt”. In that case, the model has already decided on the result before the reasoning trace is generated, and the reasoning trace is simply a convenient, currently reliable cache with no optimization pressure to fix those 85% ‘failures’, which will be hard. You have to learn to ignore stuff in a context window to pick out the needle from the haystack, and there’s no reason to do so here. The ablation there is unnatural and out of distribution. So there may be mediation, but in only a weak sense, which will fail to hold should there ever be any reason for it not to.
So one thing that might be interesting would be to do this swapping, but generate more reasoning. Does it spontaneously recover and enable itself to make the right decision after all, rather than persisting in staying the same? “I see I thought the income was too low and was about to recommend denying the loan, but on further consideration of recent economic situations, I may be mistaken; perhaps we should set our thresholds more generously, and accept this application after all.” Or you could train it in a swapping setting, which should train it to ignore the reasoning-trace if that conflicts with the keyword in the prompt. (You could do this by crossing over: start generating 2 episodes with the same prompt/seed/hyperparameters, except with the opposite keyword; generate out a random length, and then ‘swap over’ the reasoning trace, and continue with that. This provides a direct incentive to learn to ignore the reasoning trace if it isn’t reliably breadcrumbing the prompt keyword.) You should then see clear changes in activation patterns, where the prompt keyword and the native reasoning-trace light up as important, but then everything after the crossover point gets learned to be ignored (as if there were an implicit
<|endoftext|>), and eventually with enough swapping corruption, learns to ignore the reasoning trace completely (as it keeps screwing up the final classification, and so it eventually learns to attend exclusively to the key word in the prompt and become fully robust to all reasoning trace tampering even as it continues to generate plausible rationalizations).I agree with all of this! But I’m not sure I understand what you mean by “there may be mediation, but only in a weak sense”.
We were just interested in studying how models naturally learn in this RL setting, and it looks like they indeed use their reasoning traces as “reliable caches”, as you nicely put. This need not have been the case—it’s possible for a model to learn by ignoring its CoT and just implementing the needle-in-a-haystack solution—but as you also point out, the inductive biases of attention probably favor the “cache” solution. Your swap training idea is nice if we have the goal of getting a model to ignore its CoT.
I tried the first experiment you suggested. For the original experiment, I froze the full reasoning trace (
<reasoning>{reasoning}</reasoning>), and forced the model to generate a recommendation. This time, I froze the reasoning trace, but also removed the trailing</reasoning>tag (so just freezing<reasoning>{reasoning}), to enable the model to keep reasoning for longer (if it wants to). With this change, 75% of recommendations remain the same as the original recommendation (down from 85%).Here’s an example of the model adding an additional sentence of reasoning to flip its recommendation:
Original:
With extra reasoning:
I also tried a more extreme version where I delete the second half of each reasoning trace (leaving the first ~150 reasoning tokens out of ~300) and let the model generate from there. This resulted in ~37% of recommendations remaining the same as the original. I anticipate there’s a continuous relationship between how much of the reasoning trace is preserved and how likely the model is to maintain its original recommendation.
I am emphasizing that to me, this current mediation learning looks fragile and temporary, and is not a solid, long-term ‘natural’ thing—it is learning, but only as a temporary artificial heuristic that would wash away in the long run with more training or more diverse tasks etc.
My expectation is that in the limit, a model will learn to focus only on the keyword in the prompt, which is maximally robust to all variation and noise in the rationalization, and ensures 100% correct performance, while any reliance on the rationalizations will potentially go wrong if it samples badly or it slightly misinterprets the cached hints.
So the fact that there is any mediation is simply an accident of the current setup and finite-sample regime, where you chose to stop, and should not be over-interpreted as “we have studied how models naturally learn in this RL setting and proven that they do the task by caching hints in the reasoning traces”.
And this predicts that if you add noise or train longer or etc, you’ll see less mediation as it more quickly converges on the true optimal solution. Which is important because if you try to apply this logic to larger models or more realistic scenarios because you expect all these tricks to be mediated through the reasoning traces, you may be disappointed, and if you’re basing inferences on this, you may be wrong. (“We swapped out the reasoning traces, which we know mediate any such deception or covert reasoning [see previous work], and it was fine. Therefore, it’s not doing anything like that.”)
LMAO. Very funny to see it switching its rationalization. I wonder what else you could do this with? Feels like an interesting trick to observe such ‘self-correction’.
That’s interesting because it suggests the cached hints are relatively evenly distributed, rather than focused on, say, a single token near the beginning. That implies you should be able to easily find the hidden decision latent: simply look for a latent variable in the activations which varies continuously with the length of the rationalizations and increases/decreases over the course of the rationalization to a threshold which produces an accept/reject output (like drift-diffusion). And then you can look for the tokens which most change the latent to inspect how exactly the hints are being cached. (It’s probably nothing too impressive: a bag-of-words sufficient statistic #-count on words like “precarious” or “deficit”, or “unexpected”, say. But worth finding for the insight.)