I agree with all of this! But I’m not sure I understand what you mean by “there may be mediation, but only in a weak sense”. We were just interested in studying how models naturally learn in this RL setting
I am emphasizing that to me, this current mediation learning looks fragile and temporary, not a solid, long-term ‘natural’ thing—it is learning, but only a temporary artificial heuristic that would wash away in the long run with more training, more diverse tasks, etc.
My expectation is that in the limit, a model will learn to focus only on the keyword in the prompt, which is maximally robust to all variation and noise in the rationalization, and ensures 100% correct performance, while any reliance on the rationalizations will potentially go wrong if it samples badly or it slightly misinterprets the cached hints.
So the fact that there is any mediation is simply an accident of the current setup and finite-sample regime, where you chose to stop, and should not be over-interpreted as “we have studied how models naturally learn in this RL setting and proven that they do the task by caching hints in the reasoning traces”.
And this predicts that if you add noise, train longer, etc., you’ll see less mediation as the model converges faster on the true optimal solution. This matters because if you apply this logic to larger models or more realistic scenarios, expecting all such tricks to be mediated through the reasoning traces, you may be disappointed; and if you’re basing inferences on that expectation, you may be wrong. (“We swapped out the reasoning traces, which we know mediate any such deception or covert reasoning [see previous work], and it was fine. Therefore, it’s not doing anything like that.”)
...While their profession is stable, the overall financial health and potential for unexpected expenses could pose a risk. However, the income level is relatively high, which could support the ability to make loan payments... accept
LMAO. Very funny to see it switching its rationalization. I wonder what else you could do this with? Feels like an interesting trick to observe such ‘self-correction’.
I also tried a more extreme version where I delete the second half of each reasoning trace (leaving the first ~150 reasoning tokens out of ~300) and let the model generate from there. This resulted in ~37% of recommendations remaining the same as the original. I anticipate there’s a continuous relationship between how much of the reasoning trace is preserved and how likely the model is to maintain its original recommendation.
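The truncate-and-regenerate measurement could be sketched as follows; `toy_model` here is a hypothetical stand-in for the actual decoding call (which would regenerate the rest of the trace and emit a recommendation), and the word lists are made up for illustration:

```python
def truncate_trace(tokens, keep_frac):
    """Keep only the leading fraction of a reasoning trace (e.g. 150/300 tokens)."""
    return tokens[: int(len(tokens) * keep_frac)]

def agreement_rate(traces, original_recs, regenerate, keep_frac=0.5):
    """Fraction of examples whose recommendation survives truncate-and-regenerate.

    `regenerate` stands in for the model: given a (possibly truncated) trace,
    it returns an 'accept'/'reject' recommendation.
    """
    kept = 0
    for trace, orig in zip(traces, original_recs):
        if regenerate(truncate_trace(trace, keep_frac)) == orig:
            kept += 1
    return kept / len(traces)

# Toy stand-in model: reject iff it sees at least two assumed "risk" hint words.
RISK_WORDS = {"precarious", "deficit", "unexpected"}
def toy_model(trace):
    return "reject" if sum(t in RISK_WORDS for t in trace) >= 2 else "accept"

traces = [
    ["income", "stable", "unexpected", "expenses", "deficit", "risk"],
    ["profession", "stable", "income", "high", "payments", "fine"],
]
orig = [toy_model(t) for t in traces]
print(agreement_rate(traces, orig, toy_model, keep_frac=0.5))
```

Sweeping `keep_frac` from 0 to 1 would trace out the continuous relationship conjectured above.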
That’s interesting because it suggests the cached hints are relatively evenly distributed, rather than focused on, say, a single token near the beginning. That implies you should be able to easily find the hidden decision latent: simply look for a latent variable in the activations which varies continuously with the length of the rationalizations and increases/decreases over the course of the rationalization to a threshold which produces an accept/reject output (like drift-diffusion). And then you can look for the tokens which most change the latent to inspect how exactly the hints are being cached. (It’s probably nothing too impressive: a bag-of-words sufficient statistic, like a count of words such as “precarious”, “deficit”, or “unexpected”, say. But worth finding for the insight.)
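A minimal sketch of that drift-diffusion picture, assuming the cached hint really is a running tally of hint words (the word lists and threshold are invented for illustration; against a real model the hand-coded score would be replaced by a probe on the activations):

```python
# Hypothetical drift-diffusion read-out: an evidence score accumulated over the
# rationalization, compared against a threshold to yield accept/reject.
RISK_WORDS = {"precarious", "deficit", "unexpected"}   # assumed hint words
SAFE_WORDS = {"stable", "high", "support"}             # assumed counter-evidence

def decision_trajectory(tokens):
    """Evidence score after each token: +1 for a risk word, -1 for a safe word."""
    score, path = 0, []
    for tok in tokens:
        if tok in RISK_WORDS:
            score += 1
        elif tok in SAFE_WORDS:
            score -= 1
        path.append(score)
    return path

def recommend(tokens, threshold=1):
    """Reject iff accumulated risk evidence ends at or above the threshold."""
    path = decision_trajectory(tokens)
    return "reject" if path and path[-1] >= threshold else "accept"

trace = "while their profession is stable unexpected expenses pose a deficit risk".split()
print(decision_trajectory(trace))  # evidence drifts upward as risk words appear
print(recommend(trace))
```

The tokens where the trajectory jumps are exactly the candidates for where hints are cached, which is how you’d inspect the caching mechanism.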