It’s not so much that we didn’t think models plan ahead in general, as that we had various hypotheses (including “unknown unknowns”), and this kind of planning in poetry wasn’t obviously the most likely one until we saw the evidence.
[More generally: in Interpretability we often have the experience of being surprised by the specific mechanism a model is using, even though with the benefit of hindsight it seems obvious. E.g., when we did the work for Towards Monosemanticity, we were initially quite surprised to see the “the in <context>” features, thought they were indicative of a bug in our setup, and had to spend a while thinking about them and poking around before we realized why the model wanted them (which now feels obvious).]