See also this post by Quintin Pope: https://www.lesswrong.com/posts/JFibrXBewkSDmixuo/hypothesis-gradient-descent-prefers-general-circuits
I think that post has a lot of good ideas, e.g. the idea that generalizing circuits get reinforced by SGD more than memorizing circuits at least rhymes with what we claim is actually going on (that generalizing circuits are more efficient at producing strong logits with small param norm). We probably should have cited it; I forgot it existed.
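The efficiency claim in the parenthetical can be sketched with a toy loss comparison (my own illustration, not from the paper; the two-class setup and all numbers are made up):

```python
import math

# Toy sketch: two circuits produce the same correct logit for a 2-class
# problem, but with different parameter norms. With weight decay in the
# loss, the circuit that achieves that logit with smaller param norm has
# lower total loss, so training pressure favors it -- the sense in which
# generalizing circuits are "more efficient".

def total_loss(correct_logit, param_norm, weight_decay=0.01):
    # cross-entropy of softmax([correct_logit, 0]) on the correct class,
    # plus an L2 penalty on the parameters
    ce = math.log(1 + math.exp(-correct_logit))
    return ce + weight_decay * param_norm ** 2

memorizing = total_loss(correct_logit=5.0, param_norm=10.0)   # large norm
generalizing = total_loss(correct_logit=5.0, param_norm=3.0)  # small norm
assert generalizing < memorizing  # same logits, smaller norm, lower loss
```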
But it is ultimately a different take and one that I think ends up being wrong (e.g. I think it would struggle to explain semi-grokking).
I also think my early explanation, which that post compares itself to, is basically as good as or better in hindsight, e.g.:
My early explanation says that memorization produces less confident probabilities than generalization. Quintin’s post explicitly calls this out as a difference, and I continued to endorse my position in the comments. In hindsight my explanation was right and Quintin’s was wrong, at least if you believe our new paper.
My early explanation relies on the assumption that there are no “intermediate” circuits, only a pure memorization circuit and a pure generalization circuit that you can interpolate between. Again, this is called out as a difference by Quintin’s post, and I continued to endorse my position in the comments. Again, I think in hindsight my explanation was right and Quintin’s was wrong (though this is less clearly implied by our paper, and I could imagine future evidence overturning that conclusion, even if I really don’t expect that to happen).
On the other hand, my early explanation involves a random walk to the generalizing circuit, whereas in reality it develops smoothly over time. In hindsight, my explanation was wrong and Quintin’s was correct.