Unless by “shrugs” you mean the details of what the partial hypothesis says in this particular case are still being worked out.
Yes, that’s what I mean.
I do agree that it’s useful to know whether a partial hypothesis says anything or not; overall I think this is good info to know / ask for. I think I came off as disagreeing more strongly than I actually did, sorry about that.
Do you have any plans to do this?
No, we’re moving on to other work: this took longer than we expected, and was less useful for alignment than we hoped (though that part wasn’t that unexpected, from the start we expected “science of deep learning” to be more hits-based, or to require significant progress before it actually became useful for practical proposals).
How much time do you think it would take?
Actually running the experiments should be pretty straightforward: I'd expect we could do them in a week given our codebase, possibly even a day. Others would first need some time to set up a good codebase, but I'd still be surprised if it took a strong engineer longer than two weeks to get some initial results. This gets you observations like "under the particular settings we chose, D_crit tends to increase / decrease as the number of layers increases".
The hard part is then interpreting those results and turning them into something more generalizable, which includes handling confounds. For example, maybe for some reason the principled thing to do is to reduce the learning rate as you increase the number of layers, and once you do that, your observation reverses. That's a totally made-up example, but it illustrates the kind of annoying thing that comes up when doing this sort of research and prevents you from saying anything general. I don't know how long it would take if you want to include that; it could be quite a while (e.g. months or years).
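To make the shape of that experiment concrete, here's a minimal sketch (my framing, not the authors' actual codebase): estimate D_crit, the smallest dataset size at which a given configuration generalizes, by binary search, then compare across depths. The `generalizes` predicate is a made-up stand-in for a real training run, included purely so the search logic is runnable.

```python
# Hypothetical sketch of a D_crit-vs-depth sweep. Everything numeric here is
# fabricated for illustration; the real trend is what the experiment measures.

def estimate_d_crit(generalizes, lo=0, hi=10_000, tol=1):
    """Binary-search the smallest dataset size where `generalizes` is True.

    Assumes generalization is monotone in dataset size -- real runs may
    violate this (noise, non-monotone behavior), which is one of the
    confounds the surrounding discussion is worried about.
    """
    while hi - lo > tol:
        mid = (lo + hi) // 2
        if generalizes(mid):
            hi = mid
        else:
            lo = mid
    return hi

# Dummy stand-in: pretend deeper networks need linearly more data.
for n_layers in (1, 2, 4):
    dummy_run = lambda n, k=n_layers: n >= 500 * k  # fake "training run"
    print(n_layers, estimate_d_crit(dummy_run))
# -> 1 500 / 2 1000 / 4 2000
```

A real sweep would also have to decide how other hyperparameters (like learning rate) co-vary with depth, which is exactly the confound-handling the text flags as the hard, slow part.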
And do you have any predictions for what should happen in these cases?
Not really. I’ve learned from experience not to try to make quantitative predictions yet. We tried to make some theory-inspired quantitative predictions in the settings we studied, and they fell pretty flat.
For example, in our minimal model in Section 3 we have a hyperparameter κ that determines how param norm and logits scale together—initially, that was our guess of what would happen in practice (i.e. we expected circuit param norm and circuit logits to obey a power-law relationship in actual grokking settings). But basically every piece of evidence we got seemed to falsify that hypothesis (e.g. Figure 3 in the paper).
(I say “seemed to falsify” because it’s still possible that we’re just failing to deal with confounders in some way, or measuring something that isn’t exactly what we want to measure. For example, Figure 3 logits are not of the Mem circuit in actual grokking setups, but rather the logits produced by networks trained on random labels—maybe there’s a relevant difference between these.)
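For concreteness, here is one way to write down that hypothesized power law (my formalization of the claim, not necessarily the paper's exact statement). If a circuit's output is κ-homogeneous in its parameters θ, then rescaling the weights rescales the logits as a power of the scale factor, so along a rescaling of a fixed circuit, logit magnitude and parameter norm lie on a line of slope κ in log–log space:

```latex
f(\alpha\theta; x) = \alpha^{\kappa} f(\theta; x)
\quad\Longrightarrow\quad
|\mathrm{logits}| \propto \|\theta\|^{\kappa},
\qquad
\log |\mathrm{logits}| = \kappa \log \|\theta\| + c .
```

Under this reading, a falsification like Figure 3 would show up as the measured (log param norm, log logit) points failing to follow a single straight line.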