Honestly I’d be surprised if you could achieve (2) even with explicit regularization, specifically on the modular addition task.
(You can achieve it by initializing the token embeddings to those of a grokked network so that the representations are appropriately structured; I’m not allowing things like that.)
EDIT: Actually, Omnigrok does this by constraining the parameter norm. I suspect this mostly works by making it very difficult for the network to strongly memorize the data: given the weight decay setting, the network "tries" to learn a high-parameter-norm memorizing solution but repeatedly runs into the parameter norm constraint, which creates a very strong reason for the network to learn the generalizing algorithm instead. But that should still count as normal regularization.
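For concreteness, here's a minimal PyTorch sketch of the dynamic I have in mind. The toy architecture, the `constrain_param_norm` helper, and the specific norm and weight-decay values are all my own illustrative choices, not Omnigrok's actual code: after each optimizer step the parameters get rescaled back to a fixed global norm, so the high-norm memorizing solution stays out of reach while weight decay keeps pushing toward something simpler.

```python
import torch
import torch.nn as nn

P = 113  # modulus for the modular addition task: predict (a + b) mod P

# Illustrative toy model (not Omnigrok's architecture): embed both
# operands, concatenate, and run an MLP to predict the sum mod P.
class ModAddNet(nn.Module):
    def __init__(self, p=P, d=128):
        super().__init__()
        self.embed = nn.Embedding(p, d)
        self.mlp = nn.Sequential(nn.Linear(2 * d, 256), nn.ReLU(), nn.Linear(256, p))

    def forward(self, a, b):
        x = torch.cat([self.embed(a), self.embed(b)], dim=-1)
        return self.mlp(x)

def constrain_param_norm(model, target_norm):
    # Rescale all parameters so their global L2 norm equals target_norm.
    # This global rescaling is my guess at "constraining the parameter
    # norm"; whether it's applied globally or per-layer is an assumption.
    with torch.no_grad():
        norm = torch.sqrt(sum(p.pow(2).sum() for p in model.parameters()))
        for p in model.parameters():
            p.mul_(target_norm / norm)

model = ModAddNet()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

# Random subset of the (a, b) pairs as training data.
a = torch.randint(0, P, (512,))
b = torch.randint(0, P, (512,))
y = (a + b) % P

for step in range(1000):
    loss = nn.functional.cross_entropy(model(a, b), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    constrain_param_norm(model, target_norm=10.0)  # the projection step
```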
If you train on infinite data, I assume you wouldn't see a delay between train and test accuracy, but you would expect a non-monotonic accuracy curve that looks kind of like the test accuracy curve from the finite-data regime? So I assume infinite data is also cheating?
I expect a delay even in the infinite data case, I think?
Although I'm not quite sure what you mean by "infinite data" here: if the argument is that every data point will have been seen during training, then I agree there won't be any delay. But yes, training on the test set (even via "we train on everything, so there is no possible test set") counts as cheating for this purpose.