I thought “A Theory of Deep Learning” by Elon Litman was interesting, particularly its per-parameter update rule: “if the batch signal on a parameter exceeds its leave-one-out noise, update it; if not, skip it”. The claim is that this accelerates grokking by 5x, among other things. Unfortunately, when I tried it on a multi-step reasoning task, it made the model significantly worse at grokking and much more likely to memorize a composed lookup table. In my basic experiment, the model learned a multi-step algorithm 100% of the time with plain AdamW and 0% of the time with the update rule added. Claude has some opinions about why the rule is counterproductive for grokking multi-step algorithms, but I don’t really understand them; I just thought this was an interesting data point.
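For concreteness, here’s a rough sketch of one way such a gate could be bolted onto an ordinary PyTorch training step. This is my own reading of the rule, not the paper’s implementation or the exact code I ran: I’m assuming “batch signal” means the per-parameter mean gradient over the batch, and “leave-one-out noise” means roughly the standard error of that mean estimated from per-example gradients; the function name and everything else here is made up for illustration.

```python
import torch

def gated_step(model, optimizer, loss_fn, batch):
    """One update where per-parameter gradients are zeroed wherever the
    batch signal does not exceed a noise estimate (my reading of the rule)."""
    xs, ys = batch
    n = xs.shape[0]
    params = [p for p in model.parameters() if p.requires_grad]

    # Accumulate per-example gradient sums and sums of squares.
    # One backward pass per example: slow and naive, but simple to follow.
    g_sum = [torch.zeros_like(p) for p in params]
    g_sqsum = [torch.zeros_like(p) for p in params]
    for i in range(n):
        model.zero_grad()
        loss_fn(model(xs[i:i + 1]), ys[i:i + 1]).backward()
        for j, p in enumerate(params):
            if p.grad is not None:
                g_sum[j] += p.grad
                g_sqsum[j] += p.grad ** 2

    model.zero_grad()
    for j, p in enumerate(params):
        mean = g_sum[j] / n                       # "batch signal"
        var = (g_sqsum[j] / n - mean ** 2).clamp_min(0)  # per-example variance
        # Standard error of the batch mean, used as the "leave-one-out noise"
        # scale. This particular estimator is an assumption on my part.
        noise = (var / (n - 1)).sqrt()
        mask = mean.abs() > noise                 # update only where signal beats noise
        p.grad = torch.where(mask, mean, torch.zeros_like(mean))

    optimizer.step()  # e.g. torch.optim.AdamW(model.parameters(), lr=1e-3)
```

The gate sits entirely in the gradients, so the optimizer underneath (AdamW in my experiment) is untouched; whether the paper intends the mask to interact with the Adam moments differently, I honestly don’t know.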