Am I right in thinking that, according to your theory, the “fix” they did (restarting training from a checkpoint 100 steps before the spike started, but with different data, to avoid the spike) is actually counterproductive because it’s preventing the model from grokking? And that instead they should have just kept training to “push through the spike” and get to a new, lower-loss regime?
Now I’m not saying it’s anthropic pressure, but if that’s true maybe we shouldn’t just keep training until we know what exactly it is that the model is grokking.
Whatever is happening, I’m really concerned about the current “sufficiently big model starts to exhibit <weird behaviour A>. I don’t understand, but also don’t care, here is a dirty workaround and just give it more compute lol” paradigm. I don’t think this is very safe.
If I could get people to change that paradigm, you bet I would.