This is interesting! I think it would be worthwhile to check the robustness of your conclusions, e.g. by focusing on modular addition and trying larger/smaller models or varying the weight on the regularization term.
The spikes can be understood using central flows; you may want to plot the top Hessian eigenvalues to clean the plot up:
https://centralflows.github.io/
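Not from the central flows write-up itself, but as a minimal sketch of one common way to track the top Hessian eigenvalue during training: power iteration on Hessian-vector products, here approximated by finite differences of the gradient (the function and loss below are hypothetical stand-ins):

```python
import numpy as np

def top_hessian_eig(grad_fn, w, iters=50, eps=1e-4, seed=0):
    """Estimate the largest Hessian eigenvalue at w by power iteration.

    Uses finite-difference Hessian-vector products:
        Hv ~= (grad(w + eps*v) - grad(w - eps*v)) / (2*eps)
    so you never need to form the Hessian explicitly.
    """
    rng = np.random.default_rng(seed)
    v = rng.normal(size=w.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)
        lam = float(v @ hv)                       # Rayleigh quotient estimate
        v = hv / (np.linalg.norm(hv) + 1e-12)
    return lam

# Sanity check on a quadratic loss 0.5 * w^T A w, whose Hessian is A:
A = np.diag([3.0, 1.0, 0.5])
lam = top_hessian_eig(lambda w: A @ w, np.ones(3))
print(lam)  # converges to the top eigenvalue, 3.0
```

Calling this at each logged step (with your model's gradient function in place of the toy `grad_fn`) gives the eigenvalue curve to plot alongside the loss spikes.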
Feedback on the plots: I would find them easier to read if training and validation were drawn as a single continuous line, i.e. dashed at the beginning for training loss, then switching to solid for validation, so you can follow the full path of the model.
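The dashed-then-solid suggestion above can be sketched in matplotlib like this (the loss curves and the switch point are hypothetical placeholders for your actual run):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical loss curves standing in for a real training run.
steps = np.arange(200)
train = np.exp(-steps / 50) + 0.05
val = np.exp(-np.maximum(steps - 80, 0) / 40) * 0.9 + 0.1

switch = 100  # hypothetical step where the line hands off train -> val

fig, ax = plt.subplots()
# One visual path per model, in one color: dashed while showing the
# training loss, then solid for the validation loss after `switch`.
ax.plot(steps[:switch], train[:switch], linestyle="--", color="C0",
        label="train (dashed)")
ax.plot(steps[switch:], val[switch:], linestyle="-", color="C0",
        label="val (solid)")
ax.set_xlabel("step")
ax.set_ylabel("loss")
ax.legend()
fig.savefig("loss_path.png")
```

Using the same color for both segments (via matplotlib's `C0` cycle color) is what makes the two segments read as one path per model.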
Some links you might find useful: https://www.lesswrong.com/posts/jusq6kyZ6XSrtW3Bf/paper-summary-omnigrok-grokking-beyond-algorithmic-data
https://openreview.net/pdf?id=9XFSbDPmdW