It would be interesting to see how these change over the course of training. AFAIK the GPT-2 models don't have publicly released intermediate checkpoints, but e.g. Pythia does, and it also covers an even broader range of parameter sizes than GPT-2.
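In case it helps anyone try this, here is a rough sketch of how one might enumerate the Pythia checkpoints. It assumes the checkpoints are exposed as Hugging Face Hub revisions named `step0`, `step1`, ..., `step143000` (as I recall from the Pythia README, so double-check there) and that `transformers` is installed. The helper names `pythia_revisions` and `load_checkpoint` are made up for illustration, and the default model name is just the smallest model in the suite.

```python
def pythia_revisions():
    """Revision (branch) names assumed for Pythia's intermediate checkpoints."""
    # Assumption: step 0, then log-spaced steps 1, 2, 4, ..., 512 ...
    log_spaced = [0] + [2 ** i for i in range(10)]
    # ... then one checkpoint every 1000 steps up to step 143000.
    linear = list(range(1000, 143_001, 1000))
    return [f"step{s}" for s in log_spaced + linear]


def load_checkpoint(revision, model_name="EleutherAI/pythia-70m"):
    """Load one intermediate checkpoint from the Hub (downloads the weights)."""
    # Imported lazily so pythia_revisions() works without transformers installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(model_name, revision=revision)
```

To keep downloads manageable, one would probably subsample, e.g. `for rev in pythia_revisions()[::20]: model = load_checkpoint(rev)`, and recompute the quantity of interest at each step.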