It may be that a 23x improvement is close to the limit for GPT-2 124M at this loss level, but I would guess that for larger models and lower losses, >23x improvements are possible. There are many algorithmic improvements (e.g. MoEs) that the speedrun doesn't use because they only pay off at larger scale and compute budgets.
They do have a GPT-2 medium track, which has improved 20.0x, from 5.8 hours to 17.35 minutes. My guess is the speedup isn't greater because the scale is only slightly larger (350M parameters, a 2.8x increase, versus the ~1000x gap to current frontier models) and because less effort has been applied. Nevertheless, someone should try porting improvements from other open-source models to this track and see whether they can push the ratio past 23x.
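As a quick sanity check of the ratios quoted above (using the figures from the text, not independent measurements):

```python
# Sanity-check the GPT-2 medium speedrun ratios quoted in the text.
baseline_minutes = 5.8 * 60   # original training time: 5.8 hours
record_minutes = 17.35        # current speedrun record

speedup = baseline_minutes / record_minutes
print(f"GPT-2 medium speedup: ~{speedup:.0f}x")  # ~20x, matching the 20.0x quoted

params_small = 124e6          # GPT-2 small parameter count
params_medium = 350e6         # GPT-2 medium parameter count
print(f"Parameter scale-up: {params_medium / params_small:.1f}x")  # ~2.8x
```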