It may be that a 23x improvement is close to the limit for GPT-2 124M at this loss level, but I would guess that for larger models and lower losses, >23x improvements are possible. There are many algorithmic improvements (e.g. MoEs) that the speedrun doesn't use because they only pay off at larger scale and compute budgets.
They do have a GPT-2 medium track, which has improved 20.0x, from 5.8 hours to 17.35 minutes. My guess is the speedup isn't greater because the scale is only slightly larger (350M parameters, a 2.8x increase, versus the ~1000x gap to current frontier models) and because less effort has been applied. Nevertheless, someone should try porting improvements from other open-source models to this track and see whether they can push the ratio past 23x.
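As a quick sanity check of the ratios quoted above (using the figures from the text, not independent measurements):

```python
# Sanity-check the GPT-2 medium speedrun ratios quoted in the text.
baseline_minutes = 5.8 * 60   # original training time: 5.8 hours
record_minutes = 17.35        # current speedrun record

speedup = baseline_minutes / record_minutes
print(f"GPT-2 medium speedup: ~{speedup:.0f}x")  # ~20x, matching the 20.0x quoted

params_small = 124e6          # GPT-2 small parameter count
params_medium = 350e6         # GPT-2 medium parameter count
print(f"Parameter scale-up: {params_medium / params_small:.1f}x")  # ~2.8x
```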