I haven’t seen careful analysis of LLMs (probably because they’re newer, so harder to fit a trend), but eyeballing it… Chinchilla by itself must have been a factor-of-4 compute-equivalent improvement at least.
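As a rough sanity check on that eyeball estimate, here's a small sketch of how one might back out a compute-equivalent multiplier. It plugs the published parametric loss fit from the Chinchilla paper (Hoffmann et al. 2022) into the standard C ≈ 6ND FLOP approximation, takes a GPT-3-style run (175B parameters on 300B tokens) as the pre-Chinchilla baseline, and asks how much less compute a compute-optimal allocation would need to match that run's predicted loss. The baseline choice, the fitted constants, and the direction of comparison are all assumptions, so treat whatever multiplier it prints as a ballpark, not a measurement.

```python
# Sketch: rough compute-equivalent gain from Chinchilla-optimal allocation.
# Uses the parametric loss fit from Hoffmann et al. (2022),
#   L(N, D) = E + A / N**alpha + B / D**beta,
# with their published constants, plus the standard C ~= 6*N*D FLOP estimate.
# The GPT-3-style baseline (175B params, 300B tokens) is an illustrative
# assumption; the multiplier is sensitive to both the fit and the baseline.

E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    """Chinchilla parametric estimate of pretraining loss."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def optimal_loss(compute: float, grid: int = 2000) -> float:
    """Best loss achievable at a fixed FLOP budget, sweeping the N/D split."""
    best = float("inf")
    for i in range(1, grid):
        n = 10 ** (7 + 7 * i / grid)   # sweep params over 1e7 .. 1e14
        d = compute / (6 * n)          # tokens implied by C ~= 6*N*D
        best = min(best, loss(n, d))
    return best

# Baseline: a GPT-3-style run (175B params on 300B tokens).
n_base, d_base = 175e9, 300e9
c_base = 6 * n_base * d_base
l_base = loss(n_base, d_base)

# Bisect for the smallest budget at which a compute-optimal run matches it.
lo, hi = c_base / 100, c_base
for _ in range(60):
    mid = (lo * hi) ** 0.5             # geometric midpoint of the bracket
    if optimal_loss(mid) <= l_base:
        hi = mid
    else:
        lo = mid

print(f"baseline predicted loss : {l_base:.3f}")
print(f"compute-equivalent gain : {c_base / hi:.1f}x")
```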
Ok, but discovering the Chinchilla scaling laws is a one-time boost to training efficiency. You shouldn't expect to repeatedly get 4x improvements just because you observed that one.
Every algorithmic improvement is a one-time boost.