Some observations (not particularly constructive):
Training compute is relevant for the most compute-hungry models, because there it can be a taut constraint (and even then only when there is inference hardware to serve the model once it's trained, which isn't always the case). For smaller and catch-up models, other constraints become more relevant, and even the compute that matters for making them well is no longer training compute but research compute, or the compute that went into earlier or larger models that contributed to making the smaller model possible.
Inference cost/speed doesn't have this issue for smaller models; it remains both relevant and legible. Research compute, and the compute for the implicit predecessor dependencies needed to develop smaller models, are still relevant to the reproduce-much-cheaper hypothetical, but they are far less legible.
Measuring parity based on benchmarks is suspect (even if pragmatically it's hard to use anything else): a big confused pre-RLVR model and a small competition-minded post-RLVR model will be doing things differently.