I think this is a great strategy! In particular, older LLMs are much less efficient at inference, so this also wastes the scaling labs' compute.
Thanks for the comment! I'm hoping to get more feedback on this over time, as I still have some technical questions about how to actually pull this off, as well as theoretical questions about whether this would be a good strategy or whether it would be counterproductive. :)