Possibly the method will underestimate parameter count as time goes on. I don’t expect it to be economically valuable to pretrain on the very long tails of knowledge, as opposed to letting more bits flow in from synthetic data / RLVR. Though I’m surprised this hasn’t already happened.