Both you and Peter have pointed out that one of the cruxes here is how much compute is needed for testing.
I agree that if the process could only produce algorithmic improvements so weak and subtle that their advantage could only be clearly distinguished at the scale of a full multimillion-dollar training run, then RSI would likely not take off.
I expect, though, that the process I describe would find strong improvements: ones whose advantage is already obvious in a 100k-parameter run and continues to show clearly at 1 million, 10 million, 100 million, 1 billion, 10 billion parameters, and so on.
In that case, the extrapolation becomes a safe bet, and the compute needed for parallel testing is much lower, since you only need to run the small models to figure out which improvements are worth scaling.
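To make that screening step concrete, here is a minimal sketch of the kind of comparison I have in mind: run a baseline recipe and a candidate improvement across a ladder of small model sizes, check that the candidate wins at every rung, fit power-law scaling curves, and extrapolate before committing to an expensive run. Everything here is illustrative and hypothetical: `train_and_eval` stands in for real training jobs (it just synthesizes losses from assumed power laws so the sketch runs end to end), and the sizes and exponents are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for actual small-scale training runs: synthesize losses from
# assumed power laws (the candidate gets a slightly better exponent) so the
# sketch is runnable. In practice this would launch real training jobs.
def train_and_eval(num_params: int, recipe: str) -> float:
    a, b = (12.0, 0.072) if recipe == "baseline" else (12.0, 0.080)
    return a * num_params ** (-b) * (1 + rng.normal(0, 0.01))

# Cheap screening ladder of model sizes.
SIZES = [1e5, 1e6, 1e7, 1e8, 1e9]

def fit_power_law(sizes, losses):
    """Fit loss ≈ a * N^(-b), i.e. log L = log a - b * log N."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
    return np.exp(intercept), -slope

base = [train_and_eval(int(n), "baseline") for n in SIZES]
cand = [train_and_eval(int(n), "candidate") for n in SIZES]

# A "strong" improvement should win at every rung of the ladder, not just
# on average; otherwise extrapolating to the big run is not a safe bet.
wins_everywhere = all(c < b for c, b in zip(cand, base))

a0, b0 = fit_power_law(SIZES, base)
a1, b1 = fit_power_law(SIZES, cand)
target = 1e10  # scale of the expensive run we want to avoid wasting

print(f"candidate better at every screening size: {wins_everywhere}")
print(f"extrapolated loss at 10B params: "
      f"baseline {a0 * target ** (-b0):.3f}, "
      f"candidate {a1 * target ** (-b1):.3f}")
```

The point of the sketch is just that the expensive part of the loop (the 10B-scale run) never has to be spent on losers: only candidates that win consistently across the cheap ladder, and whose fitted curves stay ahead when extrapolated, get scaled up.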